Re: [PATCH net] netfilter: nf_conntrack: Use net_mutex for helper unregistration.
On 6 May 2016 at 04:03, Pablo Neira Ayuso wrote:
> Hi Joe,
>
> On Thu, May 05, 2016 at 03:50:37PM -0700, Joe Stringer wrote:
>> diff --git a/net/netfilter/nf_conntrack_helper.c
>> b/net/netfilter/nf_conntrack_helper.c
>> index 3b40ec575cd5..6860b19be406 100644
>> --- a/net/netfilter/nf_conntrack_helper.c
>> +++ b/net/netfilter/nf_conntrack_helper.c
>> @@ -449,10 +449,10 @@ void nf_conntrack_helper_unregister(struct
>> nf_conntrack_helper *me)
>> 	 */
>> 	synchronize_rcu();
>>
>> -	rtnl_lock();
>> +	mutex_lock(&net_mutex);
>> 	for_each_net(net)
>> 		__nf_conntrack_helper_unregister(me, net);
>> -	rtnl_unlock();
>> +	mutex_unlock(&net_mutex);
>
> This simple solution works because we have no .exit callbacks in any
> of our helpers. Otherwise, the helper code may be already gone by when
> the worker has a chance to run to release the netns.
>
> If so, probably I can append this as comment to this function so we
> don't forget. If we ever have .exit callbacks (I don't expect so), we
> would need to wait for worker completion.

Hi Pablo,

Did you want me to re-spin this patch or look into another approach?
[PATCH] asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
In testing with HiKey, we found that since commit 3f30b158eba5 ("asix: On
RX avoid creating bad Ethernet frames"), we're seeing lots of noise during
network transfers:

[ 239.027993] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 988
[ 239.037310] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x54ebb5ec, offset 4
[ 239.045519] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0xcdffe7a2, offset 4
[ 239.275044] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 988
[ 239.284355] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x1d36f59d, offset 4
[ 239.292541] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0xaef3c1e9, offset 4
[ 239.518996] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 988
[ 239.528300] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x2881912, offset 4
[ 239.536413] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x5638f7e2, offset 4

And network throughput ends up being pretty bursty and slow, with an
overall throughput of at best ~30kB/s (whereas previously we got 1.1MB/s
with the slower USB1.1 "full speed" host).

We found the issue was also reproducible on an x86_64 system, using a
"high-speed" USB2.0 port, but the throughput did not measurably drop
(possibly due to the scp transfer being cpu bound on my slow test
hardware).

After lots of debugging, I found the check added in the problematic
commit seems to be calculating the offset incorrectly.

In the normal case, in the main loop of the function, we do:
(where offset is zero, or set to "offset += (copy_length + 1) & 0xfffe"
in the previous loop)
	rx->header = get_unaligned_le32(skb->data + offset);
	offset += sizeof(u32);

But the problematic patch calculates:
	offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);
	rx->header = get_unaligned_le32(skb->data + offset);

Adding some debug logic to check the offset calculations used to find
rx->header, the one in the problematic code is always too large by
sizeof(u32).

Thus, this patch removes the incorrect " + sizeof(u32)" addition in the
problematic calculation, and resolves the issue.

Cc: Dean Jenkins
Cc: "David B. Robins"
Cc: Mark Craske
Cc: Emil Goode
Cc: "David S. Miller"
Cc: YongQin Liu
Cc: Guodong Xu
Cc: Ivan Vecera
Cc: linux-...@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: stable #4.4+
Reported-by: Yongqin Liu
Signed-off-by: John Stultz
---
 drivers/net/usb/asix_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 0c5c22b..7de5ab5 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -66,7 +66,7 @@ int asix_rx_fixup_internal(struct usbnet *dev, struct sk_buff *skb,
 	 * buffer.
 	 */
 	if (rx->remaining && (rx->remaining + sizeof(u32) <= skb->len)) {
-		offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);
+		offset = ((rx->remaining + 1) & 0xfffe);
 		rx->header = get_unaligned_le32(skb->data + offset);
 		offset = 0;

-- 
1.9.1
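For readers following the offset arithmetic, here is a minimal userspace
sketch of the two calculations described in the commit message. The helper
names are hypothetical and this is not driver code; it only assumes, as the
message states, that the tail of a split packet is padded to an even length
and that the next 32-bit header follows immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a length up to an even byte count, mirroring "(len + 1) & 0xfffe". */
static size_t pad_even(size_t len)
{
	return (len + 1) & 0xfffe;
}

/*
 * Hypothetical helpers (not driver code): where the next 32-bit header
 * starts in a buffer whose first `remaining` bytes finish the previous
 * packet.
 */
static size_t header_offset_fixed(size_t remaining)
{
	return pad_even(remaining);
}

/* The calculation before the fix, with the extra sizeof(u32) added. */
static size_t header_offset_before_fix(size_t remaining)
{
	return pad_even(remaining) + sizeof(uint32_t);
}

/* Difference between the two calculations, in bytes. */
static size_t header_offset_error(size_t remaining)
{
	size_t fixed = header_offset_fixed(remaining);
	size_t old = header_offset_before_fix(remaining);

	assert(old >= fixed);
	return old - fixed;
}
```

With remaining = 988 (the value in the log above), the fixed calculation
gives 988, while the older one points four bytes further into the buffer.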
[PATCH 2/2] net: Fix coding style warnings and errors.
Clean up checkpatch warnings and errors:

* WARNING: Block comments use * on subsequent lines
* WARNING: Missing a blank line after declarations
* WARNING: networking block comments don't use an empty /* line, use /*
* ERROR: code indent should use tabs where possible
* WARNING: please, no space before tabs
* WARNING: please, no spaces at the start of a line
* WARNING: line over 80 characters
* ERROR: space prohibited after that open parenthesis '('

Signed-off-by: Amit Ghadge
---
 drivers/net/Space.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/Space.c b/drivers/net/Space.c
index 67977f1..b5e92a6 100644
--- a/drivers/net/Space.c
+++ b/drivers/net/Space.c
@@ -35,8 +35,8 @@
 #include

 /* A unified ethernet device probe. This is the easiest way to have every
-   ethernet adaptor have the name "eth[0123...]".
-   */
+ * ethernet adaptor have the name "eth[0123...]".
+ */

 struct devprobe2 {
 	struct net_device *(*probe)(int unit);
@@ -46,6 +46,7 @@ struct devprobe2 {
 static int __init probe_list2(int unit, struct devprobe2 *p, int autoprobe)
 {
 	struct net_device *dev;
+
 	for (; p->probe; p++) {
 		if (autoprobe && p->status)
 			continue;
@@ -58,8 +59,7 @@ static int __init probe_list2(int unit, struct devprobe2 *p, int autoprobe)
 	return -ENODEV;
 }

-/*
- * ISA probes that touch addresses < 0x400 (including those that also
+/*ISA probes that touch addresses < 0x400 (including those that also
  * look for EISA/PCI cards in addition to ISA cards).
  */
 static struct devprobe2 isa_probes[] __initdata = {
@@ -86,11 +86,11 @@ static struct devprobe2 isa_probes[] __initdata = {
 #endif
 #ifdef CONFIG_CS89x0
 #ifndef CONFIG_CS89x0_PLATFORM
-	{cs89x0_probe, 0},
+	{cs89x0_probe, 0},
 #endif
 #endif
-#if defined(CONFIG_MVME16x_NET) || defined(CONFIG_BVME6000_NET)/* Intel I82596 */
-	{i82596_probe, 0},
+#if defined(CONFIG_MVME16x_NET) || defined(CONFIG_BVME6000_NET)/* Intel */
+	{i82596_probe, 0},	/* I82596 */
 #endif
 #ifdef CONFIG_NI65
 	{ni65_probe, 0},
@@ -118,13 +118,12 @@ static struct devprobe2 m68k_probes[] __initdata = {
 	{mac8390_probe, 0},
 #endif
 #ifdef CONFIG_MAC89x0
-	{mac89x0_probe, 0},
+	{mac89x0_probe, 0},
 #endif
 	{NULL, 0},
 };

-/*
- * Unified ethernet device probe, segmented per architecture and
+/* Unified ethernet device probe, segmented per architecture and
  * per bus interface. This drives the legacy devices only for now.
  */

@@ -135,7 +134,7 @@ static void __init ethif_probe2(int unit)
 	if (base_addr == 1)
 		return;

-	(void)( probe_list2(unit, m68k_probes, base_addr == 0) &&
+	(void)(probe_list2(unit, m68k_probes, base_addr == 0) &&
 		probe_list2(unit, isa_probes, base_addr == 0));
 }

-- 
2.5.5
RE: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Monday, May 16, 2016 1:16
> To: Dexuan Cui
> Cc: gre...@linuxfoundation.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; cav...@redhat.com; KY
> Srinivasan; Haiyang Zhang;
> j...@perches.com; vkuzn...@redhat.com
> Subject: Re: [PATCH v11 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
>
> From: Dexuan Cui
> Date: Sun, 15 May 2016 09:52:42 -0700
>
> > Changes since v10
> >
> > 1) add module params: send_ring_page, recv_ring_page. They can be used to
> > enlarge the ringbuffer size to get better performance, e.g.,
> > # modprobe hv_sock recv_ring_page=16 send_ring_page=16
> > By default, recv_ring_page is 3 and send_ring_page is 2.
> >
> > 2) add module param max_socket_number (the default is 1024).
> > A user can enlarge the number to create more than 1024 hv_sock sockets.
> > By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
> > (Here 1+1 means 1 page for send/recv buffers per connection, respectively.)
>
> This is papering around my objections, and create module parameters which
> I am fundamentally against.
>
> You're making the facility unusable by default, just to work around my
> memory consumption concerns.
>
> What will end up happening is that everyone will simply increase the
> values.
>
> You're not really addressing the core issue, and I will be ignoring you
> future submissions of this change until you do.

David,
I am sorry I came across as ignoring your feedback; that was not my
intention. The current host side design for this feature is such that each
socket connection needs its own channel, which consists of

1. A ring buffer for host to guest communication
2. A ring buffer for guest to host communication

The memory for the ring buffers has to be pinned down as this will be
accessed both from interrupt level in Linux guest and from the host OS at
any time.

To address your concerns, I am planning to re-implement both the receive
path and the send path so that no additional pinned memory will be needed.

Receive Path:
When the application does a read on the socket, we will dynamically
allocate the buffer and perform the read operation on the incoming ring
buffer. Since we will be in the process context, we can sleep here and will
set the "GFP_KERNEL | __GFP_NOFAIL" flags. This buffer will be freed once
the application consumes all the data.

Send Path:
On the send side, we will construct the payload to be sent directly on the
outgoing ringbuffer.

So, with these changes, the only memory that will be pinned down will be
the memory for the ring buffers on a per-connection basis and this memory
will be pinned down until the connection is torn down.

Please let me know if this addresses your concerns.

Thanks,
-- Dexuan
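The memory estimate quoted in this thread ("1024 * (3+2+1+1) * 4KB = 28M
bytes") can be reproduced with a small calculation. This is only a sketch
of that arithmetic: the function name is hypothetical, and the 4 KB page
size and the two extra buffer pages per connection are taken from the
quoted v11 changelog rather than from the hv_sock code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pinned bytes for `sockets` connections, given the
 * per-connection ring sizes in pages. Assumes 4 KB pages and one extra
 * page each for the send and receive buffers, as in the quoted estimate.
 */
static uint64_t hv_sock_pinned_bytes(uint32_t sockets,
				     uint32_t recv_ring_pages,
				     uint32_t send_ring_pages)
{
	const uint64_t page_bytes = 4096;
	const uint64_t buffer_pages = 2;	/* the "1+1" in the changelog */

	/* A ring needs at least one page in each direction. */
	assert(recv_ring_pages > 0 && send_ring_pages > 0);

	return (uint64_t)sockets *
	       (recv_ring_pages + send_ring_pages + buffer_pages) * page_bytes;
}
```

With the defaults (1024 sockets, 3 and 2 ring pages) this gives 29360128
bytes, i.e. 28 MiB; with the modprobe example (16 and 16 pages) the same
formula gives 136 MiB, which illustrates the growth David is objecting to.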
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On Tue, May 17, 2016 at 11:24 AM, David Ahern wrote:
> As I mentioned we can print the unsupported once or per socket matched and
> with the socket params. e.g.,
>
> +	} else if (errno == EOPNOTSUPP) {
> +		printf("Operation not supported for:\n");
> +		inet_show_sock(h, diag_arg->f, diag_arg->protocol);
>
> Actively suppressing all error messages is just wrong. I get the flooding
> issue so I'm fine with just printing it once.

I disagree, but then I'm the one who wrote it in the first place, so you
wouldn't expect me to agree. :-)

Let's see what Stephen says.
[net-next 1/2] ixgbe: use correct mask when enabling sriov
From: Emil Tantilov

Swap the parameters in GENMASK in order to generate the correct mask.
This change fixes Tx hangs when enabling SRIOV.

Signed-off-by: Emil Tantilov
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d08fbcf..7bbf9b1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3767,9 +3767,9 @@ static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter)
 	reg_offset = (VMDQ_P(0) >= 32) ? 1 : 0;

 	/* Enable only the PF's pool for Tx/Rx */
-	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), GENMASK(vf_shift, 31));
+	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), GENMASK(31, vf_shift));
 	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset ^ 1), reg_offset - 1);
-	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), GENMASK(vf_shift, 31));
+	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), GENMASK(31, vf_shift));
 	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset ^ 1), reg_offset - 1);
 	if (adapter->bridge_mode == BRIDGE_MODE_VEB)
 		IXGBE_WRITE_REG(hw, IXGBE_PFDTXGSWC, IXGBE_PFDTXGSWC_VT_LBEN);
-- 
2.5.5
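To see why the argument order matters, here is a small userspace sketch.
genmask32() is a hypothetical 32-bit stand-in written for this note, not
the kernel's GENMASK macro; it only follows the same (high, low) argument
order that the patch restores.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 32-bit stand-in for a (high, low) bit-range mask:
 * sets bits low..high inclusive when high >= low.
 */
static uint32_t genmask32(unsigned int high, unsigned int low)
{
	/* Shifts by 32 or more would be undefined behaviour. */
	assert(high < 32 && low < 32);

	return (uint32_t)(UINT32_MAX << low) & (UINT32_MAX >> (31 - high));
}

/* Argument order after the patch: bits vf_shift..31. */
static uint32_t pf_pool_mask(unsigned int vf_shift)
{
	return genmask32(31, vf_shift);
}

/* Argument order before the patch: high and low transposed. */
static uint32_t pf_pool_mask_swapped(unsigned int vf_shift)
{
	return genmask32(vf_shift, 31);
}
```

In this sketch pf_pool_mask(8) is 0xFFFFFF00, while the swapped form
evaluates to 0 for any vf_shift below 31, so no pool bits would be set at
all.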
[net-next 0/2][pull request] 10GbE Intel Wired LAN Driver Updates 2016-05-16
This series contains 2 fixes to ixgbe only.

Emil fixes transmit hangs when enabling SRIOV by swapping the parameters
in GENMASK in order to generate the correct mask.

Alex fixes his previous patch b83e30104bd9 ("ixgbe/ixgbevf: Add support
for GSO partial") where he somehow transposed the location of setting
the VLAN features in netdev->features and the configuration of the
vlan_features.

The following are changes since commit 7e2c3aea4398d079745b9faa2c17b6cbd010f221:
  net: also make sch_handle_egress() drop monitor ready
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Alexander Duyck (1):
  ixgbe: Fix VLAN features error

Emil Tantilov (1):
  ixgbe: use correct mask when enabling sriov

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

-- 
2.5.5
[net-next 2/2] ixgbe: Fix VLAN features error
From: Alexander Duyck

It looks like at some point I somehow transposed the location of setting
the VLAN features in netdev->features and the configuration of the
vlan_features. As a result the driver is now generating a warning about
vlan_features being setup incorrectly. This patch corrects that by placing
the update of netdev->features to include the VLAN features so that it is
after the point where we write netdev->features into netdev->vlan_features.

Fixes: b83e30104bd9 ("ixgbe/ixgbevf: Add support for GSO partial")
Signed-off-by: Alexander Duyck
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7bbf9b1..9f3677c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9508,15 +9508,15 @@ skip_sriov:
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;

+	netdev->vlan_features |= netdev->features | NETIF_F_TSO_MANGLEID;
+	netdev->hw_enc_features |= netdev->vlan_features;
+	netdev->mpls_features |= NETIF_F_HW_CSUM;
+
 	/* set this bit last since it cannot be part of vlan_features */
 	netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
 			    NETIF_F_HW_VLAN_CTAG_RX |
 			    NETIF_F_HW_VLAN_CTAG_TX;

-	netdev->vlan_features |= netdev->features | NETIF_F_TSO_MANGLEID;
-	netdev->hw_enc_features |= netdev->vlan_features;
-	netdev->mpls_features |= NETIF_F_HW_CSUM;
-
 	netdev->priv_flags |= IFF_UNICAST_FLT;
 	netdev->priv_flags |= IFF_SUPP_NOFCS;

-- 
2.5.5
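The ordering issue can be illustrated with a toy model. The flag values
and function names below are hypothetical and are not the kernel's
NETIF_F_* bits; the sketch only shows that copying features into
vlan_features before the VLAN bits are added keeps those bits out of
vlan_features, while the transposed order leaks them in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits for illustration; not the kernel's values. */
#define F_HIGHDMA	(1u << 0)
#define F_TSO_MANGLEID	(1u << 1)
#define F_VLAN_CTAG	(1u << 2)	/* stands in for the HW_VLAN_CTAG_* bits */

/* Order restored by the patch: snapshot features, then add VLAN bits. */
static uint32_t vlan_features_vlan_last(uint32_t features)
{
	uint32_t vlan_features;

	/* The caller's starting features must not contain VLAN bits yet. */
	assert((features & F_VLAN_CTAG) == 0);

	vlan_features = features | F_TSO_MANGLEID;
	features |= F_VLAN_CTAG;	/* set last, as the code comment says */
	(void)features;
	return vlan_features;
}

/* Transposed order: VLAN bits are already set when the copy is taken. */
static uint32_t vlan_features_vlan_first(uint32_t features)
{
	assert((features & F_VLAN_CTAG) == 0);

	features |= F_VLAN_CTAG;
	return features | F_TSO_MANGLEID;
}
```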
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On 5/16/16 8:04 PM, Lorenzo Colitti wrote:
> Given that the filter can specify a number of sockets, some of which
> can and some of which can't be closed, and that whether a given socket
> can be closed is only known at the time we attempt to close it, there
> is a choice between two bad outcomes:
>
> 1. Users try to use "ss -K" with a kernel that doesn't support it, and
>    get confused about why it does nothing and doesn't print an error
>    message.
> 2. Users use "ss -K" with a kernel that does support it, and get
>    irritated by seeing one error message per TCP_TIME_WAIT socket, UDP
>    socket, etc.

As I mentioned we can print the unsupported once or per socket matched and
with the socket params. e.g.,

+	} else if (errno == EOPNOTSUPP) {
+		printf("Operation not supported for:\n");
+		inet_show_sock(h, diag_arg->f, diag_arg->protocol);

Actively suppressing all error messages is just wrong. I get the flooding
issue so I'm fine with just printing it once.
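One way to picture the "print it once" compromise is the sketch below. It
is not code from ss: report_kill_error() and the message text are
hypothetical, and the only behaviour assumed from the thread is that
ENOENT stays silent and EOPNOTSUPP is reported a single time no matter how
many sockets match the filter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of ss: handle the errno left by a failed
 * kill attempt. Returns true if a message was printed. `warned` records
 * whether EOPNOTSUPP has already been reported.
 */
static bool report_kill_error(int err, bool *warned)
{
	assert(warned != NULL);

	if (err == ENOENT)
		return false;		/* socket already closed: stay quiet */

	if (err == EOPNOTSUPP) {
		if (*warned)
			return false;	/* already reported once */
		*warned = true;
		fprintf(stderr, "SOCK_DESTROY: not supported for some sockets\n");
		return true;
	}

	fprintf(stderr, "SOCK_DESTROY: %s\n", strerror(err));
	return true;
}

/* Count how many messages a run of `n` identical errors would print. */
static int messages_for_repeated_error(int err, int n)
{
	bool warned = false;
	int printed = 0;

	for (int i = 0; i < n; i++)
		printed += report_kill_error(err, &warned) ? 1 : 0;
	return printed;
}
```

Five consecutive EOPNOTSUPP results print one line here, which addresses
the flooding concern while still telling the user something happened.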
Re: [PATCH 1/2] net: ethernet: fec-mpc52xx: use phydev from struct net_device
From: Philippe Reynes
Date: Tue, 17 May 2016 00:32:33 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH 2/2] net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
From: Philippe Reynes
Date: Tue, 17 May 2016 00:32:34 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH net-next] bpf, doc: fix typo on bpf_asm descriptions
From: Daniel Borkmann
Date: Mon, 16 May 2016 23:06:53 +0200

> Fix description of some of the bpf_asm tool related jump instructions
> and generally move them to format A k.
>
> Reported-by: Sebastian Amend
> Signed-off-by: Daniel Borkmann

Applied.
Re: [PATCH] stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
From: Ezequiel Garcia
Date: Mon, 16 May 2016 12:41:07 -0300

> Commit f748be531d70 ("stmmac: support new GMAC4") reverted a previous fix
> by mistake. This commit re-applies said fix:
>
> commit dec2165ff38a99f937fe61875d102c6c8596c815
> Author: Sonic Zhang
> Date: Thu Jan 22 14:55:57 2015 +0800
>
>     stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
>
>     Clear the TX COE bit when force_thresh_dma_mode is set even hardware
>     dma capability says support.
>
>     Tested on BF609.
>
>     Signed-off-by: Sonic Zhang
>     Acked-by: Giuseppe Cavallaro
>     Signed-off-by: David S. Miller
>
> Tested on LPC4350 Hitex board.
>
> Fixes: f748be531d70 ("stmmac: support new GMAC4")
> Signed-off-by: Ezequiel Garcia

Applied.
Re: [PATCH 2/2] net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
From: Philippe Reynes
Date: Mon, 16 May 2016 16:52:37 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH 1/2] net: ethernet: fs-enet: use phydev from struct net_device
From: Philippe Reynes
Date: Mon, 16 May 2016 16:52:36 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: BUG: use-after-free in netlink_dump
From: Herbert Xu
Date: Mon, 16 May 2016 17:28:16 +0800

> Subject: netlink: Fix dump skb leak/double free
>
> When we free cb->skb after a dump, we do it after releasing the
> lock. This means that a new dump could have started in the time
> being and we'll end up freeing their skb instead of ours.
>
> This patch saves the skb and module before we unlock so we free
> the right memory.
>
> Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
> Reported-by: Baozeng Ding
> Signed-off-by: Herbert Xu

Applied and queued up for -stable, thanks.
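The race described in the quoted commit message can be modelled in a few
lines. This is a single-threaded toy, not netlink code: the struct, the
"free" recorder, and the way a second dump is injected are all assumptions
made to show why the pointer has to be saved before the lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (assumptions): a shared callback slot and a free recorder. */
struct dump_cb {
	int *skb;
};

static int last_freed;

static void toy_free(int *skb)
{
	last_freed = *skb;
}

/*
 * Racy order: drop the lock first, then free whatever cb.skb points at.
 * `newcomer` models a second dump installing its buffer in that window.
 */
static int end_dump_free_after_unlock(int mine, int newcomer)
{
	struct dump_cb cb = { .skb = &mine };

	/* lock released here */
	cb.skb = &newcomer;	/* another dump starts */
	toy_free(cb.skb);	/* frees the newcomer's buffer, not ours */
	return last_freed;
}

/* Fixed order: remember our buffer while the lock is still held. */
static int end_dump_save_before_unlock(int mine, int newcomer)
{
	struct dump_cb cb = { .skb = &mine };
	int *saved = cb.skb;	/* saved under the lock */

	assert(saved != NULL);
	/* lock released here */
	cb.skb = &newcomer;	/* another dump starts */
	toy_free(saved);	/* still frees our own buffer */
	return last_freed;
}
```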
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On Tue, May 17, 2016 at 10:52 AM, David Ahern wrote:
> code is not setup to handle that. Only option seems to be at least dump an
> error message, but the message can not relate any of the specifics about the
> filter. So something like this though it dumps the message per socket
> matched by the filter. Could throttle it to once.
> [...]
> 	if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
> -		if (errno == EOPNOTSUPP || errno == ENOENT) {
> -			/* Socket can't be closed, or is already closed. */
> +		if (errno == ENOENT) {
> +			/* socket is already closed. */
> +			return 0;
> +		/* Socket can't be closed OR config is not enabled */
> +		} else if (errno == EOPNOTSUPP) {
> +			perror("SOCK_DESTROY answers");

The reason the code was written like that is that I didn't want to
print one error message for every socket that can't be closed - such
as TIME_WAIT sockets or UDP sockets. Given that the filter can specify
a number of sockets, some of which can and some of which can't be
closed, and that whether a given socket can be closed is only known at
the time we attempt to close it, there is a choice between two bad
outcomes:

1. Users try to use "ss -K" with a kernel that doesn't support it, and
   get confused about why it does nothing and doesn't print an error
   message.
2. Users use "ss -K" with a kernel that does support it, and get
   irritated by seeing one error message per TCP_TIME_WAIT socket, UDP
   socket, etc.

Personally I think it's more important to avoid #2 than #1, because #1
is one time (only if you're compiling your own kernel), but #2 is
forever. Also, I think it's consistent with other behaviours in ss -
for example, if the kernel doesn't support SOCK_DIAG for UDP, you just
get nothing back if you run "ss -u".

That said, I'm not the maintainer of this code. Stephen, any thoughts?
[PATCH 3.14 14/17] VSOCK: do not disconnect socket when peer has shutdown SEND only
3.14-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ian Campbell

[ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ]

The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell
Cc: "David S. Miller"
Cc: Stefan Hajnoczi
Cc: Claudio Imbrenda
Cc: Andy King
Cc: Dmitry Torokhov
Cc: Jorgen Hansen
Cc: Adit Ranadive
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/vmw_vsock/af_vsock.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1796,27 +1796,8 @@ vsock_stream_recvmsg(struct kiocb *kiocb
 	else if (sk->sk_shutdown & RCV_SHUTDOWN)
 		err = 0;

-	if (copied > 0) {
-		/* We only do these additional bookkeeping/notification steps
-		 * if we actually copied something out of the queue pair
-		 * instead of just peeking ahead.
-		 */
-
-		if (!(flags & MSG_PEEK)) {
-			/* If the other side has shutdown for sending and there
-			 * is nothing more to read, then modify the socket
-			 * state.
-			 */
-			if (vsk->peer_shutdown & SEND_SHUTDOWN) {
-				if (vsock_stream_has_data(vsk) <= 0) {
-					sk->sk_state = SS_UNCONNECTED;
-					sock_set_flag(sk, SOCK_DONE);
-					sk->sk_state_change(sk);
-				}
-			}
-		}
+	if (copied > 0)
 		err = copied;
-	}

 out_wait:
 	finish_wait(sk_sleep(sk), &wait);
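The client pattern the commit message refers to (send a request, call
shutdown(SHUT_WR), then wait for the reply) can be sketched as follows.
This uses an AF_UNIX socketpair as a stand-in for a vsock connection, an
assumption made purely so the example runs without a hypervisor; the
function names are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Request/reply over a half-closed stream. sv[0] plays the client and
 * sv[1] the server. Returns 0 and fills `reply` on success, -1 on failure.
 */
static int half_close_roundtrip(char *reply, size_t reply_len)
{
	int sv[2];
	char buf[16];
	ssize_t n;
	int ret = -1;

	assert(reply != NULL && reply_len > 0);

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Client: send the request, then signal "no more data from me". */
	if (write(sv[0], "ping", 4) != 4 || shutdown(sv[0], SHUT_WR) < 0)
		goto out;

	/* Server: read the request, then see EOF from the client's SHUT_WR. */
	n = read(sv[1], buf, sizeof(buf));
	if (n != 4 || read(sv[1], buf, sizeof(buf)) != 0)
		goto out;

	/* Server can still reply: only client-to-server is closed. */
	if (write(sv[1], "pong", 4) != 4)
		goto out;

	/* Client: the receive side is still usable after SHUT_WR. */
	n = read(sv[0], reply, reply_len - 1);
	if (n < 0)
		goto out;
	reply[n] = '\0';
	ret = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}

/* Convenience wrapper: 1 if the round trip succeeded with `expected`. */
static int half_close_reply_is(const char *expected)
{
	char reply[16];

	if (half_close_roundtrip(reply, sizeof(reply)) != 0)
		return 0;
	return strcmp(reply, expected) == 0;
}
```

If the server side tore down the whole connection on seeing the client's
EOF, the final read in this sketch would never receive "pong", which is
the breakage the patch removes for vsock.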
Re: [PATCH net-next] tipc: check nl sock before parsing nested attributes
From: Richard Alpe
Date: Mon, 16 May 2016 11:14:54 +0200

> Make sure the socket for which the user is listing publication exists
> before parsing the socket netlink attributes.
>
> Prior to this patch a call without any socket caused a NULL pointer
> dereference in tipc_nl_publ_dump().
>
> Tested-and-reported-by: Baozeng Ding
> Signed-off-by: Richard Alpe

Applied and queued up for -stable.
Re: [PATCH net-next] fq_codel: fix memory limitation drift
From: Eric Dumazet
Date: Sun, 15 May 2016 18:16:38 -0700

> From: Eric Dumazet
>
> memory_usage must be decreased in dequeue_func(), not in
> fq_codel_dequeue(), otherwise packets dropped by Codel algo
> are missing this decrease.
>
> Also we need to clear memory_usage in fq_codel_reset()
>
> Fixes: 95b58430abe7 ("fq_codel: add memory limitation per queue")
> Signed-off-by: Eric Dumazet

Applied.
[PATCH 4.4 31/73] VSOCK: do not disconnect socket when peer has shutdown SEND only
4.4-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ian Campbell

[ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ]

The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell
Cc: "David S. Miller"
Cc: Stefan Hajnoczi
Cc: Claudio Imbrenda
Cc: Andy King
Cc: Dmitry Torokhov
Cc: Jorgen Hansen
Cc: Adit Ranadive
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/vmw_vsock/af_vsock.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1794,27 +1794,8 @@ vsock_stream_recvmsg(struct socket *sock
 	else if (sk->sk_shutdown & RCV_SHUTDOWN)
 		err = 0;

-	if (copied > 0) {
-		/* We only do these additional bookkeeping/notification steps
-		 * if we actually copied something out of the queue pair
-		 * instead of just peeking ahead.
-		 */
-
-		if (!(flags & MSG_PEEK)) {
-			/* If the other side has shutdown for sending and there
-			 * is nothing more to read, then modify the socket
-			 * state.
-			 */
-			if (vsk->peer_shutdown & SEND_SHUTDOWN) {
-				if (vsock_stream_has_data(vsk) <= 0) {
-					sk->sk_state = SS_UNCONNECTED;
-					sock_set_flag(sk, SOCK_DONE);
-					sk->sk_state_change(sk);
-				}
-			}
-		}
+	if (copied > 0)
 		err = copied;
-	}

 out_wait:
 	finish_wait(sk_sleep(sk), &wait);
Re: [PATCH 1/2] net: ethernet: gianfar: use phydev from struct net_device
From: Philippe Reynes
Date: Mon, 16 May 2016 01:30:08 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On 5/16/16 7:20 PM, Lorenzo Colitti wrote:
> On Tue, May 17, 2016 at 10:14 AM, David Ahern wrote:
>
> For example, EOPNOTSUPP can just mean "this socket can't be closed
> because it's a timewait or NEW_SYN_RECV socket". In hindsight it might
> have been better to return EBADFD in those cases, but that still
> doesn't solve the UI problem. If the user does something like "ss -K
> dport = :443", the user would expect the command to kill all TCP
> sockets and not just abort if there happens to be a UDP socket to port
> 443 (which can't be closed because UDP doesn't currently implement
> SOCK_DESTROY).
>
>> Silently doing nothing is just as bad - or worse. I was running in circles
>> trying to figure out why nothing was happening and ss was exiting 0.
>
> At least that's documented to be the case in the man page.
>
> On the other hand, if your patch is applied, there will be no way to
> close more than one socket if one of them returns EOPNOTSUPP. On a
> busy server where things go into TIME_WAIT all the time, you might
> never be able to close all sockets.
>
> If you want to inform the user, then you could do so via the return
> value of ss - e.g., return 0 if at least one socket was printed and
> closed, or 1 otherwise.

code is not setup to handle that. Only option seems to be at least dump an
error message, but the message can not relate any of the specifics about
the filter. So something like this though it dumps the message per socket
matched by the filter. Could throttle it to once.

diff --git a/misc/ss.c b/misc/ss.c
index 23fff19d9199..1925c6fd9c36 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2264,8 +2264,12 @@ static int show_one_inet_sock(const struct sockaddr_nl *addr,
 	if (!(diag_arg->f->families & (1 << r->idiag_family)))
 		return 0;
 	if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
-		if (errno == EOPNOTSUPP || errno == ENOENT) {
-			/* Socket can't be closed, or is already closed. */
+		if (errno == ENOENT) {
+			/* socket is already closed. */
+			return 0;
+		/* Socket can't be closed OR config is not enabled */
+		} else if (errno == EOPNOTSUPP) {
+			perror("SOCK_DESTROY answers");
 			return 0;
 		} else {
 			perror("SOCK_DESTROY answers");
Re: [PATCH 1/2] net: ethernet: ftgmac100: use phydev from struct net_device
From: Philippe Reynes
Date: Mon, 16 May 2016 01:35:13 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH 2/2] net: ethernet: ftgmac100: use phy_ethtool_{get|set}_link_ksettings
From: Philippe Reynes
Date: Mon, 16 May 2016 01:35:14 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
>
> Signed-off-by: Philippe Reynes

Applied.
Regarding vxlan unicast configuration
Hi,

I am trying a vxlan unicast configuration on back-to-back connected
interfaces of VM1 and VM2, both running Fedora 23 with a 4.2 kernel.
Below is my configuration.

VM1

ip address add 100.1.1.1/24 dev enp0s8
ifconfig enp0s8 up
ip link add name vxlan42 type vxlan id 42 dev enp0s8 remote 50.1.1.2 local 50.1.1.1 dstport 4789
ip address add 50.1.1.1/24 dev vxlan42
ip link set up vxlan42

VM2

ip address add 100.1.1.2/24 dev enp0s8
ifconfig enp0s8 up
ip link add name vxlan42 type vxlan id 42 dev enp0s8 remote 50.1.1.1 local 50.1.1.2 dstport 4789
ip address add 50.1.1.2/24 dev vxlan42
ip link set up vxlan42

Now when I try to ping 50.1.1.1 from VM2, I receive ARP packets on VM1
which are not vxlan tagged. As a result ping is not working.

I am able to successfully configure multicast based vxlan but am having
issues with vxlan unicast. Is there something wrong with my configuration?

Regards,
Ajith
Re: [PATCH 2/2] net: ethernet: gianfar: use phy_ethtool_{get|set}_link_ksettings
From: Philippe Reynes
Date: Mon, 16 May 2016 01:30:09 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
>
> Signed-off-by: Philippe Reynes

Applied.
Re: [PATCH net-next] tuntap: introduce tx skb ring
On 2016年05月16日 16:08, Michael S. Tsirkin wrote: On Mon, May 16, 2016 at 03:52:11PM +0800, Jason Wang wrote: On 2016年05月16日 12:23, Michael S. Tsirkin wrote: On Mon, May 16, 2016 at 09:17:01AM +0800, Jason Wang wrote: We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by using circular buffer which allows lockless synchronization. This is done by switching from sk_receive_queue to a tx skb ring with a new flag IFF_TX_RING and when this is set: Why do we need a new flag? Is there a userspace-visible behaviour change? Probably yes since tx_queue_length does not work. So the flag name should reflect the behaviour somehow, not the implementation. - store pointer to skb in circular buffer in tun_net_xmit(), and read it from the circular buffer in tun_do_read(). - introduce a new proto_ops peek which could be implemented by specific socket which does not use sk_receive_queue. - store skb length in circular buffer too, and implement a lockless peek for tuntap. - change vhost_net to use proto_ops->peek() instead - new spinlocks were introduced to synchronize among producers (and so did for consumers). Pktgen test shows about 9% improvement on guest receiving pps: Before: ~148pps After : ~161pps (I'm not sure noblocking read is still needed, so it was not included in this patch) How do you mean? Of course we must support blocking and non-blocking read - userspace uses it. Ok, will add this. 
Signed-off-by: Jason Wang--- --- drivers/net/tun.c | 157 +--- drivers/vhost/net.c | 16 - include/linux/net.h | 1 + include/uapi/linux/if_tun.h | 1 + 4 files changed, 165 insertions(+), 10 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 425e983..6001ece 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -130,6 +131,8 @@ struct tap_filter { #define MAX_TAP_FLOWS 4096 #define TUN_FLOW_EXPIRE (3 * HZ) +#define TUN_RING_SIZE 256 Can we resize this according to tx_queue_len set by user? We can, but it needs lots of other changes, e.g being notified when tx_queue_len was changed by user. Some kind of notifier? Yes, maybe. Probably better than a new user interface. Ok. And if tx_queue_length is not power of 2, we probably need modulus to calculate the capacity. Is that really that important for speed? Not sure, I can test. If yes, round it up to next power of two. Right, this sounds a good solution. You can also probably wrap it with a conditional instead. +#define TUN_RING_MASK (TUN_RING_SIZE - 1) struct tun_pcpu_stats { u64 rx_packets; @@ -142,6 +145,11 @@ struct tun_pcpu_stats { u32 rx_frame_errors; }; +struct tun_desc { + struct sk_buff *skb; + int len; /* Cached skb len for peeking */ +}; + /* A tun_file connects an open character device to a tuntap netdevice. It * also contains all socket related structures (except sock_fprog and tap_filter) * to serve as one transmit queue for tuntap device. 
The sock_fprog and @@ -167,6 +175,13 @@ struct tun_file { }; struct list_head next; struct tun_struct *detached; + /* reader lock */ + spinlock_t rlock; + unsigned long tail; + struct tun_desc tx_descs[TUN_RING_SIZE]; + /* writer lock */ + spinlock_t wlock; + unsigned long head; }; struct tun_flow_entry { @@ -515,7 +530,27 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile) static void tun_queue_purge(struct tun_file *tfile) { + unsigned long head, tail; + struct tun_desc *desc; + struct sk_buff *skb; skb_queue_purge(>sk.sk_receive_queue); + spin_lock(>rlock); + + head = ACCESS_ONCE(tfile->head); + tail = tfile->tail; + + /* read tail before reading descriptor at tail */ + smp_rmb(); I think you mean read *head* here Right. + + while (CIRC_CNT(head, tail, TUN_RING_SIZE) >= 1) { + desc = >tx_descs[tail]; + skb = desc->skb; + kfree_skb(skb); + tail = (tail + 1) & TUN_RING_MASK; + /* read descriptor before incrementing tail. */ + smp_store_release(>tail, tail & TUN_RING_MASK); + } + spin_unlock(>rlock); skb_queue_purge(>sk.sk_error_queue); } Barrier pairing seems messed up. Could you tag each barrier with its pair pls? E.g. add /* Barrier A for pairing */ Before barrier and its pair. Ok. for both tun_queue_purge() and tun_do_read(): smp_rmb() is paired with smp_store_release() in tun_net_xmit(). this seems at least an overkill. rmb would normally be paired with wmb, not a full mb within release. wmb is not enough here. We need
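On the ring-size question raised earlier in this mail (rounding a user-supplied length up to the next power of two so that index wrap stays a mask rather than a modulus), a minimal userspace sketch might look like the following. The names `ring_roundup_pow2`, `ring_next` and `RING_CNT` are made up for illustration; `RING_CNT` only mirrors the arithmetic of the kernel's `CIRC_CNT()` and is valid for power-of-two sizes only.

```c
#include <assert.h>
#include <limits.h>

/* Same arithmetic as CIRC_CNT() in include/linux/circ_buf.h:
 * number of filled slots, correct only when size is a power of two. */
#define RING_CNT(head, tail, size) (((head) - (tail)) & ((size) - 1))

/* Hypothetical helper: smallest power of two >= n. */
unsigned long ring_roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	/* Reject 0 and values whose next power of two would overflow. */
	assert(n > 0 && n <= (ULONG_MAX >> 1) + 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical helper: advance an index with a mask instead of a modulus. */
unsigned long ring_next(unsigned long idx, unsigned long size)
{
	assert((size & (size - 1)) == 0); /* size must be a power of two */
	return (idx + 1) & (size - 1);
}
```

With a requested length of 500, `ring_roundup_pow2(500)` gives 512, so the ring wastes a few slots but keeps the cheap mask; whether that matters for throughput is exactly what the thread leaves open for testing.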
[PATCH 4.5 036/101] VSOCK: do not disconnect socket when peer has shutdown SEND only
4.5-stable review patch. If anyone has any objections, please let me know.

--

From: Ian Campbell

[ Upstream commit dedc58e067d8c379a15a8a183c5db318201295bb ]

The peer may be expecting a reply having sent a request and then done a
shutdown(SHUT_WR), so tearing down the whole socket at this point seems
wrong and breaks for me with a client which does a SHUT_WR.

Looking at other socket family's stream_recvmsg callbacks doing a shutdown
here does not seem to be the norm and removing it does not seem to have
had any adverse effects that I can see.

I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact
on the vmci transport.

Signed-off-by: Ian Campbell
Cc: "David S. Miller"
Cc: Stefan Hajnoczi
Cc: Claudio Imbrenda
Cc: Andy King
Cc: Dmitry Torokhov
Cc: Jorgen Hansen
Cc: Adit Ranadive
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/vmw_vsock/af_vsock.c | 21 +
 1 file changed, 1 insertion(+), 20 deletions(-)

--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1789,27 +1789,8 @@ vsock_stream_recvmsg(struct socket *sock
 	else if (sk->sk_shutdown & RCV_SHUTDOWN)
 		err = 0;
 
-	if (copied > 0) {
-		/* We only do these additional bookkeeping/notification steps
-		 * if we actually copied something out of the queue pair
-		 * instead of just peeking ahead.
-		 */
-
-		if (!(flags & MSG_PEEK)) {
-			/* If the other side has shutdown for sending and there
-			 * is nothing more to read, then modify the socket
-			 * state.
-			 */
-			if (vsk->peer_shutdown & SEND_SHUTDOWN) {
-				if (vsock_stream_has_data(vsk) <= 0) {
-					sk->sk_state = SS_UNCONNECTED;
-					sock_set_flag(sk, SOCK_DONE);
-					sk->sk_state_change(sk);
-				}
-			}
-		}
+	if (copied > 0)
 		err = copied;
-	}
 
 out:
 	release_sock(sk);
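The half-close pattern the commit message describes (send a request, shutdown(SHUT_WR), then wait for the reply) can be sketched in userspace as below. This is only an illustration: it uses an AF_UNIX socketpair as a stand-in because AF_VSOCK needs a hypervisor transport, and the function name `half_close_roundtrip` is made up for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a reply can still be received after the requester has
 * shut down its sending direction, -1 otherwise. */
int half_close_roundtrip(void)
{
	int sv[2];
	char buf[8];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Requester: send the request, then signal "no more data from me". */
	if (write(sv[0], "ping", 4) != 4 || shutdown(sv[0], SHUT_WR) < 0)
		goto out;

	/* Responder: read the request, observe EOF, and still reply. */
	if (read(sv[1], buf, sizeof(buf)) != 4)
		goto out;
	if (read(sv[1], buf, sizeof(buf)) != 0)
		goto out;
	if (write(sv[1], "pong", 4) != 4)
		goto out;

	/* Requester: the receive direction must still be usable. */
	if (read(sv[0], buf, sizeof(buf)) == 4 && memcmp(buf, "pong", 4) == 0)
		ret = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

A caller would simply check `half_close_roundtrip() == 0`. The removed hunk made the vsock equivalent of this exchange fail, since the receiving side marked the socket SS_UNCONNECTED as soon as the peer's SEND_SHUTDOWN was seen with no data left.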
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On Tue, May 17, 2016 at 10:14 AM, David Ahern wrote:
>>
>> For example, EOPNOTSUPP can just mean "this socket can't be closed
>> because it's a timewait or NEW_SYN_RECV socket". In hindsight it might
>> have been better to return EBADFD in those cases, but that still
>> doesn't solve the UI problem. If the user does something like "ss -K
>> dport = :443", the user would expect the command to kill all TCP
>> sockets and not just abort if there happens to be a UDP socket to port
>> 443 (which can't be closed because UDP doesn't currently implement
>> SOCK_DESTROY).
>
>
> Silently doing nothing is just as bad - or worse. I was running in circles
> trying to figure out why nothing was happening and ss was exiting 0.

At least that's documented to be the case in the man page.

On the other hand, if your patch is applied, there will be no way to close more than one socket if one of them returns EOPNOTSUPP. On a busy server where things go into TIME_WAIT all the time, you might never be able to close all sockets.

If you want to inform the user, then you could do so via the return value of ss - e.g., return 0 if at least one socket was printed and closed, or 1 otherwise.
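The exit-status idea in the last paragraph could be sketched roughly as follows. This is not code from ss; `struct kill_stats`, `kill_stats_record` and `kill_exit_status` are hypothetical names used only to show the policy "0 if at least one socket was closed, 1 otherwise" while still counting the sockets that answered EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-run counters; not part of the real ss source. */
struct kill_stats {
	unsigned int closed;      /* SOCK_DESTROY succeeded */
	unsigned int unsupported; /* SOCK_DESTROY answered EOPNOTSUPP */
};

/* Record the outcome of one kill attempt; err is 0 or an errno value. */
void kill_stats_record(struct kill_stats *st, int err)
{
	assert(st != NULL);
	if (err == 0)
		st->closed++;
	else if (err == EOPNOTSUPP)
		st->unsupported++;
}

/* Proposed policy: exit 0 if anything was closed, 1 otherwise. */
int kill_exit_status(const struct kill_stats *st)
{
	assert(st != NULL);
	return st->closed > 0 ? 0 : 1;
}
```

The `unsupported` counter is an extra assumption beyond what the mail proposes; it would let the tool print a one-line summary without aborting the dump.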
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On 5/16/16 7:01 PM, Lorenzo Colitti wrote:
> On Tue, May 17, 2016 at 8:53 AM, David Ahern wrote:
>> @@ -2264,7 +2264,7 @@ static int show_one_inet_sock(const struct sockaddr_nl *addr,
>>  	if (!(diag_arg->f->families & (1 << r->idiag_family)))
>>  		return 0;
>>  	if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
>> -		if (errno == EOPNOTSUPP || errno == ENOENT) {
>> +		if (errno == ENOENT) {
>>  			/* Socket can't be closed, or is already closed. */
>>  			return 0;
>>  		} else {
>
> I don't think you can do this without breaking the functionality of
> -K. The else branch will cause show_one_inet_sock to return -1, which
> will cause rtnl_dump_filter to abort and not close any other sockets
> that the user requested killing. That's incorrect, because getting
> EOPNOTSUPP on one socket doesn't necessarily mean we'll get EOPNOTSUPP
> on any future sockets in the same dump.
>
> For example, EOPNOTSUPP can just mean "this socket can't be closed
> because it's a timewait or NEW_SYN_RECV socket". In hindsight it might
> have been better to return EBADFD in those cases, but that still
> doesn't solve the UI problem. If the user does something like "ss -K
> dport = :443", the user would expect the command to kill all TCP
> sockets and not just abort if there happens to be a UDP socket to port
> 443 (which can't be closed because UDP doesn't currently implement
> SOCK_DESTROY).

Silently doing nothing is just as bad - or worse. I was running in circles trying to figure out why nothing was happening and ss was exiting 0.
Re: [PATCH] net: diag: Tell user if support for destroying TCP sockets is not enabled
On 5/16/16 6:49 PM, Lorenzo Colitti wrote:
> On Tue, May 17, 2016 at 8:53 AM, David Ahern wrote:
>> +#else
>> +static int tcp_diag_destroy(struct sk_buff *in_skb,
>> +			    const struct inet_diag_req_v2 *req)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>>  #endif
>
> I don't understand why you need this. inet_diag_cmd_exact already
> returns EOPNOTSUPP if tcp_diag_handler.destroy is NULL:
>
>         else if (cmd == SOCK_DIAG_BY_FAMILY)
>                 err = handler->dump_one(in_skb, nlh, req);
>         else if (cmd == SOCK_DESTROY && handler->destroy)
>                 err = handler->destroy(in_skb, req);
>         else
>                 err = -EOPNOTSUPP;
>
> Is this not working for some reason?

hmmm kernel patch is not needed. Suppression was happening in ss.
Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
On Tue, May 17, 2016 at 8:53 AM, David Ahern wrote:
> @@ -2264,7 +2264,7 @@ static int show_one_inet_sock(const struct sockaddr_nl *addr,
>  	if (!(diag_arg->f->families & (1 << r->idiag_family)))
>  		return 0;
>  	if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
> -		if (errno == EOPNOTSUPP || errno == ENOENT) {
> +		if (errno == ENOENT) {
>  			/* Socket can't be closed, or is already closed. */
>  			return 0;
>  		} else {

I don't think you can do this without breaking the functionality of -K. The else branch will cause show_one_inet_sock to return -1, which will cause rtnl_dump_filter to abort and not close any other sockets that the user requested killing. That's incorrect, because getting EOPNOTSUPP on one socket doesn't necessarily mean we'll get EOPNOTSUPP on any future sockets in the same dump.

For example, EOPNOTSUPP can just mean "this socket can't be closed because it's a timewait or NEW_SYN_RECV socket". In hindsight it might have been better to return EBADFD in those cases, but that still doesn't solve the UI problem. If the user does something like "ss -K dport = :443", the user would expect the command to kill all TCP sockets and not just abort if there happens to be a UDP socket to port 443 (which can't be closed because UDP doesn't currently implement SOCK_DESTROY).
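To illustrate the abort behaviour described above, here is a small userspace simulation. It does not call any iproute2 code; `simulate_dump` is a hypothetical stand-in for a dump loop whose per-socket callback either continues or aborts, and the `results` array plays the role of the errno each kill attempt would produce.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Walk simulated per-socket kill results (0 = closed, otherwise an errno
 * value). An error that is not treated as skippable aborts the walk, the
 * way a -1 return from the dump callback would. Returns how many sockets
 * were closed before the walk ended. */
size_t simulate_dump(const int *results, size_t n, int skip_eopnotsupp)
{
	size_t closed = 0;
	size_t i;

	assert(results != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int err = results[i];

		if (err == 0) {
			closed++;
			continue;
		}
		/* Already gone, or (optionally) not destroyable: keep going. */
		if (err == ENOENT || (skip_eopnotsupp && err == EOPNOTSUPP))
			continue;
		break; /* any other error aborts the rest of the dump */
	}
	return closed;
}
```

With results `{0, EOPNOTSUPP, 0, ENOENT, 0}`, skipping EOPNOTSUPP closes three sockets, while treating it as fatal closes only the first one, which is the regression being pointed out.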
Re: [PATCH] net: diag: Tell user if support for destroying TCP sockets is not enabled
On Tue, May 17, 2016 at 8:53 AM, David Ahern wrote:
> +#else
> +static int tcp_diag_destroy(struct sk_buff *in_skb,
> +			    const struct inet_diag_req_v2 *req)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif

I don't understand why you need this. inet_diag_cmd_exact already returns EOPNOTSUPP if tcp_diag_handler.destroy is NULL:

        else if (cmd == SOCK_DIAG_BY_FAMILY)
                err = handler->dump_one(in_skb, nlh, req);
        else if (cmd == SOCK_DESTROY && handler->destroy)
                err = handler->destroy(in_skb, req);
        else
                err = -EOPNOTSUPP;

Is this not working for some reason?
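The point about the NULL callback can be shown with a reduced userspace model of the dispatch quoted above. The types and names here (`struct diag_handler`, `diag_cmd_exact`, `CMD_DUMP_ONE`, `CMD_DESTROY`, `diag_stub_ok`) are simplified stand-ins invented for the sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins for SOCK_DIAG_BY_FAMILY and SOCK_DESTROY. */
enum diag_cmd { CMD_DUMP_ONE, CMD_DESTROY };

/* Reduced handler: destroy may be NULL when the feature is compiled out. */
struct diag_handler {
	int (*dump_one)(void);
	int (*destroy)(void);
};

/* Same shape as the quoted if/else chain: a missing optional callback
 * falls through to -EOPNOTSUPP without needing a stub function. */
int diag_cmd_exact(const struct diag_handler *h, enum diag_cmd cmd)
{
	assert(h != NULL && h->dump_one != NULL);
	if (cmd == CMD_DUMP_ONE)
		return h->dump_one();
	else if (cmd == CMD_DESTROY && h->destroy)
		return h->destroy();
	return -EOPNOTSUPP;
}

/* Trivial callback used to populate a handler in examples. */
int diag_stub_ok(void)
{
	return 0;
}
```

A handler built as `{ diag_stub_ok, NULL }` yields `-EOPNOTSUPP` for `CMD_DESTROY`, so an `#else` stub returning the same value adds nothing at this layer.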
linux-next: manual merge of the net-next tree with the arm64 tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/arm64/Kconfig

between commit:

  8ee708792e1c ("arm64: Kconfig: remove redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE definition")

from the arm64 tree and commit:

  606b5908 ("bpf: split HAVE_BPF_JIT into cBPF and eBPF variant")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/arm64/Kconfig
index 8845c0d100d7,e6761ea2feec..
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@@ -59,9 -58,7 +59,9 @@@ config ARM6
  	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
  	select HAVE_ARCH_SECCOMP_FILTER
  	select HAVE_ARCH_TRACEHOOK
 +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 +	select HAVE_ARM_SMCCC
- 	select HAVE_BPF_JIT
+ 	select HAVE_EBPF_JIT
  	select HAVE_C_RECORDMCOUNT
  	select HAVE_CC_STACKPROTECTOR
  	select HAVE_CMPXCHG_DOUBLE
Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions
On Wed, May 11, 2016 at 3:00 PM, Dean Jenkins wrote:
>
> Your observations are consistent with missing URBs from the USB host
> controller.
>
> Here is a summary of what I think is happening in your case:
>
> Good case:
> URB #1: 1514 octets of 1514 Ethernet frame (A)
> URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514 Ethernet frame (C)
> URB #3: 988 octets of 1514 Ethernet frame (C)
> URB #4: 1514 octets of 1514 Ethernet frame (D)
>
> Therefore, Ethernet frame (C) is spanning URBs #2 and #3.
>
> Bad case, URB #3 is lost:
> URB #1: 1514 octets of 1514 Ethernet frame (A)
> URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514 Ethernet frame (C)
> Remaining is 988
> URB #4: 1514 octets of 1514 Ethernet frame (D)
>
> But when URB #4 is analysed the 32-bit Header word is not found after 988
> octets in the URB buffer so "sync lost".
> The end of Ethernet frame (C) is missing so drop the Ethernet frame.
> Now look at the start of the URB #4 buffer and find a 32-bit header word so
> Ethernet frame (D) can be consumed.
>
> So I think the commit is acting as intended and you are suffering from lost
> URBs.

No. I went digging on this for a bit longer, and it looks like it's just that you're calculating the offset wrong in your check.

I was wondering why without your patch we wouldn't see "Bad Header Length" messages, since if the remaining was 988 and the skb->len was 2048 as seen in my logs, without your patch we should copy the 988 bytes out, clear remaining, and then continue processing the rest of the skb, which calculates the header and checks the size. If we really lost the URB, we should throw an error at that point, since really we'd be midway through the following frame. But we just don't see that with your patch removed.
Looking more closely, in the main loop (where offset is zero, or was set to "offset += (copy_length + 1) & 0xfffe" in the previous iteration), we do:

        rx->header = get_unaligned_le32(skb->data + offset);
        offset += sizeof(u32);

But your check calculates:

        offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);
        rx->header = get_unaligned_le32(skb->data + offset);

Adding some debug logic to check the offset calculations used to find rx->header, the one in your code is always too large by sizeof(u32). So removing the extra addition in your offset calculation seems to solve this for me.

I'll send out a patch here shortly.

thanks
-john
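To make the sizeof(u32) discrepancy concrete, here is a small userspace sketch that evaluates both offset expressions and checks which one lands on a header word placed right after the remaining payload. The function names are invented for this example, `read_le32` is a stand-in for get_unaligned_le32(), and the marker value is an arbitrary test pattern rather than a real ASIX header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where the main loop expects the next header: the remaining payload
 * length rounded up to an even number of bytes. */
size_t header_offset_main_loop(uint16_t remaining)
{
	return (size_t)((remaining + 1) & 0xfffe);
}

/* Offset as computed by the check under discussion (extra sizeof(u32)). */
size_t header_offset_in_check(uint16_t remaining)
{
	return (size_t)((remaining + 1) & 0xfffe) + sizeof(uint32_t);
}

/* Userspace stand-in for get_unaligned_le32(). */
uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Put a marker word directly after the remaining payload in a zeroed
 * buffer and report whether the chosen formula reads it back (1) or
 * misses it (0). */
int header_found(uint16_t remaining, int use_check_formula)
{
	uint8_t buf[2048] = { 0 };
	const uint32_t marker = 0x021505eaU; /* arbitrary test pattern */
	size_t at = header_offset_main_loop(remaining);
	size_t off = use_check_formula ? header_offset_in_check(remaining) : at;

	assert(at + 2 * sizeof(uint32_t) <= sizeof(buf));
	buf[at] = (uint8_t)(marker & 0xff);
	buf[at + 1] = (uint8_t)((marker >> 8) & 0xff);
	buf[at + 2] = (uint8_t)((marker >> 16) & 0xff);
	buf[at + 3] = (uint8_t)((marker >> 24) & 0xff);
	return read_le32(buf + off) == marker;
}
```

For the `remaining 988` case from the logs, the two formulas give 988 and 992 respectively, so the check reads four bytes past the header and reports a lost sync even though the data is intact.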
[PATCH v2 net-next] bpf: arm64: remove callee-save registers use for tmp registers
In the current implementation of ARM64 eBPF JIT, R23 and R24 are used for tmp registers, which are callee-saved registers. This leads to variable size of JIT prologue and epilogue. The latest blinding constant change prefers to constant size of prologue and epilogue. AAPCS reserves R9 ~ R15 for temp registers which not need to be saved/restored during function call. So, replace R23 and R24 to R10 and R11, and remove tmp_used flag to save 2 instructions for some jited BPF program. CC: Daniel BorkmannAcked-by: Zi Shen Lim Signed-off-by: Yang Shi --- Changelog v1 --> v2: * Updated stack diagram * Added the comment from Zi for the commit log * Added Zi's Acked-by Apply on top of Daniel's blinding constant patchset arch/arm64/net/bpf_jit_comp.c | 34 +- 1 file changed, 5 insertions(+), 29 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index d0d5190..49ba37e 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -51,9 +51,9 @@ static const int bpf2a64[] = { [BPF_REG_9] = A64_R(22), /* read-only frame pointer to access stack */ [BPF_REG_FP] = A64_R(25), - /* temporary register for internal BPF JIT */ - [TMP_REG_1] = A64_R(23), - [TMP_REG_2] = A64_R(24), + /* temporary registers for internal BPF JIT */ + [TMP_REG_1] = A64_R(10), + [TMP_REG_2] = A64_R(11), /* temporary register for blinding constants */ [BPF_REG_AX] = A64_R(9), }; @@ -61,7 +61,6 @@ static const int bpf2a64[] = { struct jit_ctx { const struct bpf_prog *prog; int idx; - int tmp_used; int epilogue_offset; int *offset; u32 *image; @@ -154,8 +153,6 @@ static void build_prologue(struct jit_ctx *ctx) const u8 r8 = bpf2a64[BPF_REG_8]; const u8 r9 = bpf2a64[BPF_REG_9]; const u8 fp = bpf2a64[BPF_REG_FP]; - const u8 tmp1 = bpf2a64[TMP_REG_1]; - const u8 tmp2 = bpf2a64[TMP_REG_2]; /* * BPF prog stack layout @@ -167,7 +164,7 @@ static void build_prologue(struct jit_ctx *ctx) *| ... 
| callee saved registers *+-+ *| | x25/x26 -* BPF fp register => -80:+-+ <= (BPF_FP) +* BPF fp register => -64:+-+ <= (BPF_FP) *| | *| ... | BPF prog stack *| | @@ -189,8 +186,6 @@ static void build_prologue(struct jit_ctx *ctx) /* Save callee-saved register */ emit(A64_PUSH(r6, r7, A64_SP), ctx); emit(A64_PUSH(r8, r9, A64_SP), ctx); - if (ctx->tmp_used) - emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx); /* Save fp (x25) and x26. SP requires 16 bytes alignment */ emit(A64_PUSH(fp, A64_R(26), A64_SP), ctx); @@ -210,8 +205,6 @@ static void build_epilogue(struct jit_ctx *ctx) const u8 r8 = bpf2a64[BPF_REG_8]; const u8 r9 = bpf2a64[BPF_REG_9]; const u8 fp = bpf2a64[BPF_REG_FP]; - const u8 tmp1 = bpf2a64[TMP_REG_1]; - const u8 tmp2 = bpf2a64[TMP_REG_2]; /* We're done with BPF stack */ emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx); @@ -220,8 +213,6 @@ static void build_epilogue(struct jit_ctx *ctx) emit(A64_POP(fp, A64_R(26), A64_SP), ctx); /* Restore callee-saved register */ - if (ctx->tmp_used) - emit(A64_POP(tmp1, tmp2, A64_SP), ctx); emit(A64_POP(r8, r9, A64_SP), ctx); emit(A64_POP(r6, r7, A64_SP), ctx); @@ -317,7 +308,6 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx) emit(A64_UDIV(is64, dst, dst, src), ctx); break; case BPF_MOD: - ctx->tmp_used = 1; emit(A64_UDIV(is64, tmp, dst, src), ctx); emit(A64_MUL(is64, tmp, tmp, src), ctx); emit(A64_SUB(is64, dst, dst, tmp), ctx); @@ -390,49 +380,41 @@ emit_bswap_uxt: /* dst = dst OP imm */ case BPF_ALU | BPF_ADD | BPF_K: case BPF_ALU64 | BPF_ADD | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_ADD(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_SUB | BPF_K: case BPF_ALU64 | BPF_SUB | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_SUB(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_AND | BPF_K: case BPF_ALU64 | BPF_AND | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_AND(is64, dst, dst, tmp), ctx);
Re: [PATCH v6 net-next 14/14] ip4ip6: Support for GSO/GRO
On Mon, May 16, 2016 at 2:33 PM, Tom Herbertwrote: > Signed-off-by: Tom Herbert > --- > include/net/inet_common.h | 5 + > net/ipv4/af_inet.c| 12 +++- > net/ipv6/ip6_offload.c| 33 - > net/ipv6/ip6_tunnel.c | 3 +++ > 4 files changed, 47 insertions(+), 6 deletions(-) > > diff --git a/include/net/inet_common.h b/include/net/inet_common.h > index 109e3ee..5d68342 100644 > --- a/include/net/inet_common.h > +++ b/include/net/inet_common.h > @@ -39,6 +39,11 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short > family, > int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, > int *addr_len); > > +struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff > *skb); > +int inet_gro_complete(struct sk_buff *skb, int nhoff); > +struct sk_buff *inet_gso_segment(struct sk_buff *skb, > +netdev_features_t features); > + > static inline void inet_ctl_sock_destroy(struct sock *sk) > { > if (sk) > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > index 25040b1..377424e 100644 > --- a/net/ipv4/af_inet.c > +++ b/net/ipv4/af_inet.c > @@ -1192,8 +1192,8 @@ int inet_sk_rebuild_header(struct sock *sk) > } > EXPORT_SYMBOL(inet_sk_rebuild_header); > > -static struct sk_buff *inet_gso_segment(struct sk_buff *skb, > - netdev_features_t features) > +struct sk_buff *inet_gso_segment(struct sk_buff *skb, > +netdev_features_t features) > { > bool udpfrag = false, fixedid = false, encap; > struct sk_buff *segs = ERR_PTR(-EINVAL); > @@ -1280,9 +1280,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff > *skb, > out: > return segs; > } > +EXPORT_SYMBOL(inet_gso_segment); > > -static struct sk_buff **inet_gro_receive(struct sk_buff **head, > -struct sk_buff *skb) > +struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb) > { > const struct net_offload *ops; > struct sk_buff **pp = NULL; > @@ -1398,6 +1398,7 @@ out: > > return pp; > } > +EXPORT_SYMBOL(inet_gro_receive); > > static struct sk_buff **ipip_gro_receive(struct sk_buff **head, 
> struct sk_buff *skb) > @@ -1449,7 +1450,7 @@ int inet_recv_error(struct sock *sk, struct msghdr > *msg, int len, int *addr_len) > return -EINVAL; > } > > -static int inet_gro_complete(struct sk_buff *skb, int nhoff) > +int inet_gro_complete(struct sk_buff *skb, int nhoff) > { > __be16 newlen = htons(skb->len - nhoff); > struct iphdr *iph = (struct iphdr *)(skb->data + nhoff); > @@ -1479,6 +1480,7 @@ out_unlock: > > return err; > } > +EXPORT_SYMBOL(inet_gro_complete); > > static int ipip_gro_complete(struct sk_buff *skb, int nhoff) > { > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > index 332d6a0..22e90e5 100644 > --- a/net/ipv6/ip6_offload.c > +++ b/net/ipv6/ip6_offload.c > @@ -16,6 +16,7 @@ > > #include > #include > +#include > > #include "ip6_offload.h" > > @@ -268,6 +269,21 @@ static struct sk_buff **sit_ip6ip6_gro_receive(struct > sk_buff **head, > return ipv6_gro_receive(head, skb); > } > > +static struct sk_buff **ip4ip6_gro_receive(struct sk_buff **head, > + struct sk_buff *skb) > +{ > + /* Common GRO receive for SIT and IP6IP6 */ > + > + if (NAPI_GRO_CB(skb)->encap_mark) { > + NAPI_GRO_CB(skb)->flush = 1; > + return NULL; > + } > + > + NAPI_GRO_CB(skb)->encap_mark = 1; > + > + return inet_gro_receive(head, skb); > +} > + > static int ipv6_gro_complete(struct sk_buff *skb, int nhoff) > { > const struct net_offload *ops; > @@ -307,6 +323,13 @@ static int ip6ip6_gro_complete(struct sk_buff *skb, int > nhoff) > return ipv6_gro_complete(skb, nhoff); > } > > +static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff) > +{ > + skb->encapsulation = 1; > + skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; > + return inet_gro_complete(skb, nhoff); > +} > + > static struct packet_offload ipv6_packet_offload __read_mostly = { > .type = cpu_to_be16(ETH_P_IPV6), > .callbacks = { > @@ -324,6 +347,14 @@ static const struct net_offload sit_offload = { > }, > }; > > +static const struct net_offload ip4ip6_offload = { > + .callbacks = { > + .gso_segment= 
inet_gso_segment, > + .gro_receive= ip4ip6_gro_receive, > + .gro_complete = ip4ip6_gro_complete, > + }, > +}; > + > static const struct net_offload ip6ip6_offload = { > .callbacks = { > .gso_segment= ipv6_gso_segment, > @@ -331,7 +362,6 @@ static const struct net_offload ip6ip6_offload = { >
[PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY
Silent failures are not friendly to the user. If a command is not
supported tell the user about it.

Signed-off-by: David Ahern
---
 misc/ss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index 23fff19d9199..bd7214c85938 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2264,7 +2264,7 @@ static int show_one_inet_sock(const struct sockaddr_nl *addr,
 	if (!(diag_arg->f->families & (1 << r->idiag_family)))
 		return 0;
 	if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
-		if (errno == EOPNOTSUPP || errno == ENOENT) {
+		if (errno == ENOENT) {
 			/* Socket can't be closed, or is already closed. */
 			return 0;
 		} else {
-- 
2.1.4
[PATCH] net: diag: Tell user if support for destroying TCP sockets is not enabled
Commit c1e64e298b8c added support for destroying TCP sockets but it is
wrapped in a config option. If the option is not enabled the user is
given no feedback and ss for example just exits 0 which is not a
friendly UI:

$ ss -4 state established sport = :22
Netid  Recv-Q Send-Q  Local Address:Port   Peer Address:Port
tcp    0      0       10.1.1.2:ssh         192.168.2.50:47438

$ ss -4 -K state established sport = :22 dport = :47438
Netid  Recv-Q Send-Q  Local Address:Port   Peer Address:Port

(nothing else in the output and the connection lives on).

Fix by returning an error to the user if the config option is not
enabled:

$ ss -4 -K state established sport = :22 dport = :47450
Netid  Recv-Q Send-Q  Local Address:Port   Peer Address:Port
SOCK_DESTROY answers: Operation not supported

Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: David Ahern
---
 net/ipv4/tcp_diag.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index 4d610934fb39..99590423d468 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -60,6 +60,12 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 
 	return sock_diag_destroy(sk, ECONNABORTED);
 }
+#else
+static int tcp_diag_destroy(struct sk_buff *in_skb,
+			    const struct inet_diag_req_v2 *req)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 static const struct inet_diag_handler tcp_diag_handler = {
@@ -68,9 +74,7 @@ static const struct inet_diag_handler tcp_diag_handler = {
 	.idiag_get_info	 = tcp_diag_get_info,
 	.idiag_type	 = IPPROTO_TCP,
 	.idiag_info_size = sizeof(struct tcp_info),
-#ifdef CONFIG_INET_DIAG_DESTROY
 	.destroy	 = tcp_diag_destroy,
-#endif
 };
 
 static int __init tcp_diag_init(void)
-- 
2.1.4
Re: [PATCH v6 net-next 13/14] ip6ip6: Support for GSO/GRO
On Mon, May 16, 2016 at 2:33 PM, Tom Herbertwrote: > Signed-off-by: Tom Herbert > --- > net/ipv6/ip6_offload.c | 24 +--- > net/ipv6/ip6_tunnel.c | 3 +++ > 2 files changed, 24 insertions(+), 3 deletions(-) > > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > index 787e55f..332d6a0 100644 > --- a/net/ipv6/ip6_offload.c > +++ b/net/ipv6/ip6_offload.c > @@ -253,9 +253,11 @@ out: > return pp; > } > > -static struct sk_buff **sit_gro_receive(struct sk_buff **head, > - struct sk_buff *skb) > +static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head, > + struct sk_buff *skb) > { > + /* Common GRO receive for SIT and IP6IP6 */ > + > if (NAPI_GRO_CB(skb)->encap_mark) { > NAPI_GRO_CB(skb)->flush = 1; > return NULL; > @@ -298,6 +300,13 @@ static int sit_gro_complete(struct sk_buff *skb, int > nhoff) > return ipv6_gro_complete(skb, nhoff); > } > > +static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff) > +{ > + skb->encapsulation = 1; > + skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; > + return ipv6_gro_complete(skb, nhoff); > +} > + > static struct packet_offload ipv6_packet_offload __read_mostly = { > .type = cpu_to_be16(ETH_P_IPV6), > .callbacks = { > @@ -310,11 +319,19 @@ static struct packet_offload ipv6_packet_offload > __read_mostly = { > static const struct net_offload sit_offload = { > .callbacks = { > .gso_segment= ipv6_gso_segment, > - .gro_receive= sit_gro_receive, > + .gro_receive= sit_ip6ip6_gro_receive, > .gro_complete = sit_gro_complete, > }, > }; > > +static const struct net_offload ip6ip6_offload = { > + .callbacks = { > + .gso_segment= ipv6_gso_segment, > + .gro_receive= sit_ip6ip6_gro_receive, > + .gro_complete = ip6ip6_gro_complete, > + }, > +}; > + > static int __init ipv6_offload_init(void) > { > > @@ -326,6 +343,7 @@ static int __init ipv6_offload_init(void) > dev_add_offload(_packet_offload); > > inet_add_offload(_offload, IPPROTO_IPV6); > + inet6_add_offload(_offload, IPPROTO_IPV6); > > return 0; > } > diff --git 
a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c > index 8076c7a..d205f17 100644 > --- a/net/ipv6/ip6_tunnel.c > +++ b/net/ipv6/ip6_tunnel.c > @@ -1238,6 +1238,9 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device > *dev) > if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) > fl6.flowi6_mark = skb->mark; > > + if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6)) > + return -1; > + > err = ip6_tnl_xmit(skb, dev, dsfield, , encap_limit, , >IPPROTO_IPV6); > if (err != 0) { So one piece you are missing here is skb_set_inner_ipproto(IPPROTO_IPV6). Without that the tunnel offload could be a bit confused as the inner protocol type defaults to ENCAP_TYPE_ETHER. - Alex
Re: [PATCH net-next] bpf: arm64: remove callee-save registers use for tmp registers
On 5/16/2016 4:45 PM, Z Lim wrote:
> Hi Yang,
>
> On Mon, May 16, 2016 at 4:09 PM, Yang Shi wrote:
>> In the current implementation of ARM64 eBPF JIT, R23 and R24 are used for
>> tmp registers, which are callee-saved registers. This leads to variable size
>> of JIT prologue and epilogue. The latest blinding constant change prefers to
>> constant size of prologue and epilogue. AAPCS reserves R9 ~ R15 for temp
>> registers which not need to be saved/restored during function call. So,
>> replace R23 and R24 to R10 and R11, and remove tmp_used flag.
>>
>> CC: Zi Shen Lim
>> CC: Daniel Borkmann
>> Signed-off-by: Yang Shi
>> ---
>
> Couple suggestions, but otherwise:
> Acked-by: Zi Shen Lim
>
> 1. Update the diagram. I think it should now be:
>
> -* BPF fp register => -80:+-+ <= (BPF_FP)
> +* BPF fp register => -64:+-+ <= (BPF_FP)

Nice catch. I forgot the stack diagram.

> 2. Add a comment in commit log along the lines of: this is an
> optimization saving 2 instructions per jited BPF program.

Sure, will address in V2.

Thanks,
Yang

> Thanks :)
>
> z
>
>> Apply on top of Daniel's blinding constant patchset.
>>
>>  arch/arm64/net/bpf_jit_comp.c | 32
>>  1 file changed, 4 insertions(+), 28 deletions(-)
Re: [ethtool 0/3][pull request] Intel Wired LAN Driver Updates 2016-05-03
On Wed, 2016-05-04 at 09:44 -0700, Jeff Kirsher wrote:
> This series contains updates to ixgbe in ethtool.
>
> Preethi adds missing device IDs and mac_type definitions, also updated
> the display registers for x550, x550em_x/a. Cleaned up the format string
> storage by taking advantage of "for" loops.
>
> The following are changes since commit deb1c6613ec14fd828d321e38c7bea45fe559bd5:
>   Release version 4.5.
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/ethtool master
>
> Preethi Banala (3):
>   ethtool/ixgbe: Add device ID and mac_type definitions
>   ethtool/ixgbe: Correct offsets and support x550, x550em_x, x550em_a
>   ethtool/ixgbe: Reduce format string storage
>
>  ixgbe.c | 173 +++-----
>  1 file changed, 95 insertions(+), 78 deletions(-)
>

Ping? Ben do you have these changes queued up for ethtool?
Re: [PATCH net-next] bpf: arm64: remove callee-save registers use for tmp registers
Hi Yang,

On Mon, May 16, 2016 at 4:09 PM, Yang Shi wrote:
> In the current implementation of ARM64 eBPF JIT, R23 and R24 are used for
> tmp registers, which are callee-saved registers. This leads to variable size
> of JIT prologue and epilogue. The latest blinding constant change prefers to
> constant size of prologue and epilogue. AAPCS reserves R9 ~ R15 for temp
> registers which not need to be saved/restored during function call. So,
> replace R23 and R24 to R10 and R11, and remove tmp_used flag.
>
> CC: Zi Shen Lim
> CC: Daniel Borkmann
> Signed-off-by: Yang Shi
> ---

Couple suggestions, but otherwise:
Acked-by: Zi Shen Lim

1. Update the diagram. I think it should now be:

-* BPF fp register => -80:+-+ <= (BPF_FP)
+* BPF fp register => -64:+-+ <= (BPF_FP)

2. Add a comment in commit log along the lines of: this is an
optimization saving 2 instructions per jited BPF program.

Thanks :)

z

> Apply on top of Daniel's blinding constant patchset.
>
>  arch/arm64/net/bpf_jit_comp.c | 32
>  1 file changed, 4 insertions(+), 28 deletions(-)
>
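The -80 to -64 change in the diagram follows from simple arithmetic on the number of register pairs the prologue pushes. The sketch below assumes each A64_PUSH stores one 16-byte register pair and that the frame record (FP/LR) is pushed first, which is not shown in the quoted hunks; `bpf_fp_offset` and `A64_PUSH_BYTES` are names made up for this illustration.

```c
#include <assert.h>

/* Assumption: one A64_PUSH stores a pair of 64-bit registers, keeping
 * SP 16-byte aligned. */
#define A64_PUSH_BYTES 16

/* Offset of the BPF frame pointer relative to the original SP, given
 * the number of register pairs pushed by the prologue. */
int bpf_fp_offset(int pushed_pairs)
{
	assert(pushed_pairs >= 0);
	return -(pushed_pairs * A64_PUSH_BYTES);
}
```

Before the patch, a program using the temporaries pushed five pairs (FP/LR, r6/r7, r8/r9, tmp1/tmp2, x25/x26), giving -80; with the tmp1/tmp2 push gone there are always four pairs, giving a constant -64.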
[PATCH net-next] bpf: arm64: remove callee-save registers use for tmp registers
In the current implementation of ARM64 eBPF JIT, R23 and R24 are used for tmp registers, which are callee-saved registers. This leads to variable size of JIT prologue and epilogue. The latest blinding constant change prefers to constant size of prologue and epilogue. AAPCS reserves R9 ~ R15 for temp registers which not need to be saved/restored during function call. So, replace R23 and R24 to R10 and R11, and remove tmp_used flag. CC: Zi Shen LimCC: Daniel Borkmann Signed-off-by: Yang Shi --- Apply on top of Daniel's blinding constant patchset. arch/arm64/net/bpf_jit_comp.c | 32 1 file changed, 4 insertions(+), 28 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index d0d5190..ef3055a 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -51,9 +51,9 @@ static const int bpf2a64[] = { [BPF_REG_9] = A64_R(22), /* read-only frame pointer to access stack */ [BPF_REG_FP] = A64_R(25), - /* temporary register for internal BPF JIT */ - [TMP_REG_1] = A64_R(23), - [TMP_REG_2] = A64_R(24), + /* temporary registers for internal BPF JIT */ + [TMP_REG_1] = A64_R(10), + [TMP_REG_2] = A64_R(11), /* temporary register for blinding constants */ [BPF_REG_AX] = A64_R(9), }; @@ -61,7 +61,6 @@ static const int bpf2a64[] = { struct jit_ctx { const struct bpf_prog *prog; int idx; - int tmp_used; int epilogue_offset; int *offset; u32 *image; @@ -154,8 +153,6 @@ static void build_prologue(struct jit_ctx *ctx) const u8 r8 = bpf2a64[BPF_REG_8]; const u8 r9 = bpf2a64[BPF_REG_9]; const u8 fp = bpf2a64[BPF_REG_FP]; - const u8 tmp1 = bpf2a64[TMP_REG_1]; - const u8 tmp2 = bpf2a64[TMP_REG_2]; /* * BPF prog stack layout @@ -189,8 +186,6 @@ static void build_prologue(struct jit_ctx *ctx) /* Save callee-saved register */ emit(A64_PUSH(r6, r7, A64_SP), ctx); emit(A64_PUSH(r8, r9, A64_SP), ctx); - if (ctx->tmp_used) - emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx); /* Save fp (x25) and x26. 
SP requires 16 bytes alignment */ emit(A64_PUSH(fp, A64_R(26), A64_SP), ctx); @@ -210,8 +205,6 @@ static void build_epilogue(struct jit_ctx *ctx) const u8 r8 = bpf2a64[BPF_REG_8]; const u8 r9 = bpf2a64[BPF_REG_9]; const u8 fp = bpf2a64[BPF_REG_FP]; - const u8 tmp1 = bpf2a64[TMP_REG_1]; - const u8 tmp2 = bpf2a64[TMP_REG_2]; /* We're done with BPF stack */ emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx); @@ -220,8 +213,6 @@ static void build_epilogue(struct jit_ctx *ctx) emit(A64_POP(fp, A64_R(26), A64_SP), ctx); /* Restore callee-saved register */ - if (ctx->tmp_used) - emit(A64_POP(tmp1, tmp2, A64_SP), ctx); emit(A64_POP(r8, r9, A64_SP), ctx); emit(A64_POP(r6, r7, A64_SP), ctx); @@ -317,7 +308,6 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx) emit(A64_UDIV(is64, dst, dst, src), ctx); break; case BPF_MOD: - ctx->tmp_used = 1; emit(A64_UDIV(is64, tmp, dst, src), ctx); emit(A64_MUL(is64, tmp, tmp, src), ctx); emit(A64_SUB(is64, dst, dst, tmp), ctx); @@ -390,49 +380,41 @@ emit_bswap_uxt: /* dst = dst OP imm */ case BPF_ALU | BPF_ADD | BPF_K: case BPF_ALU64 | BPF_ADD | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_ADD(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_SUB | BPF_K: case BPF_ALU64 | BPF_SUB | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_SUB(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_AND | BPF_K: case BPF_ALU64 | BPF_AND | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_AND(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_OR | BPF_K: case BPF_ALU64 | BPF_OR | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_ORR(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_XOR | BPF_K: case BPF_ALU64 | BPF_XOR | BPF_K: - ctx->tmp_used = 1; emit_a64_mov_i(is64, tmp, imm, ctx); emit(A64_EOR(is64, dst, dst, tmp), ctx); break; case BPF_ALU | BPF_MUL | BPF_K: case BPF_ALU64 | BPF_MUL | BPF_K: - ctx->tmp_used = 1;
Re: [PATCH v2] r8169: default to 64-bit DMA on recent PCIe chips
Ard Biesheuvel:
[...]
> This is a followup to 'r8169: default to 64-bit DMA on systems without memory
> below 4 GB' [1]. At the request of Francois, this version bases the decision
> whether to use 64-bit DMA by default on whether the device is PCIe and
> sufficiently recent, rather than whether the platform requires 64-bit DMA
> because it does not have any memory below 4 GB to begin with. This is safer,
> since it will prevent the use of such problematic cards on these platforms.

Testing has not been conclusive. It apparently works but I have not been
able to set addresses above 4Gb for the Rx or Tx descriptor rings yet.

--
Ueimor
Re: [PATCH v6 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Mon, May 16, 2016 at 3:37 PM, Tom Herbert wrote:
> On Mon, May 16, 2016 at 3:25 PM, Alexander Duyck wrote:
>> On Mon, May 16, 2016 at 2:33 PM, Tom Herbert wrote:
>>> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
>>> for getting encap hlen, setting up encap on a tunnel, performing
>>> encapsulation operation.
>>>
>>> Signed-off-by: Tom Herbert
>>> ---
>>> include/net/ip6_tunnel.h | 58
>>> net/ipv4/ip_tunnel_core.c | 5 +++
>>> net/ipv6/ip6_tunnel.c | 85 ++-
>>> 3 files changed, 139 insertions(+), 9 deletions(-)
>>>
>>
>> So it looks like you completely dropped the two spots that were
>> updating mtu and max_headroom with the t->hlen. I thought you needed
>> to at least have a check that used t->encap_hlen here in order to
>> avoid overflowing the buffer or exceeding skb_headroom, or am I
>> missing something?
>>
> Sorry, you're probably right. max_headroom seems to be an absolute
> value. mtu being calculated seems relative to what is in skbuff
> already.

The second invocation of max_headroom is an absolute value. The first
one is used to measure if there is enough space for the headers we
will need to add. My thought is that is why we need encap_hlen to be
added to the first case so we make sure there is enough room for the
UDP header if one is present plus the IPv6 header.

Also I just found another issue in this patch. In ip6_tnl_dev_setup
you can probably just drop all references to "t" since you only assign
the pointer but you never actually access it. I only noticed because
I was looking at adding support for TSO to the tunnel itself.

- Alex
Re: task_diag: add a new interface to get information about processes
On Wed, May 04, 2016 at 08:39:51PM -0700, Andy Lutomirski wrote:
>
> Linus, this is Yet Another Credential Fuckup, except that it hasn't
> happened yet, so it's okay. The tl;dr is that Andrey wants to add an
> interface to ask a pidns some questions, and netlink looks natural,
> except that using netlink sockets to interrogate a pidns seems rather
> problematic. I would also love to see a decent interface for
> interrogating user namespaces, and again, netlink would be great,
> except that it's a socket and makes no sense in this context.
>
> Netlink had, and possibly still has, tons of serious security bugs
> involving code checking send() callers' creds. I found and fixed a
> few a couple years ago. To reiterate once again, send() CANNOT use
> caller creds safely. (I feel like I say this once every few weeks.
> It's getting old.)
>
> I realize that it's convenient to use a socket as a context to keep
> state between syscalls, but it has some annoying side effects:
>
> - It makes people want to rely on send()'s caller's creds.
>
> - It's miserable in combination with seccomp.
>
> - It doesn't play nicely with namespaces.
>
> - It makes me wonder why things like task_diag, which have nothing to
> do with networking, seem to get tangled up with networking.
>
>
> Would it be worth considering adding a parallel interface, using it
> for new things, and slowly migrating old use cases over?
>
> int issue_kernel_command(int ns, int command, const struct iovec *iov,
> int iovcnt, int flags);
>
> ns is an actual namespace fd or:
>
> KERNEL_COMMAND_CURRENT_NETNS
> KERNEL_COMMAND_CURRENT_PIDNS
> etc, or a special one:
> KERNEL_COMMAND_GLOBAL. KERNEL_COMMAND_GLOBAL can't be used in a
> non-root namespace.

A request can depend on a few namespaces. For example, we can request
credentials for a specified task. In this case we may want to specify
both a pid namespace and a user namespace.

>
> KERNEL_COMMAND_GLOBAL works even for namespaced things, if the
> relevant current ns is the init namespace. (This feature is optional,
> but it would allow gradually namespacing global things.)
>
> command is an enumerated command. Each command implies a namespace
> type, and, if you feed this thing the wrong namespace type, you get
> EINVAL. The high bit of command indicates whether it's read-only
> command.
>
> iov gives a command in the format expected, which, for the most part,
> would be a netlink message.
>
> The return value is an fd that you can call read/readv on to read the
> response. It's not a socket (or at least you can't do normal socket
> operations on it if it is a socket behind the scenes). The
> implementation of read() promises *not* to look at caller creds. The
> returned fd is unconditionally cloexec -- it's 2016 already. Sheesh.
>
> When you've read all the data, all you can do is close the fd. You
> can't issue another command on the same fd. You also can't call
> write() or send() on the fd unless someone has a good reason why you
> should be able to and why it's safe. You can't issue another command
> on the same fd.
>
>
> I imagine that the implementation could re-use a bunch of netlink code
> under the hood.

I agree with this interface. I would be interested to hear an opinion
from the other side. Stephen, could you share your comments about these
netlink issues and this new interface?

Thanks,
Andrew

>
>
> --Andy
Re: [PATCH v6 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Mon, May 16, 2016 at 3:25 PM, Alexander Duyck wrote:
> On Mon, May 16, 2016 at 2:33 PM, Tom Herbert wrote:
>> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
>> for getting encap hlen, setting up encap on a tunnel, performing
>> encapsulation operation.
>>
>> Signed-off-by: Tom Herbert
>> ---
>> include/net/ip6_tunnel.h | 58
>> net/ipv4/ip_tunnel_core.c | 5 +++
>> net/ipv6/ip6_tunnel.c | 85 ++-
>> 3 files changed, 139 insertions(+), 9 deletions(-)
>>
>
> So it looks like you completely dropped the two spots that were
> updating mtu and max_headroom with the t->hlen. I thought you needed
> to at least have a check that used t->encap_hlen here in order to
> avoid overflowing the buffer or exceeding skb_headroom, or am I
> missing something?
>
Sorry, you're probably right. max_headroom seems to be an absolute
value. mtu being calculated seems relative to what is in skbuff
already.

Tom

> - Alex
[PATCH] net: don't lose features in netdev_add_tso_features()
The goal of netdev_add_tso_features() is to enable all TSO features but
it unintentionally loses NETIF_F_ALL_FOR_ALL features. This is because
the netdev_increment_features() it calls clears any NETIF_F_ALL_FOR_ALL
bits that aren't included in the incremental features and none of them
are included in NETIF_F_ALL_TSO.

The behavior can be seen by enabling tx-nocache-copy on the slaves and
noticing the feature remains off at the master.

Fix this by including NETIF_F_ALL_FOR_ALL in the incremental features.

Signed-off-by: Dave Platt
Signed-off-by: Dimitris Michailidis
---
 include/linux/netdevice.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c2f5112..da45388 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3978,7 +3978,12 @@ netdev_features_t netdev_increment_features(netdev_features_t all,
 static inline netdev_features_t netdev_add_tso_features(netdev_features_t features,
 							netdev_features_t mask)
 {
-	return netdev_increment_features(features, NETIF_F_ALL_TSO, mask);
+	/* OR in NETIF_F_ALL_FOR_ALL to preserve any of its bits already present
+	 * in features
+	 */
+	return netdev_increment_features(features,
+					 NETIF_F_ALL_TSO | NETIF_F_ALL_FOR_ALL,
+					 mask);
 }
 
 int __netdev_update_features(struct net_device *dev);
-- 
2.8.0.rc3.226.g39d4020
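The bit-clearing behaviour described in this patch can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the MODEL_* bit values are made up for illustration, and model_increment_features() is a simplified stand-in that keeps just the two mask operations relevant here, not the full netdev_increment_features() implementation.

```c
#include <stdint.h>

/* Hypothetical bit assignments; the real NETIF_F_* values differ. */
#define MODEL_F_TSO		(UINT64_C(1) << 0)	/* stands in for NETIF_F_ALL_TSO */
#define MODEL_F_NOCACHE_COPY	(UINT64_C(1) << 1)	/* stands in for tx-nocache-copy */
#define MODEL_ONE_FOR_ALL	MODEL_F_TSO
#define MODEL_ALL_FOR_ALL	MODEL_F_NOCACHE_COPY

/* Simplified model: ONE_FOR_ALL bits are added when "one" offers them;
 * ALL_FOR_ALL bits survive only if "one" has them as well. */
static uint64_t model_increment_features(uint64_t all, uint64_t one,
					 uint64_t mask)
{
	all |= one & MODEL_ONE_FOR_ALL & mask;
	all &= one | ~MODEL_ALL_FOR_ALL;
	return all;
}

/* Behaviour before the patch: only the TSO bits are passed as "one". */
static uint64_t model_add_tso_old(uint64_t features, uint64_t mask)
{
	return model_increment_features(features, MODEL_F_TSO, mask);
}

/* Behaviour after the patch: ALL_FOR_ALL bits are ORed into "one",
 * so any of them already present in features are preserved. */
static uint64_t model_add_tso_new(uint64_t features, uint64_t mask)
{
	return model_increment_features(features,
					MODEL_F_TSO | MODEL_ALL_FOR_ALL, mask);
}
```

Starting from a feature set that contains only the nocache-copy bit, the old helper returns a set with TSO added but nocache-copy cleared, while the new helper returns both bits.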
[PATCH 1/2] net: ethernet: fec-mpc52xx: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure net_device already contain such pointer. So we can remove the pointer phydev in the private structure, and update the driver to use the one contained in struct net_device. Signed-off-by: Philippe Reynes--- drivers/net/ethernet/freescale/fec_mpc52xx.c | 43 -- 1 files changed, 20 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/freescale/fec_mpc52xx.c b/drivers/net/ethernet/freescale/fec_mpc52xx.c index f444714..bcf0600 100644 --- a/drivers/net/ethernet/freescale/fec_mpc52xx.c +++ b/drivers/net/ethernet/freescale/fec_mpc52xx.c @@ -66,7 +66,6 @@ struct mpc52xx_fec_priv { /* MDIO link details */ unsigned int mdio_speed; struct device_node *phy_node; - struct phy_device *phydev; enum phy_state link; int seven_wire_mode; }; @@ -165,7 +164,7 @@ static int mpc52xx_fec_alloc_rx_buffers(struct net_device *dev, struct bcom_task static void mpc52xx_fec_adjust_link(struct net_device *dev) { struct mpc52xx_fec_priv *priv = netdev_priv(dev); - struct phy_device *phydev = priv->phydev; + struct phy_device *phydev = dev->phydev; int new_state = 0; if (phydev->link != PHY_DOWN) { @@ -215,16 +214,17 @@ static void mpc52xx_fec_adjust_link(struct net_device *dev) static int mpc52xx_fec_open(struct net_device *dev) { struct mpc52xx_fec_priv *priv = netdev_priv(dev); + struct phy_device *phydev = NULL; int err = -EBUSY; if (priv->phy_node) { - priv->phydev = of_phy_connect(priv->ndev, priv->phy_node, - mpc52xx_fec_adjust_link, 0, 0); - if (!priv->phydev) { + phydev = of_phy_connect(priv->ndev, priv->phy_node, + mpc52xx_fec_adjust_link, 0, 0); + if (!phydev) { dev_err(>dev, "of_phy_connect failed\n"); return -ENODEV; } - phy_start(priv->phydev); + phy_start(phydev); } if (request_irq(dev->irq, mpc52xx_fec_interrupt, IRQF_SHARED, @@ -268,10 +268,9 @@ static int mpc52xx_fec_open(struct net_device *dev) free_ctrl_irq: free_irq(dev->irq, dev); free_phy: - if (priv->phydev) { - phy_stop(priv->phydev); - 
phy_disconnect(priv->phydev); - priv->phydev = NULL; + if (phydev) { + phy_stop(phydev); + phy_disconnect(phydev); } return err; @@ -280,6 +279,7 @@ static int mpc52xx_fec_open(struct net_device *dev) static int mpc52xx_fec_close(struct net_device *dev) { struct mpc52xx_fec_priv *priv = netdev_priv(dev); + struct phy_device *phydev = dev->phydev; netif_stop_queue(dev); @@ -291,11 +291,10 @@ static int mpc52xx_fec_close(struct net_device *dev) free_irq(priv->r_irq, dev); free_irq(priv->t_irq, dev); - if (priv->phydev) { + if (phydev) { /* power down phy */ - phy_stop(priv->phydev); - phy_disconnect(priv->phydev); - priv->phydev = NULL; + phy_stop(phydev); + phy_disconnect(phydev); } return 0; @@ -766,10 +765,9 @@ static void mpc52xx_fec_reset(struct net_device *dev) static int mpc52xx_fec_get_ksettings(struct net_device *dev, struct ethtool_link_ksettings *cmd) { - struct mpc52xx_fec_priv *priv = netdev_priv(dev); - struct phy_device *phydev = priv->phydev; + struct phy_device *phydev = dev->phydev; - if (!priv->phydev) + if (!phydev) return -ENODEV; return phy_ethtool_ksettings_get(phydev, cmd); @@ -778,10 +776,9 @@ static int mpc52xx_fec_get_ksettings(struct net_device *dev, static int mpc52xx_fec_set_ksettings(struct net_device *dev, const struct ethtool_link_ksettings *cmd) { - struct mpc52xx_fec_priv *priv = netdev_priv(dev); - struct phy_device *phydev = priv->phydev; + struct phy_device *phydev = dev->phydev; - if (!priv->phydev) + if (!phydev) return -ENODEV; return phy_ethtool_ksettings_set(phydev, cmd); @@ -811,12 +808,12 @@ static const struct ethtool_ops mpc52xx_fec_ethtool_ops = { static int mpc52xx_fec_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) { - struct mpc52xx_fec_priv *priv = netdev_priv(dev); + struct phy_device *phydev = dev->phydev; - if (!priv->phydev) + if (!phydev) return -ENOTSUPP; - return phy_mii_ioctl(priv->phydev, rq, cmd); + return phy_mii_ioctl(phydev, rq, cmd); } static const struct net_device_ops mpc52xx_fec_netdev_ops 
= { -- 1.7.4.4
[PATCH 2/2] net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/freescale/fec_mpc52xx.c | 26 ++
 1 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_mpc52xx.c b/drivers/net/ethernet/freescale/fec_mpc52xx.c
index bcf0600..446ae9d 100644
--- a/drivers/net/ethernet/freescale/fec_mpc52xx.c
+++ b/drivers/net/ethernet/freescale/fec_mpc52xx.c
@@ -762,28 +762,6 @@ static void mpc52xx_fec_reset(struct net_device *dev)
 
 /* ethtool interface */
 
-static int mpc52xx_fec_get_ksettings(struct net_device *dev,
-				     struct ethtool_link_ksettings *cmd)
-{
-	struct phy_device *phydev = dev->phydev;
-
-	if (!phydev)
-		return -ENODEV;
-
-	return phy_ethtool_ksettings_get(phydev, cmd);
-}
-
-static int mpc52xx_fec_set_ksettings(struct net_device *dev,
-				     const struct ethtool_link_ksettings *cmd)
-{
-	struct phy_device *phydev = dev->phydev;
-
-	if (!phydev)
-		return -ENODEV;
-
-	return phy_ethtool_ksettings_set(phydev, cmd);
-}
-
 static u32 mpc52xx_fec_get_msglevel(struct net_device *dev)
 {
 	struct mpc52xx_fec_priv *priv = netdev_priv(dev);
@@ -801,8 +779,8 @@ static const struct ethtool_ops mpc52xx_fec_ethtool_ops = {
 	.get_msglevel = mpc52xx_fec_get_msglevel,
 	.set_msglevel = mpc52xx_fec_set_msglevel,
 	.get_ts_info = ethtool_op_get_ts_info,
-	.get_link_ksettings = mpc52xx_fec_get_ksettings,
-	.set_link_ksettings = mpc52xx_fec_set_ksettings,
+	.get_link_ksettings = phy_ethtool_get_link_ksettings,
+	.set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
 
 
-- 
1.7.4.4
RE: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if TIMINCA is zero
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Denys Vlasenko
> Sent: Friday, May 6, 2016 12:42 PM
> To: Kirsher, Jeffrey T
> Cc: intel-wired-...@lists.osuosl.org; Denys Vlasenko; LKML;
> netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] e1000e: prevent division by zero if
> TIMINCA is zero
>
> Users report that under VMWare, er32(TIMINCA) returns zero.
> This causes division by zero at init time as follows:
>
> ==>	incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
> 	for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
> 		/* latch SYSTIMH on read of SYSTIML */
> 		systim_next = (cycle_t)er32(SYSTIML);
> 		systim_next |= (cycle_t)er32(SYSTIMH) << 32;
>
> 		time_delta = systim_next - systim;
> 		temp = time_delta;
>
> 		rem = do_div(temp, incvalue);
>
> This change makes kernel survive this, and users report that
> NIC does work after this change.
>
> Since on real hardware incvalue is never zero, this should not affect
> real hardware use case.
>
> Signed-off-by: Denys Vlasenko
> CC: Jeff Kirsher
> CC: "Ruinskiy, Dima"
> CC: intel-wired-...@lists.osuosl.org
> CC: netdev@vger.kernel.org
> CC: LKML
> ---
> drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)

As Mark Rustad pointed out, I recall this was earlier rejected as something
that is a VMWare error and it should be fixed there so that existing VMs
will start working without installing a new driver. Having said that, it
does not seem to be causing any harm in my testing, so...

Tested-by: Aaron Brown
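The patch body itself is not quoted in this message, so the following is only a sketch of one possible way to guard a remainder computation against a zero divisor, written as plain userspace C. The function name and the "valid" out-parameter are hypothetical and are not taken from the e1000e driver.

```c
#include <stdint.h>

/* Hypothetical stand-in for the "rem = do_div(temp, incvalue)" step.
 * When incvalue is zero (as reported for TIMINCA under VMware), skip the
 * division and tell the caller that no usable remainder exists. */
static uint64_t model_systim_remainder(uint64_t time_delta, uint32_t incvalue,
				       int *valid)
{
	if (incvalue == 0) {
		*valid = 0;	/* dividing would trap; report "no result" */
		return 0;
	}

	*valid = 1;
	return time_delta % incvalue;
}
```

A caller in this model would check valid before using the remainder; on real hardware, where incvalue is stated to be non-zero, the guarded path behaves exactly like the unguarded division.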
Re: [PATCH v6 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Mon, May 16, 2016 at 2:33 PM, Tom Herbert wrote:
> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
> for getting encap hlen, setting up encap on a tunnel, performing
> encapsulation operation.
>
> Signed-off-by: Tom Herbert
> ---
> include/net/ip6_tunnel.h | 58
> net/ipv4/ip_tunnel_core.c | 5 +++
> net/ipv6/ip6_tunnel.c | 85 ++-
> 3 files changed, 139 insertions(+), 9 deletions(-)
>

So it looks like you completely dropped the two spots that were
updating mtu and max_headroom with the t->hlen. I thought you needed
to at least have a check that used t->encap_hlen here in order to
avoid overflowing the buffer or exceeding skb_headroom, or am I
missing something?

- Alex
[Patch net] net_sched: close another race condition in tcf_mirred_release()
We saw the following extra refcount release on veth device:

  kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1

Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:

CPU0:
tcf_mirred_release(): (in RCU callback)
	struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);

CPU1:
mirred_device_event():
	spin_lock_bh(&mirred_list_lock);
	list_for_each_entry(m, &mirred_list, tcfm_list) {
		if (rcu_access_pointer(m->tcfm_dev) == dev) {
			dev_put(dev);
			/* Note : no rcu grace period necessary, as
			 * net_device are already rcu protected.
			 */
			RCU_INIT_POINTER(m->tcfm_dev, NULL);
		}
	}
	spin_unlock_bh(&mirred_list_lock);

CPU0:
tcf_mirred_release():
	spin_lock_bh(&mirred_list_lock);
	list_del(&m->tcfm_list);
	spin_unlock_bh(&mirred_list_lock);
	if (dev)		// < Still refers to the old m->tcfm_dev
		dev_put(dev);	// < dev_put() is called on it again

The action init code path is good because it is impossible to modify
an action that is being removed.

So, fix this by moving everything under the spinlock.

Fixes: 2ee22a90c7af ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b850635 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
---
 net/sched/act_mirred.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 8f3948d..78db6d4 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -36,14 +36,15 @@ static DEFINE_SPINLOCK(mirred_list_lock);
 static void tcf_mirred_release(struct tc_action *a, int bind)
 {
 	struct tcf_mirred *m = to_mirred(a);
-	struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
+	struct net_device *dev;
 
 	/* We could be called either in a RCU callback or with RTNL lock held. */
 	spin_lock_bh(&mirred_list_lock);
 	list_del(&m->tcfm_list);
-	spin_unlock_bh(&mirred_list_lock);
+	dev = rcu_dereference_protected(m->tcfm_dev, 1);
 	if (dev)
 		dev_put(dev);
+	spin_unlock_bh(&mirred_list_lock);
 }
 
 static const struct nla_policy mirred_policy[TCA_MIRRED_MAX + 1] = {
-- 
2.1.0
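The interleaving described in this commit message can be replayed deterministically in userspace. The sketch below is not kernel code: the fake_* types and functions are hypothetical, the two "CPUs" are simply run in a fixed order on one thread, and the lock is implied by that ordering rather than taken.

```c
#include <stddef.h>

struct fake_dev { int refcnt; };		/* stand-in for net_device */
struct fake_mirred { struct fake_dev *dev; };	/* stand-in for tcf_mirred */

static void fake_dev_put(struct fake_dev *d)
{
	d->refcnt--;
}

/* Models mirred_device_event(): drop the action's reference and clear
 * the pointer (done under the list lock in the kernel). */
static void fake_device_event(struct fake_mirred *m, struct fake_dev *dev)
{
	if (m->dev == dev) {
		fake_dev_put(dev);
		m->dev = NULL;
	}
}

/* Old ordering: the release path samples m->dev before the event runs,
 * then puts the stale pointer afterwards. */
static int fake_release_old_order(void)
{
	struct fake_dev dev = { .refcnt = 1 };
	struct fake_mirred m = { .dev = &dev };
	struct fake_dev *snap = m.dev;	/* CPU0: read outside the lock */

	fake_device_event(&m, &dev);	/* CPU1: first put, pointer cleared */
	if (snap)
		fake_dev_put(snap);	/* CPU0: second put on the same device */
	return dev.refcnt;
}

/* New ordering: the pointer is read only once the lock is held, so the
 * release path observes the NULL written by the event handler. */
static int fake_release_new_order(void)
{
	struct fake_dev dev = { .refcnt = 1 };
	struct fake_mirred m = { .dev = &dev };
	struct fake_dev *snap;

	fake_device_event(&m, &dev);	/* CPU1 ran first under the lock */
	snap = m.dev;			/* CPU0: read under the lock, NULL */
	if (snap)
		fake_dev_put(snap);
	return dev.refcnt;
}
```

With a single reference held by the action, the old ordering ends at a count of -1, matching the "Usage count = -1" message, and the new ordering ends at 0.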
Re: [PATCH net-next] bpf, doc: fix typo on bpf_asm descriptions
On Mon, May 16, 2016 at 11:06:53PM +0200, Daniel Borkmann wrote:
> Fix description of some of the bpf_asm tool related jump instructions
> and generally move them to format A k.
>
> Reported-by: Sebastian Amend
> Signed-off-by: Daniel Borkmann

Acked-by: Alexei Starovoitov
[PATCH v6 net-next 01/14] gso: Remove arbitrary checks for unsupported GSO
In several gso_segment functions there are checks of gso_type against a seemingly arbitrary list of SKB_GSO_* flags. This seems like an attempt to identify unsupported GSO types, but since the stack is the one that set these GSO types in the first place this seems unnecessary to do. If a combination isn't valid in the first place that stack should not allow setting it. This is a code simplication especially for add new GSO types. Signed-off-by: Tom Herbert--- net/ipv4/af_inet.c | 18 -- net/ipv4/gre_offload.c | 14 -- net/ipv4/tcp_offload.c | 19 --- net/ipv4/udp_offload.c | 10 -- net/ipv6/ip6_offload.c | 18 -- net/ipv6/udp_offload.c | 13 - net/mpls/mpls_gso.c| 11 +-- 7 files changed, 1 insertion(+), 102 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 2e6e65f..7f08d45 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1205,24 +1205,6 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb, int ihl; int id; - if (unlikely(skb_shinfo(skb)->gso_type & -~(SKB_GSO_TCPV4 | - SKB_GSO_UDP | - SKB_GSO_DODGY | - SKB_GSO_TCP_ECN | - SKB_GSO_GRE | - SKB_GSO_GRE_CSUM | - SKB_GSO_IPIP | - SKB_GSO_SIT | - SKB_GSO_TCPV6 | - SKB_GSO_UDP_TUNNEL | - SKB_GSO_UDP_TUNNEL_CSUM | - SKB_GSO_TCP_FIXEDID | - SKB_GSO_TUNNEL_REMCSUM | - SKB_GSO_PARTIAL | - 0))) - goto out; - skb_reset_network_header(skb); nhoff = skb_network_header(skb) - skb_mac_header(skb); if (unlikely(!pskb_may_pull(skb, sizeof(*iph diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index e88190a..ecd1e09 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -26,20 +26,6 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb, int gre_offset, outer_hlen; bool need_csum, ufo; - if (unlikely(skb_shinfo(skb)->gso_type & - ~(SKB_GSO_TCPV4 | - SKB_GSO_TCPV6 | - SKB_GSO_UDP | - SKB_GSO_DODGY | - SKB_GSO_TCP_ECN | - SKB_GSO_TCP_FIXEDID | - SKB_GSO_GRE | - SKB_GSO_GRE_CSUM | - SKB_GSO_IPIP | - SKB_GSO_SIT | - SKB_GSO_PARTIAL))) - goto out; - if 
(!skb->encapsulation) goto out; diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index 02737b6..5c59649 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -83,25 +83,6 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb, if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { /* Packet is from an untrusted source, reset gso_segs. */ - int type = skb_shinfo(skb)->gso_type; - - if (unlikely(type & -~(SKB_GSO_TCPV4 | - SKB_GSO_DODGY | - SKB_GSO_TCP_ECN | - SKB_GSO_TCP_FIXEDID | - SKB_GSO_TCPV6 | - SKB_GSO_GRE | - SKB_GSO_GRE_CSUM | - SKB_GSO_IPIP | - SKB_GSO_SIT | - SKB_GSO_UDP_TUNNEL | - SKB_GSO_UDP_TUNNEL_CSUM | - SKB_GSO_TUNNEL_REMCSUM | - 0) || -!(type & (SKB_GSO_TCPV4 | - SKB_GSO_TCPV6 - goto out; skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss); diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 6b7459c..81f253b 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -209,16 +209,6 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { /* Packet is from an untrusted source, reset gso_segs. */ - int type = skb_shinfo(skb)->gso_type; - - if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY | - SKB_GSO_UDP_TUNNEL | - SKB_GSO_UDP_TUNNEL_CSUM | - SKB_GSO_TUNNEL_REMCSUM | - SKB_GSO_IPIP | -
[PATCH v6 net-next 10/14] fou: Add encap ops for IPv6 tunnels
This patch add a new fou6 module that provides encapsulation operations for IPv6. Signed-off-by: Tom Herbert--- include/net/fou.h | 2 +- net/ipv6/Makefile | 1 + net/ipv6/fou6.c | 140 ++ 3 files changed, 142 insertions(+), 1 deletion(-) create mode 100644 net/ipv6/fou6.c diff --git a/include/net/fou.h b/include/net/fou.h index 7d2fda2..f5cc691 100644 --- a/include/net/fou.h +++ b/include/net/fou.h @@ -9,7 +9,7 @@ #include size_t fou_encap_hlen(struct ip_tunnel_encap *e); -static size_t gue_encap_hlen(struct ip_tunnel_encap *e); +size_t gue_encap_hlen(struct ip_tunnel_encap *e); int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, u8 *protocol, __be16 *sport, int type); diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile index 5e9d6bf..7ec3129 100644 --- a/net/ipv6/Makefile +++ b/net/ipv6/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_IPV6_VTI) += ip6_vti.o obj-$(CONFIG_IPV6_SIT) += sit.o obj-$(CONFIG_IPV6_TUNNEL) += ip6_tunnel.o obj-$(CONFIG_IPV6_GRE) += ip6_gre.o +obj-$(CONFIG_NET_FOU) += fou6.o obj-y += addrconf_core.o exthdrs_core.o ip6_checksum.o ip6_icmp.o obj-$(CONFIG_INET) += output_core.o protocol.o $(ipv6-offload) diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c new file mode 100644 index 000..c972d0b --- /dev/null +++ b/net/ipv6/fou6.c @@ -0,0 +1,140 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static void fou6_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e, + struct flowi6 *fl6, u8 *protocol, __be16 sport) +{ + struct udphdr *uh; + + skb_push(skb, sizeof(struct udphdr)); + skb_reset_transport_header(skb); + + uh = udp_hdr(skb); + + uh->dest = e->dport; + uh->source = sport; + uh->len = htons(skb->len); + udp6_set_csum(!(e->flags & TUNNEL_ENCAP_FLAG_CSUM6), skb, + >saddr, >daddr, skb->len); + + *protocol = IPPROTO_UDP; +} + +int fou6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, struct flowi6 *fl6) 
+{ + __be16 sport; + int err; + int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ? + SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL; + + err = __fou_build_header(skb, e, protocol, , type); + if (err) + return err; + + fou6_build_udp(skb, e, fl6, protocol, sport); + + return 0; +} +EXPORT_SYMBOL(fou6_build_header); + +int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, struct flowi6 *fl6) +{ + __be16 sport; + int err; + int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM6 ? + SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL; + + err = __gue_build_header(skb, e, protocol, , type); + if (err) + return err; + + fou6_build_udp(skb, e, fl6, protocol, sport); + + return 0; +} +EXPORT_SYMBOL(gue6_build_header); + +#ifdef CONFIG_NET_FOU_IP_TUNNELS + +static const struct ip6_tnl_encap_ops fou_ip6tun_ops = { + .encap_hlen = fou_encap_hlen, + .build_header = fou6_build_header, +}; + +static const struct ip6_tnl_encap_ops gue_ip6tun_ops = { + .encap_hlen = gue_encap_hlen, + .build_header = gue6_build_header, +}; + +static int ip6_tnl_encap_add_fou_ops(void) +{ + int ret; + + ret = ip6_tnl_encap_add_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU); + if (ret < 0) { + pr_err("can't add fou6 ops\n"); + return ret; + } + + ret = ip6_tnl_encap_add_ops(_ip6tun_ops, TUNNEL_ENCAP_GUE); + if (ret < 0) { + pr_err("can't add gue6 ops\n"); + ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU); + return ret; + } + + return 0; +} + +static void ip6_tnl_encap_del_fou_ops(void) +{ + ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_FOU); + ip6_tnl_encap_del_ops(_ip6tun_ops, TUNNEL_ENCAP_GUE); +} + +#else + +static int ip6_tnl_encap_add_fou_ops(void) +{ + return 0; +} + +static void ip6_tnl_encap_del_fou_ops(void) +{ +} + +#endif + +static int __init fou6_init(void) +{ + int ret; + + ret = ip6_tnl_encap_add_fou_ops(); + + return ret; +} + +static void __exit fou6_fini(void) +{ + ip6_tnl_encap_del_fou_ops(); +} + +module_init(fou6_init); +module_exit(fou6_fini); +MODULE_AUTHOR("Tom Herbert 
"); +MODULE_LICENSE("GPL"); -- 2.8.0.rc2
[PATCH v6 net-next 11/14] ip6_gre: Add support for fou/gue encapsulation
Add netlink and setup for encapsulation Signed-off-by: Tom Herbert--- net/ipv6/ip6_gre.c | 79 +++--- 1 file changed, 75 insertions(+), 4 deletions(-) diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 4541fa5..6fb1b89 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -729,7 +729,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu) t->tun_hlen = gre_calc_hlen(t->parms.o_flags); - t->hlen = t->tun_hlen; + t->hlen = t->encap_hlen + t->tun_hlen; t_hlen = t->hlen + sizeof(struct ipv6hdr); @@ -1022,9 +1022,7 @@ static int ip6gre_tunnel_init_common(struct net_device *dev) } tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags); - - tunnel->hlen = tunnel->tun_hlen; - + tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen; t_hlen = tunnel->hlen + sizeof(struct ipv6hdr); dev->hard_header_len = LL_MAX_HEADER + t_hlen; @@ -1290,15 +1288,57 @@ static void ip6gre_tap_setup(struct net_device *dev) dev->priv_flags &= ~IFF_TX_SKB_SHARING; } +static bool ip6gre_netlink_encap_parms(struct nlattr *data[], + struct ip_tunnel_encap *ipencap) +{ + bool ret = false; + + memset(ipencap, 0, sizeof(*ipencap)); + + if (!data) + return ret; + + if (data[IFLA_GRE_ENCAP_TYPE]) { + ret = true; + ipencap->type = nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]); + } + + if (data[IFLA_GRE_ENCAP_FLAGS]) { + ret = true; + ipencap->flags = nla_get_u16(data[IFLA_GRE_ENCAP_FLAGS]); + } + + if (data[IFLA_GRE_ENCAP_SPORT]) { + ret = true; + ipencap->sport = nla_get_be16(data[IFLA_GRE_ENCAP_SPORT]); + } + + if (data[IFLA_GRE_ENCAP_DPORT]) { + ret = true; + ipencap->dport = nla_get_be16(data[IFLA_GRE_ENCAP_DPORT]); + } + + return ret; +} + static int ip6gre_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { struct ip6_tnl *nt; struct net *net = dev_net(dev); struct ip6gre_net *ign = net_generic(net, ip6gre_net_id); + struct ip_tunnel_encap ipencap; int err; nt = netdev_priv(dev); + + if (ip6gre_netlink_encap_parms(data, )) { + int 
err = ip6_tnl_encap_setup(nt, ); + + if (err < 0) + return err; + } + ip6gre_netlink_parms(data, >parms); if (ip6gre_tunnel_find(net, >parms, dev->type)) @@ -1345,10 +1385,18 @@ static int ip6gre_changelink(struct net_device *dev, struct nlattr *tb[], struct net *net = nt->net; struct ip6gre_net *ign = net_generic(net, ip6gre_net_id); struct __ip6_tnl_parm p; + struct ip_tunnel_encap ipencap; if (dev == ign->fb_tunnel_dev) return -EINVAL; + if (ip6gre_netlink_encap_parms(data, )) { + int err = ip6_tnl_encap_setup(nt, ); + + if (err < 0) + return err; + } + ip6gre_netlink_parms(data, ); t = ip6gre_tunnel_locate(net, , 0); @@ -1400,6 +1448,14 @@ static size_t ip6gre_get_size(const struct net_device *dev) nla_total_size(4) + /* IFLA_GRE_FLAGS */ nla_total_size(4) + + /* IFLA_GRE_ENCAP_TYPE */ + nla_total_size(2) + + /* IFLA_GRE_ENCAP_FLAGS */ + nla_total_size(2) + + /* IFLA_GRE_ENCAP_SPORT */ + nla_total_size(2) + + /* IFLA_GRE_ENCAP_DPORT */ + nla_total_size(2) + 0; } @@ -1422,6 +1478,17 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev) nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) || nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags)) goto nla_put_failure; + + if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE, + t->encap.type) || + nla_put_be16(skb, IFLA_GRE_ENCAP_SPORT, +t->encap.sport) || + nla_put_be16(skb, IFLA_GRE_ENCAP_DPORT, +t->encap.dport) || + nla_put_u16(skb, IFLA_GRE_ENCAP_FLAGS, + t->encap.flags)) + goto nla_put_failure; + return 0; nla_put_failure: @@ -1440,6 +1507,10 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = { [IFLA_GRE_ENCAP_LIMIT] = { .type = NLA_U8 }, [IFLA_GRE_FLOWINFO]= { .type = NLA_U32 }, [IFLA_GRE_FLAGS] = { .type = NLA_U32 }, + [IFLA_GRE_ENCAP_TYPE] = { .type = NLA_U16 }, + [IFLA_GRE_ENCAP_FLAGS] = { .type = NLA_U16 }, + [IFLA_GRE_ENCAP_SPORT] = { .type = NLA_U16 }, +
[PATCH v6 net-next 03/14] ipv6: Fix nexthdr for reinjection
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).

This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.

Signed-off-by: Tom Herbert
---
 net/ipv6/ip6_input.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f185cbc..d35dff2 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -236,6 +236,7 @@ resubmit:
 	nhoff = IP6CB(skb)->nhoff;
 	nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit_final:
 	raw = raw6_local_deliver(skb, nexthdr);
 	ipprot = rcu_dereference(inet6_protos[nexthdr]);
 	if (ipprot) {
@@ -263,10 +264,21 @@ resubmit:
 			goto discard;
 
 		ret = ipprot->handler(skb);
-		if (ret > 0)
-			goto resubmit;
-		else if (ret == 0)
+		if (ret > 0) {
+			if (ipprot->flags & INET6_PROTO_FINAL) {
+				/* Not an extension header, most likely UDP
+				 * encapsulation. Use return value as nexthdr
+				 * protocol not nhoff (which presumably is
+				 * not set by handler).
+				 */
+				nexthdr = ret;
+				goto resubmit_final;
+			} else {
+				goto resubmit;
+			}
+		} else if (ret == 0) {
 			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDELIVERS);
+		}
 	} else {
 		if (!raw) {
 			if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-- 
2.8.0.rc2
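The decision taken after the protocol handler returns can be written out as a small pure function. This is a sketch only: the model_* names and the MODEL_PROTO_FINAL value are hypothetical placeholders for illustration, and the function mirrors just the branch structure of the hunk above, not the rest of ip6_input_finish.

```c
/* Hypothetical flag bit standing in for INET6_PROTO_FINAL. */
#define MODEL_PROTO_FINAL 0x2u

enum model_action {
	MODEL_DELIVERED,	/* ret == 0: count the packet as delivered */
	MODEL_RESUBMIT_NHOFF,	/* ret > 0, not final: re-read nexthdr at nhoff */
	MODEL_RESUBMIT_FINAL,	/* ret > 0, final: ret itself is the nexthdr */
	MODEL_NO_ACTION		/* ret < 0: neither branch is taken */
};

/* Mirrors the branch structure added by the patch. *nexthdr is written
 * only in the "final protocol" case, where the handler's return value is
 * taken as the next protocol number. */
static enum model_action model_after_handler(int ret, unsigned int flags,
					      int *nexthdr)
{
	if (ret > 0) {
		if (flags & MODEL_PROTO_FINAL) {
			*nexthdr = ret;
			return MODEL_RESUBMIT_FINAL;
		}
		return MODEL_RESUBMIT_NHOFF;
	}

	if (ret == 0)
		return MODEL_DELIVERED;

	return MODEL_NO_ACTION;
}
```

For example, a final-protocol handler that returns 4 yields MODEL_RESUBMIT_FINAL with nexthdr set to 4, whereas the same return value from a non-final handler leaves nexthdr untouched and asks for a re-read at nhoff.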
[PATCH v6 net-next 14/14] ip4ip6: Support for GSO/GRO
Signed-off-by: Tom Herbert--- include/net/inet_common.h | 5 + net/ipv4/af_inet.c| 12 +++- net/ipv6/ip6_offload.c| 33 - net/ipv6/ip6_tunnel.c | 3 +++ 4 files changed, 47 insertions(+), 6 deletions(-) diff --git a/include/net/inet_common.h b/include/net/inet_common.h index 109e3ee..5d68342 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -39,6 +39,11 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len); +struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb); +int inet_gro_complete(struct sk_buff *skb, int nhoff); +struct sk_buff *inet_gso_segment(struct sk_buff *skb, +netdev_features_t features); + static inline void inet_ctl_sock_destroy(struct sock *sk) { if (sk) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 25040b1..377424e 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1192,8 +1192,8 @@ int inet_sk_rebuild_header(struct sock *sk) } EXPORT_SYMBOL(inet_sk_rebuild_header); -static struct sk_buff *inet_gso_segment(struct sk_buff *skb, - netdev_features_t features) +struct sk_buff *inet_gso_segment(struct sk_buff *skb, +netdev_features_t features) { bool udpfrag = false, fixedid = false, encap; struct sk_buff *segs = ERR_PTR(-EINVAL); @@ -1280,9 +1280,9 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb, out: return segs; } +EXPORT_SYMBOL(inet_gso_segment); -static struct sk_buff **inet_gro_receive(struct sk_buff **head, -struct sk_buff *skb) +struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb) { const struct net_offload *ops; struct sk_buff **pp = NULL; @@ -1398,6 +1398,7 @@ out: return pp; } +EXPORT_SYMBOL(inet_gro_receive); static struct sk_buff **ipip_gro_receive(struct sk_buff **head, struct sk_buff *skb) @@ -1449,7 +1450,7 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len) return -EINVAL; } -static int 
inet_gro_complete(struct sk_buff *skb, int nhoff) +int inet_gro_complete(struct sk_buff *skb, int nhoff) { __be16 newlen = htons(skb->len - nhoff); struct iphdr *iph = (struct iphdr *)(skb->data + nhoff); @@ -1479,6 +1480,7 @@ out_unlock: return err; } +EXPORT_SYMBOL(inet_gro_complete); static int ipip_gro_complete(struct sk_buff *skb, int nhoff) { diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 332d6a0..22e90e5 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -16,6 +16,7 @@ #include #include +#include #include "ip6_offload.h" @@ -268,6 +269,21 @@ static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head, return ipv6_gro_receive(head, skb); } +static struct sk_buff **ip4ip6_gro_receive(struct sk_buff **head, + struct sk_buff *skb) +{ + /* Common GRO receive for SIT and IP6IP6 */ + + if (NAPI_GRO_CB(skb)->encap_mark) { + NAPI_GRO_CB(skb)->flush = 1; + return NULL; + } + + NAPI_GRO_CB(skb)->encap_mark = 1; + + return inet_gro_receive(head, skb); +} + static int ipv6_gro_complete(struct sk_buff *skb, int nhoff) { const struct net_offload *ops; @@ -307,6 +323,13 @@ static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff) return ipv6_gro_complete(skb, nhoff); } +static int ip4ip6_gro_complete(struct sk_buff *skb, int nhoff) +{ + skb->encapsulation = 1; + skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; + return inet_gro_complete(skb, nhoff); +} + static struct packet_offload ipv6_packet_offload __read_mostly = { .type = cpu_to_be16(ETH_P_IPV6), .callbacks = { @@ -324,6 +347,14 @@ static const struct net_offload sit_offload = { }, }; +static const struct net_offload ip4ip6_offload = { + .callbacks = { + .gso_segment= inet_gso_segment, + .gro_receive= ip4ip6_gro_receive, + .gro_complete = ip4ip6_gro_complete, + }, +}; + static const struct net_offload ip6ip6_offload = { .callbacks = { .gso_segment= ipv6_gso_segment, @@ -331,7 +362,6 @@ static const struct net_offload ip6ip6_offload = { .gro_complete = ip6ip6_gro_complete, 
}, }; - static int __init ipv6_offload_init(void) { @@ -344,6 +374,7 @@ static int __init ipv6_offload_init(void) inet_add_offload(_offload, IPPROTO_IPV6); inet6_add_offload(_offload, IPPROTO_IPV6); + inet6_add_offload(_offload, IPPROTO_IPIP); return
[PATCH v6 net-next 06/14] fou: Call setup_udp_tunnel_sock
Use helper function to set up UDP tunnel related information for a fou
socket.

Signed-off-by: Tom Herbert
---
 net/ipv4/fou.c | 50 --
 1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index eeec7d6..6cbc725 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,31 +448,13 @@ static void fou_release(struct fou *fou)
 	kfree_rcu(fou, rcu);
 }
 
-static int fou_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-	udp_sk(sk)->encap_rcv = fou_udp_recv;
-	udp_sk(sk)->gro_receive = fou_gro_receive;
-	udp_sk(sk)->gro_complete = fou_gro_complete;
-	fou_from_sock(sk)->protocol = cfg->protocol;
-
-	return 0;
-}
-
-static int gue_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
-{
-	udp_sk(sk)->encap_rcv = gue_udp_recv;
-	udp_sk(sk)->gro_receive = gue_gro_receive;
-	udp_sk(sk)->gro_complete = gue_gro_complete;
-
-	return 0;
-}
-
 static int fou_create(struct net *net, struct fou_cfg *cfg,
 		      struct socket **sockp)
 {
 	struct socket *sock = NULL;
 	struct fou *fou = NULL;
 	struct sock *sk;
+	struct udp_tunnel_sock_cfg tunnel_cfg;
 	int err;
 
 	/* Open UDP socket */
@@ -491,33 +473,33 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
 	fou->flags = cfg->flags;
 	fou->port = cfg->udp_config.local_udp_port;
+	fou->type = cfg->type;
+	fou->sock = sock;
+
+	memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
+	tunnel_cfg.encap_type = 1;
+	tunnel_cfg.sk_user_data = fou;
+	tunnel_cfg.encap_destroy = NULL;
 
 	/* Initial for fou type */
 	switch (cfg->type) {
 	case FOU_ENCAP_DIRECT:
-		err = fou_encap_init(sk, fou, cfg);
-		if (err)
-			goto error;
+		tunnel_cfg.encap_rcv = fou_udp_recv;
+		tunnel_cfg.gro_receive = fou_gro_receive;
+		tunnel_cfg.gro_complete = fou_gro_complete;
+		fou->protocol = cfg->protocol;
 		break;
 	case FOU_ENCAP_GUE:
-		err = gue_encap_init(sk, fou, cfg);
-		if (err)
-			goto error;
+		tunnel_cfg.encap_rcv = gue_udp_recv;
+		tunnel_cfg.gro_receive = gue_gro_receive;
+		tunnel_cfg.gro_complete = gue_gro_complete;
 		break;
 	default:
 		err = -EINVAL;
 		goto error;
 	}
 
-	fou->type = cfg->type;
-
-	udp_sk(sk)->encap_type = 1;
-	udp_encap_enable();
-
-	sk->sk_user_data = fou;
-	fou->sock = sock;
-
-	inet_inc_convert_csum(sk);
+	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
 	sk->sk_allocation = GFP_ATOMIC;
 
-- 
2.8.0.rc2
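The shape of this change (zero a config structure, fill the per-type callbacks in a switch, then hand the whole thing to one setup helper) can be sketched in plain userspace C. Everything below is a hypothetical stand-in: `struct tunnel_cfg`, `fill_cfg()` and the two receive stubs are invented for illustration and only mirror the pattern, not the kernel's `udp_tunnel_sock_cfg`.

```c
#include <string.h>

/* Hypothetical userspace stand-ins; not the kernel's types. */
enum encap_kind { ENCAP_DIRECT = 1, ENCAP_GUE = 2 };

typedef int (*recv_fn)(const void *pkt);

struct tunnel_cfg {
	int encap_type;
	void *user_data;
	recv_fn encap_rcv;
};

static int direct_recv(const void *pkt) { (void)pkt; return 1; }
static int gue_recv(const void *pkt)    { (void)pkt; return 2; }

/* Fill the per-type callback into a zeroed config; -1 for unknown kinds. */
static int fill_cfg(struct tunnel_cfg *cfg, enum encap_kind kind, void *owner)
{
	memset(cfg, 0, sizeof(*cfg));
	cfg->encap_type = 1;
	cfg->user_data = owner;

	switch (kind) {
	case ENCAP_DIRECT:
		cfg->encap_rcv = direct_recv;
		break;
	case ENCAP_GUE:
		cfg->encap_rcv = gue_recv;
		break;
	default:
		return -1;
	}
	return 0;
}
```

Collecting the callbacks in one structure means the socket is touched in a single place, which is what lets the two `*_encap_init` helpers disappear in the diff.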
[PATCH v6 net-next 07/14] fou: Split out {fou,gue}_build_header
Create __fou_build_header and __gue_build_header. These implement the protocol generic parts of building the fou and gue header. fou_build_header and gue_build_header implement the IPv4 specific functions and call the __*_build_header functions. Signed-off-by: Tom Herbert--- include/net/fou.h | 8 net/ipv4/fou.c| 47 +-- 2 files changed, 41 insertions(+), 14 deletions(-) diff --git a/include/net/fou.h b/include/net/fou.h index 19b8a0c..7d2fda2 100644 --- a/include/net/fou.h +++ b/include/net/fou.h @@ -11,9 +11,9 @@ size_t fou_encap_hlen(struct ip_tunnel_encap *e); static size_t gue_encap_hlen(struct ip_tunnel_encap *e); -int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, -u8 *protocol, struct flowi4 *fl4); -int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, -u8 *protocol, struct flowi4 *fl4); +int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, __be16 *sport, int type); +int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, __be16 *sport, int type); #endif diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index 6cbc725..f4f2ddd 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -780,6 +780,22 @@ static void fou_build_udp(struct sk_buff *skb, struct ip_tunnel_encap *e, *protocol = IPPROTO_UDP; } +int __fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, __be16 *sport, int type) +{ + int err; + + err = iptunnel_handle_offloads(skb, type); + if (err) + return err; + + *sport = e->sport ? 
: udp_flow_src_port(dev_net(skb->dev), + skb, 0, 0, false); + + return 0; +} +EXPORT_SYMBOL(__fou_build_header); + int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, u8 *protocol, struct flowi4 *fl4) { @@ -788,26 +804,21 @@ int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, __be16 sport; int err; - err = iptunnel_handle_offloads(skb, type); + err = __fou_build_header(skb, e, protocol, , type); if (err) return err; - sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev), - skb, 0, 0, false); fou_build_udp(skb, e, fl4, protocol, sport); return 0; } EXPORT_SYMBOL(fou_build_header); -int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, -u8 *protocol, struct flowi4 *fl4) +int __gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, __be16 *sport, int type) { - int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM : - SKB_GSO_UDP_TUNNEL; struct guehdr *guehdr; size_t hdrlen, optlen = 0; - __be16 sport; void *data; bool need_priv = false; int err; @@ -826,8 +837,8 @@ int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, return err; /* Get source port (based on flow hash) before skb_push */ - sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev), - skb, 0, 0, false); + *sport = e->sport ? : udp_flow_src_port(dev_net(skb->dev), + skb, 0, 0, false); hdrlen = sizeof(struct guehdr) + optlen; @@ -872,6 +883,22 @@ int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, } + return 0; +} +EXPORT_SYMBOL(__gue_build_header); + +int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e, +u8 *protocol, struct flowi4 *fl4) +{ + int type = e->flags & TUNNEL_ENCAP_FLAG_CSUM ? SKB_GSO_UDP_TUNNEL_CSUM : + SKB_GSO_UDP_TUNNEL; + __be16 sport; + int err; + + err = __gue_build_header(skb, e, protocol, , type); + if (err) + return err; + fou_build_udp(skb, e, fl4, protocol, sport); return 0; -- 2.8.0.rc2
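The split described in this patch, a family-independent step that picks the source port plus a thin IPv4 wrapper that writes the UDP header, can be modelled roughly as follows. This is a userspace sketch under stated assumptions: ports are kept in host byte order, the 32768..61000 range and the hash scaling in `port_from_hash()` are illustrative choices, and all names are hypothetical. The patch itself uses the GNU `?:` shorthand; the sketch spells the conditional out in standard C.

```c
#include <stdint.h>

/* Hypothetical model: ports are in host byte order for simplicity. */
struct encap { uint16_t sport; uint16_t dport; };

/* Scale a 32-bit flow hash into [min, max); an assumed mapping. */
static uint16_t port_from_hash(uint32_t hash, uint16_t min, uint16_t max)
{
	return (uint16_t)((((uint64_t)hash * (uint32_t)(max - min)) >> 32) + min);
}

/* Family-independent step: use the configured port, else derive one. */
static int generic_build(const struct encap *e, uint32_t flow_hash,
			 uint16_t *sport)
{
	*sport = e->sport ? e->sport : port_from_hash(flow_hash, 32768, 61000);
	return 0;
}

struct udp4_hdr { uint16_t source, dest; };

/* IPv4-specific wrapper: call the generic step, then fill the header. */
static int v4_build(const struct encap *e, uint32_t flow_hash,
		    struct udp4_hdr *uh)
{
	uint16_t sport;
	int err = generic_build(e, flow_hash, &sport);

	if (err)
		return err;
	uh->source = sport;
	uh->dest = e->dport;
	return 0;
}
```

An IPv6 wrapper would reuse `generic_build()` unchanged and differ only in how it emits the header.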
[PATCH v6 net-next 08/14] fou: Support IPv6 in fou
This patch adds receive path support for IPv6 with fou. - Add address family to fou structure for open sockets. This supports AF_INET and AF_INET6. Lookups for fou ports are performed on both the port number and family. - In fou and gue receive adjust tot_len in IPv4 header or payload_len based on address family. - Allow AF_INET6 in FOU_ATTR_AF netlink attribute. Signed-off-by: Tom Herbert--- net/ipv4/fou.c | 47 +++ 1 file changed, 35 insertions(+), 12 deletions(-) diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index f4f2ddd..5f9207c 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -21,6 +21,7 @@ struct fou { u8 protocol; u8 flags; __be16 port; + u8 family; u16 type; struct list_head list; struct rcu_head rcu; @@ -47,14 +48,17 @@ static inline struct fou *fou_from_sock(struct sock *sk) return sk->sk_user_data; } -static int fou_recv_pull(struct sk_buff *skb, size_t len) +static int fou_recv_pull(struct sk_buff *skb, struct fou *fou, size_t len) { - struct iphdr *iph = ip_hdr(skb); - /* Remove 'len' bytes from the packet (UDP header and * FOU header if present). 
*/ - iph->tot_len = htons(ntohs(iph->tot_len) - len); + if (fou->family == AF_INET) + ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len); + else + ipv6_hdr(skb)->payload_len = + htons(ntohs(ipv6_hdr(skb)->payload_len) - len); + __skb_pull(skb, len); skb_postpull_rcsum(skb, udp_hdr(skb), len); skb_reset_transport_header(skb); @@ -68,7 +72,7 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb) if (!fou) return 1; - if (fou_recv_pull(skb, sizeof(struct udphdr))) + if (fou_recv_pull(skb, fou, sizeof(struct udphdr))) goto drop; return -fou->protocol; @@ -141,7 +145,11 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff *skb) hdrlen = sizeof(struct guehdr) + optlen; - ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len); + if (fou->family == AF_INET) + ip_hdr(skb)->tot_len = htons(ntohs(ip_hdr(skb)->tot_len) - len); + else + ipv6_hdr(skb)->payload_len = + htons(ntohs(ipv6_hdr(skb)->payload_len) - len); /* Pull csum through the guehdr now . This can be used if * there is a remote checksum offload. 
@@ -426,7 +434,8 @@ static int fou_add_to_port_list(struct net *net, struct fou *fou) mutex_lock(>fou_lock); list_for_each_entry(fout, >fou_list, list) { - if (fou->port == fout->port) { + if (fou->port == fout->port && + fou->family == fout->family) { mutex_unlock(>fou_lock); return -EALREADY; } @@ -471,8 +480,9 @@ static int fou_create(struct net *net, struct fou_cfg *cfg, sk = sock->sk; - fou->flags = cfg->flags; fou->port = cfg->udp_config.local_udp_port; + fou->family = cfg->udp_config.family; + fou->flags = cfg->flags; fou->type = cfg->type; fou->sock = sock; @@ -524,12 +534,13 @@ static int fou_destroy(struct net *net, struct fou_cfg *cfg) { struct fou_net *fn = net_generic(net, fou_net_id); __be16 port = cfg->udp_config.local_udp_port; + u8 family = cfg->udp_config.family; int err = -EINVAL; struct fou *fou; mutex_lock(>fou_lock); list_for_each_entry(fou, >fou_list, list) { - if (fou->port == port) { + if (fou->port == port && fou->family == family) { fou_release(fou); err = 0; break; @@ -567,8 +578,15 @@ static int parse_nl_config(struct genl_info *info, if (info->attrs[FOU_ATTR_AF]) { u8 family = nla_get_u8(info->attrs[FOU_ATTR_AF]); - if (family != AF_INET) - return -EINVAL; + switch (family) { + case AF_INET: + break; + case AF_INET6: + cfg->udp_config.ipv6_v6only = 1; + break; + default: + return -EAFNOSUPPORT; + } cfg->udp_config.family = family; } @@ -659,6 +677,7 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info) struct fou_cfg cfg; struct fou *fout; __be16 port; + u8 family; int ret; ret = parse_nl_config(info, ); @@ -668,6 +687,10 @@ static int fou_nl_cmd_get_port(struct sk_buff *skb, struct genl_info *info) if (port == 0) return -EINVAL; + family = cfg.udp_config.family; + if (family !=
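The two behavioural points in this patch, matching open ports on (port, family) and adjusting the right length field per family, can be illustrated with a small userspace model. The names and the `FAM_V4`/`FAM_V6` constants are hypothetical, and lengths are kept in host byte order to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

#define FAM_V4 2	/* hypothetical stand-ins for AF_INET / AF_INET6 */
#define FAM_V6 10

struct port_entry { uint16_t port; uint8_t family; };

/* Find an entry matching both port and family, or NULL. */
static const struct port_entry *find_entry(const struct port_entry *tab,
					   size_t n, uint16_t port,
					   uint8_t family)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].port == port && tab[i].family == family)
			return &tab[i];
	return NULL;
}

struct lengths { uint16_t v4_tot_len; uint16_t v6_payload_len; };

/* Remove 'len' bytes from whichever length field the family uses. */
static void shrink_len(struct lengths *l, uint8_t family, uint16_t len)
{
	if (family == FAM_V4)
		l->v4_tot_len = (uint16_t)(l->v4_tot_len - len);
	else
		l->v6_payload_len = (uint16_t)(l->v6_payload_len - len);
}
```

With the family in the key, the same port number can be open once for IPv4 and once for IPv6 without the second add being rejected as a duplicate.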
[PATCH v6 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions for getting encap hlen, setting up encap on a tunnel, performing encapsulation operation. Signed-off-by: Tom Herbert--- include/net/ip6_tunnel.h | 58 net/ipv4/ip_tunnel_core.c | 5 +++ net/ipv6/ip6_tunnel.c | 85 ++- 3 files changed, 139 insertions(+), 9 deletions(-) diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h index fb9e015..d325c81 100644 --- a/include/net/ip6_tunnel.h +++ b/include/net/ip6_tunnel.h @@ -52,10 +52,68 @@ struct ip6_tnl { __u32 o_seqno; /* The last output seqno */ int hlen; /* tun_hlen + encap_hlen */ int tun_hlen; /* Precalculated header length */ + int encap_hlen; /* Encap header length (FOU,GUE) */ + struct ip_tunnel_encap encap; int mlink; +}; +struct ip6_tnl_encap_ops { + size_t (*encap_hlen)(struct ip_tunnel_encap *e); + int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, struct flowi6 *fl6); }; +extern const struct ip6_tnl_encap_ops __rcu * + ip6tun_encaps[MAX_IPTUN_ENCAP_OPS]; + +int ip6_tnl_encap_add_ops(const struct ip6_tnl_encap_ops *ops, + unsigned int num); +int ip6_tnl_encap_del_ops(const struct ip6_tnl_encap_ops *ops, + unsigned int num); +int ip6_tnl_encap_setup(struct ip6_tnl *t, + struct ip_tunnel_encap *ipencap); + +static inline int ip6_encap_hlen(struct ip_tunnel_encap *e) +{ + const struct ip6_tnl_encap_ops *ops; + int hlen = -EINVAL; + + if (e->type == TUNNEL_ENCAP_NONE) + return 0; + + if (e->type >= MAX_IPTUN_ENCAP_OPS) + return -EINVAL; + + rcu_read_lock(); + ops = rcu_dereference(ip6tun_encaps[e->type]); + if (likely(ops && ops->encap_hlen)) + hlen = ops->encap_hlen(e); + rcu_read_unlock(); + + return hlen; +} + +static inline int ip6_tnl_encap(struct sk_buff *skb, struct ip6_tnl *t, + u8 *protocol, struct flowi6 *fl6) +{ + const struct ip6_tnl_encap_ops *ops; + int ret = -EINVAL; + + if (t->encap.type == TUNNEL_ENCAP_NONE) + return 0; + + if (t->encap.type >= MAX_IPTUN_ENCAP_OPS) + return -EINVAL; + + 
rcu_read_lock(); + ops = rcu_dereference(ip6tun_encaps[t->encap.type]); + if (likely(ops && ops->build_header)) + ret = ops->build_header(skb, >encap, protocol, fl6); + rcu_read_unlock(); + + return ret; +} + /* Tunnel encapsulation limit destination sub-option */ struct ipv6_tlv_tnl_enc_lim { diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index cc66a20..afd6b59 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include #include @@ -51,6 +52,10 @@ const struct ip_tunnel_encap_ops __rcu * iptun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly; EXPORT_SYMBOL(iptun_encaps); +const struct ip6_tnl_encap_ops __rcu * + ip6tun_encaps[MAX_IPTUN_ENCAP_OPS] __read_mostly; +EXPORT_SYMBOL(ip6tun_encaps); + void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb, __be32 src, __be32 dst, __u8 proto, __u8 tos, __u8 ttl, __be16 df, bool xnet) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index e79330f..ec53612 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1125,10 +1125,14 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield, } max_headroom = LL_RESERVED_SPACE(dst->dev) + sizeof(struct ipv6hdr) - + dst->header_len; + + dst->header_len + t->hlen; if (max_headroom > dev->needed_headroom) dev->needed_headroom = max_headroom; + err = ip6_tnl_encap(skb, t, , fl6); + if (err) + return err; + skb_push(skb, sizeof(struct ipv6hdr)); skb_reset_network_header(skb); ipv6h = ipv6_hdr(skb); @@ -1280,6 +1284,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t) struct net_device *dev = t->dev; struct __ip6_tnl_parm *p = >parms; struct flowi6 *fl6 = >fl.u.ip6; + int t_hlen; memcpy(dev->dev_addr, >laddr, sizeof(struct in6_addr)); memcpy(dev->broadcast, >raddr, sizeof(struct in6_addr)); @@ -1303,6 +1308,10 @@ static void ip6_tnl_link_config(struct ip6_tnl *t) else dev->flags &= ~IFF_POINTOPOINT; + t->tun_hlen = 0; + t->hlen = 
t->encap_hlen + t->tun_hlen; + t_hlen = t->hlen + sizeof(struct ipv6hdr); + if (p->flags & IP6_TNL_F_CAP_XMIT) { int strict =
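The lookup pattern behind `ip6_encap_hlen()` (short-circuit for "no encapsulation", bounds-check the type, then dispatch through an ops table) can be sketched without RCU in userspace C. The table, the slot number 1 and `fou_hlen()` returning 8 (the size of a UDP header) are assumptions made for this illustration; the names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

#define ENCAP_NONE	0
#define MAX_ENCAP_OPS	8

struct encap { unsigned int type; };

struct encap_ops {
	size_t (*encap_hlen)(const struct encap *e);
};

/* Registered ops, indexed by encap type; NULL means not registered. */
static const struct encap_ops *encap_table[MAX_ENCAP_OPS];

static size_t fou_hlen(const struct encap *e) { (void)e; return 8; }
static const struct encap_ops fou_ops = { fou_hlen };

/* Header length for this encap, 0 for none, -EINVAL if unsupported. */
static int encap_hlen(const struct encap *e)
{
	const struct encap_ops *ops;

	if (e->type == ENCAP_NONE)
		return 0;
	if (e->type >= MAX_ENCAP_OPS)
		return -EINVAL;

	ops = encap_table[e->type];
	if (ops && ops->encap_hlen)
		return (int)ops->encap_hlen(e);
	return -EINVAL;
}
```

The kernel version wraps the table read in `rcu_read_lock()`/`rcu_dereference()` so that ops can be unregistered safely; that part is deliberately left out here.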
[PATCH v6 net-next 12/14] ip6_tunnel: Add support for fou/gue encapsulation
Add netlink and setup for encapsulation Signed-off-by: Tom Herbert--- net/ipv6/ip6_tunnel.c | 72 +++ 1 file changed, 72 insertions(+) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index ec53612..8076c7a 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1796,13 +1796,55 @@ static void ip6_tnl_netlink_parms(struct nlattr *data[], parms->proto = nla_get_u8(data[IFLA_IPTUN_PROTO]); } +static bool ip6_tnl_netlink_encap_parms(struct nlattr *data[], + struct ip_tunnel_encap *ipencap) +{ + bool ret = false; + + memset(ipencap, 0, sizeof(*ipencap)); + + if (!data) + return ret; + + if (data[IFLA_IPTUN_ENCAP_TYPE]) { + ret = true; + ipencap->type = nla_get_u16(data[IFLA_IPTUN_ENCAP_TYPE]); + } + + if (data[IFLA_IPTUN_ENCAP_FLAGS]) { + ret = true; + ipencap->flags = nla_get_u16(data[IFLA_IPTUN_ENCAP_FLAGS]); + } + + if (data[IFLA_IPTUN_ENCAP_SPORT]) { + ret = true; + ipencap->sport = nla_get_be16(data[IFLA_IPTUN_ENCAP_SPORT]); + } + + if (data[IFLA_IPTUN_ENCAP_DPORT]) { + ret = true; + ipencap->dport = nla_get_be16(data[IFLA_IPTUN_ENCAP_DPORT]); + } + + return ret; +} + static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { struct net *net = dev_net(dev); struct ip6_tnl *nt, *t; + struct ip_tunnel_encap ipencap; nt = netdev_priv(dev); + + if (ip6_tnl_netlink_encap_parms(data, )) { + int err = ip6_tnl_encap_setup(nt, ); + + if (err < 0) + return err; + } + ip6_tnl_netlink_parms(data, >parms); t = ip6_tnl_locate(net, >parms, 0); @@ -1819,10 +1861,17 @@ static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[], struct __ip6_tnl_parm p; struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); + struct ip_tunnel_encap ipencap; if (dev == ip6n->fb_tnl_dev) return -EINVAL; + if (ip6_tnl_netlink_encap_parms(data, )) { + int err = ip6_tnl_encap_setup(t, ); + + if (err < 0) + return err; + } ip6_tnl_netlink_parms(data, ); t = ip6_tnl_locate(net, , 
0); @@ -1863,6 +1912,14 @@ static size_t ip6_tnl_get_size(const struct net_device *dev) nla_total_size(4) + /* IFLA_IPTUN_PROTO */ nla_total_size(1) + + /* IFLA_IPTUN_ENCAP_TYPE */ + nla_total_size(2) + + /* IFLA_IPTUN_ENCAP_FLAGS */ + nla_total_size(2) + + /* IFLA_IPTUN_ENCAP_SPORT */ + nla_total_size(2) + + /* IFLA_IPTUN_ENCAP_DPORT */ + nla_total_size(2) + 0; } @@ -1880,6 +1937,17 @@ static int ip6_tnl_fill_info(struct sk_buff *skb, const struct net_device *dev) nla_put_u32(skb, IFLA_IPTUN_FLAGS, parm->flags) || nla_put_u8(skb, IFLA_IPTUN_PROTO, parm->proto)) goto nla_put_failure; + + if (nla_put_u16(skb, IFLA_IPTUN_ENCAP_TYPE, + tunnel->encap.type) || + nla_put_be16(skb, IFLA_IPTUN_ENCAP_SPORT, +tunnel->encap.sport) || + nla_put_be16(skb, IFLA_IPTUN_ENCAP_DPORT, +tunnel->encap.dport) || + nla_put_u16(skb, IFLA_IPTUN_ENCAP_FLAGS, + tunnel->encap.flags)) + goto nla_put_failure; + return 0; nla_put_failure: @@ -1903,6 +1971,10 @@ static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = { [IFLA_IPTUN_FLOWINFO] = { .type = NLA_U32 }, [IFLA_IPTUN_FLAGS] = { .type = NLA_U32 }, [IFLA_IPTUN_PROTO] = { .type = NLA_U8 }, + [IFLA_IPTUN_ENCAP_TYPE] = { .type = NLA_U16 }, + [IFLA_IPTUN_ENCAP_FLAGS]= { .type = NLA_U16 }, + [IFLA_IPTUN_ENCAP_SPORT]= { .type = NLA_U16 }, + [IFLA_IPTUN_ENCAP_DPORT]= { .type = NLA_U16 }, }; static struct rtnl_link_ops ip6_link_ops __read_mostly = { -- 2.8.0.rc2
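The contract of `ip6_tnl_netlink_encap_parms()` is "zero the output, copy whatever optional attributes are present, and report whether any were". A userspace sketch of that contract follows; the attribute array of plain `uint16_t` pointers and all names are hypothetical simplifications (real netlink attributes carry headers and byte order that are ignored here).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { ATTR_TYPE, ATTR_FLAGS, ATTR_SPORT, ATTR_DPORT, ATTR_MAX };

struct encap { uint16_t type, flags, sport, dport; };

/* attrs[i] is NULL when attribute i was not supplied by the caller. */
static bool parse_encap(const uint16_t *attrs[], struct encap *out)
{
	bool any = false;

	memset(out, 0, sizeof(*out));
	if (!attrs)
		return any;

	if (attrs[ATTR_TYPE]) {
		any = true;
		out->type = *attrs[ATTR_TYPE];
	}
	if (attrs[ATTR_FLAGS]) {
		any = true;
		out->flags = *attrs[ATTR_FLAGS];
	}
	if (attrs[ATTR_SPORT]) {
		any = true;
		out->sport = *attrs[ATTR_SPORT];
	}
	if (attrs[ATTR_DPORT]) {
		any = true;
		out->dport = *attrs[ATTR_DPORT];
	}
	return any;
}
```

Returning `false` when nothing was supplied is what lets newlink and changelink skip the encap setup call entirely for tunnels configured without encapsulation.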
[PATCH v6 net-next 13/14] ip6ip6: Support for GSO/GRO
Signed-off-by: Tom Herbert--- net/ipv6/ip6_offload.c | 24 +--- net/ipv6/ip6_tunnel.c | 3 +++ 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 787e55f..332d6a0 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -253,9 +253,11 @@ out: return pp; } -static struct sk_buff **sit_gro_receive(struct sk_buff **head, - struct sk_buff *skb) +static struct sk_buff **sit_ip6ip6_gro_receive(struct sk_buff **head, + struct sk_buff *skb) { + /* Common GRO receive for SIT and IP6IP6 */ + if (NAPI_GRO_CB(skb)->encap_mark) { NAPI_GRO_CB(skb)->flush = 1; return NULL; @@ -298,6 +300,13 @@ static int sit_gro_complete(struct sk_buff *skb, int nhoff) return ipv6_gro_complete(skb, nhoff); } +static int ip6ip6_gro_complete(struct sk_buff *skb, int nhoff) +{ + skb->encapsulation = 1; + skb_shinfo(skb)->gso_type |= SKB_GSO_IPXIP6; + return ipv6_gro_complete(skb, nhoff); +} + static struct packet_offload ipv6_packet_offload __read_mostly = { .type = cpu_to_be16(ETH_P_IPV6), .callbacks = { @@ -310,11 +319,19 @@ static struct packet_offload ipv6_packet_offload __read_mostly = { static const struct net_offload sit_offload = { .callbacks = { .gso_segment= ipv6_gso_segment, - .gro_receive= sit_gro_receive, + .gro_receive= sit_ip6ip6_gro_receive, .gro_complete = sit_gro_complete, }, }; +static const struct net_offload ip6ip6_offload = { + .callbacks = { + .gso_segment= ipv6_gso_segment, + .gro_receive= sit_ip6ip6_gro_receive, + .gro_complete = ip6ip6_gro_complete, + }, +}; + static int __init ipv6_offload_init(void) { @@ -326,6 +343,7 @@ static int __init ipv6_offload_init(void) dev_add_offload(_packet_offload); inet_add_offload(_offload, IPPROTO_IPV6); + inet6_add_offload(_offload, IPPROTO_IPV6); return 0; } diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 8076c7a..d205f17 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -1238,6 +1238,9 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, 
struct net_device *dev) if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) fl6.flowi6_mark = skb->mark; + if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6)) + return -1; + err = ip6_tnl_xmit(skb, dev, dsfield, , encap_limit, , IPPROTO_IPV6); if (err != 0) { -- 2.8.0.rc2
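The `encap_mark` check shared by the SIT and IP6IP6 receive paths allows one level of IP-in-IP during GRO and flushes anything nested deeper. A minimal userspace model of that guard, with a hypothetical `struct gro_state` and `enter_tunnel()` standing in for the `NAPI_GRO_CB(skb)` fields:

```c
#include <stdbool.h>

/* Hypothetical per-packet GRO state; mirrors encap_mark and flush. */
struct gro_state {
	bool encap_mark;
	bool flush;
};

/* Returns true when the inner receive step may run for this packet. */
static bool enter_tunnel(struct gro_state *st)
{
	if (st->encap_mark) {
		/* Already inside one tunnel: do not aggregate further. */
		st->flush = true;
		return false;
	}
	st->encap_mark = true;
	return true;
}
```

The first call marks the packet and proceeds; a second call on the same packet sets `flush` and refuses, which corresponds to the early `return NULL` in the diff.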
[PATCH v6 net-next 05/14] net: Cleanup encap items in ip_tunnels.h
Consolidate all the ip_tunnel_encap definitions in one spot in the header file. Also, move ip_encap_hlen and ip_tunnel_encap from ip_tunnel.c to ip_tunnels.h so they can be called without a dependency on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c. Signed-off-by: Tom Herbert --- include/net/ip_tunnels.h | 76 --- net/ipv4/ip_tunnel.c | 45 net/ipv4/ip_tunnel_core.c | 4 +++ 3 files changed, 62 insertions(+), 63 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index d916b43..dbf 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -171,22 +171,6 @@ struct ip_tunnel_net { struct ip_tunnel __rcu *collect_md_tun; }; -struct ip_tunnel_encap_ops { - size_t (*encap_hlen)(struct ip_tunnel_encap *e); - int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e, - u8 *protocol, struct flowi4 *fl4); -}; - -#define MAX_IPTUN_ENCAP_OPS 8 - -extern const struct ip_tunnel_encap_ops __rcu * - iptun_encaps[MAX_IPTUN_ENCAP_OPS]; - -int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op, - unsigned int num); -int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op, - unsigned int num); - static inline void ip_tunnel_key_init(struct ip_tunnel_key *key, __be32 saddr, __be32 daddr, u8 tos, u8 ttl, __be32 label, @@ -251,8 +235,6 @@ void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops); void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, const struct iphdr *tnl_params, const u8 protocol); int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd); -int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t, - u8 *protocol, struct flowi4 *fl4); int __ip_tunnel_change_mtu(struct net_device *dev, int new_mtu, bool strict); int ip_tunnel_change_mtu(struct net_device *dev, int new_mtu); @@ -271,9 +253,67 @@ int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[], int ip_tunnel_newlink(struct net_device *dev, struct nlattr 
*tb[], struct ip_tunnel_parm *p); void ip_tunnel_setup(struct net_device *dev, int net_id); + +struct ip_tunnel_encap_ops { + size_t (*encap_hlen)(struct ip_tunnel_encap *e); + int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e, + u8 *protocol, struct flowi4 *fl4); +}; + +#define MAX_IPTUN_ENCAP_OPS 8 + +extern const struct ip_tunnel_encap_ops __rcu * + iptun_encaps[MAX_IPTUN_ENCAP_OPS]; + +int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op, + unsigned int num); +int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op, + unsigned int num); + int ip_tunnel_encap_setup(struct ip_tunnel *t, struct ip_tunnel_encap *ipencap); +static inline int ip_encap_hlen(struct ip_tunnel_encap *e) +{ + const struct ip_tunnel_encap_ops *ops; + int hlen = -EINVAL; + + if (e->type == TUNNEL_ENCAP_NONE) + return 0; + + if (e->type >= MAX_IPTUN_ENCAP_OPS) + return -EINVAL; + + rcu_read_lock(); + ops = rcu_dereference(iptun_encaps[e->type]); + if (likely(ops && ops->encap_hlen)) + hlen = ops->encap_hlen(e); + rcu_read_unlock(); + + return hlen; +} + +static inline int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t, + u8 *protocol, struct flowi4 *fl4) +{ + const struct ip_tunnel_encap_ops *ops; + int ret = -EINVAL; + + if (t->encap.type == TUNNEL_ENCAP_NONE) + return 0; + + if (t->encap.type >= MAX_IPTUN_ENCAP_OPS) + return -EINVAL; + + rcu_read_lock(); + ops = rcu_dereference(iptun_encaps[t->encap.type]); + if (likely(ops && ops->build_header)) + ret = ops->build_header(skb, >encap, protocol, fl4); + rcu_read_unlock(); + + return ret; +} + /* Extract dsfield from inner protocol */ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph, const struct sk_buff *skb) diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index a69ed94..d8f5e0a 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -443,29 +443,6 @@ drop: } EXPORT_SYMBOL_GPL(ip_tunnel_rcv); -static int ip_encap_hlen(struct ip_tunnel_encap *e) -{ - const struct 
ip_tunnel_encap_ops *ops; - int hlen = -EINVAL; - - if (e->type == TUNNEL_ENCAP_NONE) - return 0; - - if (e->type >= MAX_IPTUN_ENCAP_OPS) - return -EINVAL; - -
[PATCH v6 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and NETIF_F_GSO_IPXIP6. These are used to describe an IP in IP tunnel and what the outer protocol is. The inner protocol can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT are removed (these are both instances of SKB_GSO_IPXIP4). SKB_GSO_IPXIP6 will be used when support for GSO with IP encapsulation over IPv6 is added. Signed-off-by: Tom Herbert --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 5 ++--- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 5 ++--- drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +-- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 3 +-- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 +-- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 +-- drivers/net/ethernet/intel/igb/igb_main.c | 3 +-- drivers/net/ethernet/intel/igbvf/netdev.c | 3 +-- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +-- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +-- include/linux/netdev_features.h | 12 ++-- include/linux/netdevice.h | 4 ++-- include/linux/skbuff.h| 4 ++-- net/core/ethtool.c| 4 ++-- net/ipv4/af_inet.c| 2 +- net/ipv4/ipip.c | 2 +- net/ipv6/ip6_offload.c| 4 ++-- net/ipv6/sit.c| 4 ++-- net/netfilter/ipvs/ip_vs_xmit.c | 17 +++-- 19 files changed, 37 insertions(+), 50 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index d465bd7..0a5b770 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -13259,12 +13259,11 @@ static int bnx2x_init_dev(struct bnx2x *bp, struct pci_dev *pdev, NETIF_F_RXHASH | NETIF_F_HW_VLAN_CTAG_TX; if (!chip_is_e1x) { dev->hw_features |= NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL | - NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT; + NETIF_F_GSO_IPXIP4; dev->hw_enc_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM 
| NETIF_F_SG | NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6 | - NETIF_F_GSO_IPIP | - NETIF_F_GSO_SIT | + NETIF_F_GSO_IPXIP4 | NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL; } diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 5a0dca3..72a2eff 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -6311,7 +6311,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_SG | NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE | - NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | + NETIF_F_GSO_IPXIP4 | NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM | NETIF_F_GSO_PARTIAL | NETIF_F_RXHASH | NETIF_F_RXCSUM | NETIF_F_LRO | NETIF_F_GRO; @@ -6321,8 +6321,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_GRE | NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM | - NETIF_F_GSO_IPIP | NETIF_F_GSO_SIT | - NETIF_F_GSO_PARTIAL; + NETIF_F_GSO_IPXIP4 | NETIF_F_GSO_PARTIAL; dev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM | NETIF_F_GSO_GRE_CSUM; dev->vlan_features = dev->hw_features | NETIF_F_HIGHDMA; diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 1cd0ebf..242a1ff 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -9083,8 +9083,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi) NETIF_F_TSO6 | NETIF_F_GSO_GRE | NETIF_F_GSO_GRE_CSUM | - NETIF_F_GSO_IPIP | - NETIF_F_GSO_SIT | +
[PATCH v6 net-next 04/14] ipv6: Change "final" protocol processing for encapsulation
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum,
can only be performed once.

This patch adds a check of whether a final protocol has been seen. The
rules are:
  - If the final protocol has not been seen any protocol is processed
    (final and non-final). In the case of a final protocol, the final
    actions are taken (like the skb_postpull_rcsum)
  - If a final protocol has been seen (e.g. an encapsulating UDP
    header) then no further non-final protocols are allowed
    (e.g. extension headers). For more final protocols the
    final actions are not taken (e.g. skb_postpull_rcsum).

Signed-off-by: Tom Herbert
---
 net/ipv6/ip6_input.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d35dff2..94611e4 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -223,6 +223,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 	unsigned int nhoff;
 	int nexthdr;
 	bool raw;
+	bool have_final = false;
 
 	/*
 	 *	Parse extension headers
@@ -242,9 +243,21 @@ resubmit_final:
 	if (ipprot) {
 		int ret;
 
-		if (ipprot->flags & INET6_PROTO_FINAL) {
+		if (have_final) {
+			if (!(ipprot->flags & INET6_PROTO_FINAL)) {
+				/* Once we've seen a final protocol don't
+				 * allow encapsulation on any non-final
+				 * ones. This allows foo in UDP encapsulation
+				 * to work.
+				 */
+				goto discard;
+			}
+		} else if (ipprot->flags & INET6_PROTO_FINAL) {
 			const struct ipv6hdr *hdr;
 
+			/* Only do this once for first final protocol */
+			have_final = true;
+
 			/* Free reference early: we don't need it any more,
 			   and it may hold ip_conntrack module loaded
 			   indefinitely. */
-- 
2.8.0.rc2
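The two rules above can be modelled outside the kernel as a small state machine. The sketch below is only an illustration: the function, the verdict enum and the counter (which stands in for once-only work such as skb_postpull_rcsum) are made up for the example and do not exist in net/ipv6/ip6_input.c.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_verdict { DEMO_DELIVERED, DEMO_DISCARD };

/*
 * Walk a chain of protocols as the resubmit loop would see them.
 * is_final[i] says whether the i-th protocol is marked final.
 * *final_actions counts how often the once-only final work ran.
 */
enum demo_verdict demo_walk(const bool *is_final, size_t n, int *final_actions)
{
	bool have_final = false;
	size_t i;

	*final_actions = 0;
	for (i = 0; i < n; i++) {
		if (have_final) {
			/* After a final protocol, non-final ones are refused. */
			if (!is_final[i])
				return DEMO_DISCARD;
			/* Further final protocols skip the final actions. */
		} else if (is_final[i]) {
			have_final = true;
			(*final_actions)++;
		}
	}
	return DEMO_DELIVERED;
}
```

For a chain such as "extension header, UDP encapsulation, TCP" the final actions run once, at the UDP header; a chain that puts an extension header after the UDP encapsulation is discarded.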
[PATCH v6 net-next 00/14] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling
This patch set:
  - Fixes GRE6 to process translate flags correctly from configuration
  - Adds support for GSO and GRO for ip6ip6 and ip4ip6
  - Add support for FOU and GUE in IPv6
  - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE
  - Fixes ip6_input to deal with UDP encapsulations
  - Some other minor fixes

v2:
  - Removed a check of GSO types in MPLS
  - Define GSO type SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input
    from Alexander)
  - Don't define GSO types specifically for IP6IP6 and IP4IP6, above
    fix makes that unnecessary
  - Don't bother clearing encapsulation flag in UDP tunnel segment
    (another item suggested by Alexander).

v3:
  - Address some minor comments from Alexander

v4:
  - Rebase on changes to fix IP TX tunnels
  - Fix MTU issues in ip4ip6, ip6ip6
  - Add test data for above

v5:
  - Address feedback from Shmulik Ladkani regarding extension header
    code that does not return next header but instead relies on
    returning value via nhoff. Solution here is to fix EH processing
    to return nexthdr value.
  - Refactored IPv4 encaps so that we won't need to create
    a ip6_tunnel_core.c when adding encap support for IPv6.

v6:
  - Fix build issues with regard to new GSO constants
  - Fix MTU calculation issues in ip6_tunnel.c pointed out by Alex
  - Add encap_hlen into headroom for GREv6 to work with FOU/GUE

Tested: Tested a variety of cases, but not the full matrix (which is quite
large now). Most of the obvious cases (e.g. GRE) work fine. Still some
issues probably with GSO/GRO being effective in all cases.
    - IPv4/GRE/GUE/IPv6 with RCO
      1 TCP_STREAM
        6616 Mbps
      200 TCP_RR
        1244043 tps
        141/243/446 90/95/99% latencies
        86.61% CPU utilization

    - IPv6/GRE/GUE/IPv6 with RCO
      1 TCP_STREAM
        6940 Mbps
      200 TCP_RR
        1270903 tps
        138/236/440 90/95/99% latencies
        87.51% CPU utilization

    - IP6IP6
      1 TCP_STREAM
        2576 Mbps
      200 TCP_RR
        498981 tps
        388/498/631 90/95/99% latencies
        19.75% CPU utilization (1 CPU saturated)

    - IP6IP6/GUE with RCO
      1 TCP_STREAM
        2031 Mbps
      200 TCP_RR
        1233818 tps
        143/244/451 90/95/99% latencies
        87.57 CPU utilization

    - IP4IP6
      1 TCP_STREAM
        2371 Mbps
      200 TCP_RR
        763774 tps
        250/318/466 90/95/99% latencies
        35.25% CPU utilization (1 CPU saturated)

    - IP4IP6/GUE with RCO
      1 TCP_STREAM
        2054 Mbps
      200 TCP_RR
        1196385 tps
        148/251/460 90/95/99% latencies
        87.56 CPU utilization

    - GRE with keyid
      200 TCP_RR
        744173 tps
        258/332/461 90/95/99% latencies
        34.59% CPU utilization (1 CPU saturated)

Tom Herbert (14):
  gso: Remove arbitrary checks for unsupported GSO
  net: define gso types for IPx over IPv4 and IPv6
  ipv6: Fix nexthdr for reinjection
  ipv6: Change "final" protocol processing for encapsulation
  net: Cleanup encap items in ip_tunnels.h
  fou: Call setup_udp_tunnel_sock
  fou: Split out {fou,gue}_build_header
  fou: Support IPv6 in fou
  ip6_tun: Add infrastructure for doing encapsulation
  fou: Add encap ops for IPv6 tunnels
  ip6_gre: Add support for fou/gue encapsulation
  ip6_tunnel: Add support for fou/gue encapsulation
  ip6ip6: Support for GSO/GRO
  ip4ip6: Support for GSO/GRO

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       |   3 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c     |   3 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c   |   3 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |   3 +-
 drivers/net/ethernet/intel/igbvf/netdev.c         |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |   3 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 include/linux/netdev_features.h                   |  12 +-
 include/linux/netdevice.h                         |   4 +-
 include/linux/skbuff.h                            |   4 +-
 include/net/fou.h                                 |  10 +-
 include/net/inet_common.h                         |   5 +
 include/net/ip6_tunnel.h                          |  58
 include/net/ip_tunnels.h                          |  76 +++---
 net/core/ethtool.c                                |   4 +-
 net/ipv4/af_inet.c                                |  32 ++---
 net/ipv4/fou.c                                    | 144 +++
 net/ipv4/gre_offload.c                            |  14 --
 net/ipv4/ip_tunnel.c                              |  45 --
 net/ipv4/ip_tunnel_core.c                         |   9 ++
 net/ipv4/ipip.c                                   |   2 +-
 net/ipv4/tcp_offload.c                            |  19 ---
 net/ipv4/udp_offload.c                            |  10
[PATCH net-next] bpf, doc: fix typo on bpf_asm descriptions
Fix description of some of the bpf_asm tool related jump instructions
and generally move them to format A <op> k.

Reported-by: Sebastian Amend
Signed-off-by: Daniel Borkmann
---
 Documentation/networking/filter.txt | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 6aef0b5..b9a4edf 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -216,14 +216,14 @@ opcodes as defined in linux/filter.h stand for:
 
   jmp          6                Jump to label
   ja           6                Jump to label
-  jeq          7, 8             Jump on k == A
-  jneq         8                Jump on k != A
-  jne          8                Jump on k != A
-  jlt          8                Jump on k < A
-  jle          8                Jump on k <= A
-  jgt          7, 8             Jump on k > A
-  jge          7, 8             Jump on k >= A
-  jset         7, 8             Jump on k & A
+  jeq          7, 8             Jump on A == k
+  jneq         8                Jump on A != k
+  jne          8                Jump on A != k
+  jlt          8                Jump on A < k
+  jle          8                Jump on A <= k
+  jgt          7, 8             Jump on A > k
+  jge          7, 8             Jump on A >= k
+  jset         7, 8             Jump on A & k
 
   add          0, 4             A + <x>
   sub          0, 4             A - <x>
-- 
1.9.3
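The operand order matters for the comparisons that are not symmetric (jlt, jle, jgt, jge): with the accumulator A on the left, "jgt" means "jump if A is greater than k". A minimal userspace sketch of the corrected semantics follows; the enum and function are hypothetical names for this example, not part of bpf_asm or the kernel, and the comparisons are unsigned 32-bit as in classic BPF.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_jump {
	DEMO_JEQ, DEMO_JNE, DEMO_JLT, DEMO_JLE,
	DEMO_JGT, DEMO_JGE, DEMO_JSET,
};

/*
 * Return true when the conditional jump is taken.
 * A is the accumulator and always the left-hand operand,
 * k is the constant carried by the instruction.
 */
bool demo_jump_taken(enum demo_jump op, uint32_t A, uint32_t k)
{
	switch (op) {
	case DEMO_JEQ:  return A == k;
	case DEMO_JNE:  return A != k;
	case DEMO_JLT:  return A <  k;
	case DEMO_JLE:  return A <= k;
	case DEMO_JGT:  return A >  k;
	case DEMO_JGE:  return A >= k;
	case DEMO_JSET: return (A & k) != 0;
	}
	return false;
}
```

For example, with A = 5 a "jgt #3" is taken, whereas the old wording ("k > A") would have suggested it is not.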
Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
On Fri, 2016-05-13 at 14:56 +0900, Ethan Zhao wrote:
> Allocating 64 Tx/Rx as default doesn't benefit performance when less
> CPUs were assigned, especially when DCB is enabled, so we should take
> num_online_cpus() as top limit, and also to make sure every TC has
> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
> number.
>
> Signed-off-by: Ethan Zhao
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 
>  1 file changed, 4 insertions(+)

Dropping this patch based on Alex's and John's feedback.
Re: [PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
On 05/16/2016 01:52 PM, Alexander Graf wrote:
> If you compile without OF_MDIO support in an RGMII configuration, we fail
> to configure the dp83867 phy today by writing garbage into its configuration
> registers.
>
> On the other hand if you do compile with OF_MDIO and the phy gets loaded via
> device tree, you have to have the properties set in the device tree, otherwise
> we fail to load the driver and don't even attach the generic phy driver to
> the interface anymore.
>
> To make things slightly more consistent, make the rgmii configuration
> properties
> optional and allow a user to omit them in their device tree.
>
> Signed-off-by: Alexander Graf
> ---
>  drivers/net/phy/dp83867.c | 31 ---
>  1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index 94cc278..1b01680 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -65,6 +65,7 @@ struct dp83867_private {
>  	int rx_id_delay;
>  	int tx_id_delay;
>  	int fifo_depth;
> +	int values_are_sane;
>  };
>
>  static int dp83867_ack_interrupt(struct phy_device *phydev)
> @@ -113,15 +114,30 @@ static int dp83867_of_init(struct phy_device *phydev)
>  	ret = of_property_read_u32(of_node, "ti,rx-internal-delay",
>  				   &dp83867->rx_id_delay);
>  	if (ret)
> -		return ret;
> +		goto invalid_dt;
>
>  	ret = of_property_read_u32(of_node, "ti,tx-internal-delay",
>  				   &dp83867->tx_id_delay);
>  	if (ret)
> -		return ret;
> +		goto invalid_dt;

Optional means you may or may not have the entries

I would prefer to wrap the DT reading with the interface type check.

	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID ||
	    phydev->interface == PHY_INTERFACE_MODE_RGMII_ID )
		ret = of_property_read_u32(of_node, "ti,tx-internal-delay",
					   &dp83867->tx_id_delay);
	if (ret)
		goto invalid_dt;

Otherwise this continues to mandate that you need to declare all the DT entries
when in fact you may only have to declare 1. And if the other interfaces are
declared then DT entries are ignored.

And configuring internal delay is not required per section 8.9 footnote 3 of
the data sheet.

Dan

>
> -	return of_property_read_u32(of_node, "ti,fifo-depth",
> +	ret = of_property_read_u32(of_node, "ti,fifo-depth",
>  				   &dp83867->fifo_depth);
> +	if (ret)
> +		goto invalid_dt;
> +
> +	dp83867->values_are_sane = 1;
> +
> +	return 0;
> +
> +invalid_dt:
> +	phydev_err(phydev, "missing properties in device tree");
> +
> +	/*
> +	 * We can still run with a broken dt by not using any of the optional
> +	 * parameters, so just don't set dp83867->values_are_sane.
> +	 */
> +	return 0;
>  }
>  #else
>  static int dp83867_of_init(struct phy_device *phydev)
> @@ -150,6 +166,15 @@ static int dp83867_config_init(struct phy_device *phydev)
>  		dp83867 = (struct dp83867_private *)phydev->priv;
>  	}
>
> +	/*
> +	 * With no or broken device tree, we don't have the values that we would
> +	 * want to configure the phy with. In that case, cross our fingers and
> +	 * assume that firmware did everything correctly for us or that we don't
> +	 * need them.
> +	 */
> +	if (!dp83867->values_are_sane)
> +		return 0;
> +
>  	if (phy_interface_is_rgmii(phydev)) {
>  		ret = phy_write(phydev, MII_DP83867_PHYCTRL,
>  			(dp83867->fifo_depth <<
> DP83867_PHYCR_FIFO_DEPTH_SHIFT));

--
--
Dan Murphy
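As a rough illustration of the suggestion in this reply, the sketch below reads each delay property only when the selected interface mode actually uses it, so a device tree has to declare only the entries that apply. It is a userspace model: the types and the function are invented for the example, the "node" struct stands in for a device tree node, and applying the same rule to the RX delay (RXID or ID) is an assumption made by analogy with the TX case shown in the mail.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PHY_INTERFACE_MODE_RGMII* modes. */
enum demo_mode {
	DEMO_MODE_RGMII,	/* no internal delay */
	DEMO_MODE_RGMII_ID,	/* RX and TX internal delay */
	DEMO_MODE_RGMII_RXID,	/* RX internal delay only */
	DEMO_MODE_RGMII_TXID,	/* TX internal delay only */
};

/* Stand-in for a device tree node: which properties exist and their values. */
struct demo_node {
	bool has_rx_delay;
	bool has_tx_delay;
	unsigned int rx_delay;
	unsigned int tx_delay;
};

struct demo_cfg {
	unsigned int rx_delay;
	unsigned int tx_delay;
};

/* Returns 0 on success, -1 if a property needed by this mode is missing. */
int demo_read_delays(enum demo_mode mode, const struct demo_node *np,
		     struct demo_cfg *cfg)
{
	bool need_rx = mode == DEMO_MODE_RGMII_ID || mode == DEMO_MODE_RGMII_RXID;
	bool need_tx = mode == DEMO_MODE_RGMII_ID || mode == DEMO_MODE_RGMII_TXID;

	if (need_rx) {
		if (!np->has_rx_delay)
			return -1;
		cfg->rx_delay = np->rx_delay;
	}
	if (need_tx) {
		if (!np->has_tx_delay)
			return -1;
		cfg->tx_delay = np->tx_delay;
	}
	/* Properties the mode does not use are simply ignored. */
	return 0;
}
```

A node that declares only the TX delay is then accepted for RGMII_TXID and plain RGMII, and rejected only for RGMII_ID, where the RX delay is genuinely needed.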
Re: [PATCH v5 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Mon, May 16, 2016 at 12:28 PM, Tom Herbert wrote:
> On Mon, May 16, 2016 at 12:24 PM, Alexander Duyck wrote:
>> On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
>>> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
>>> for getting encap hlen, setting up encap on a tunnel, performing
>>> encapsulation operation.
>>>
>>> Signed-off-by: Tom Herbert
>>> ---
>>>  include/net/ip6_tunnel.h  | 58 ++
>>>  net/ipv4/ip_tunnel_core.c |  5 +++
>>>  net/ipv6/ip6_tunnel.c     | 89 +--
>>>  3 files changed, 141 insertions(+), 11 deletions(-)
>>
>> So a bisect is pointing to this patch as causing a regression in IPv6
>> GRE throughput from 20 Gb/s to .04 Mb/s
>>
>> <...>
>>
>>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>>> index e79330f..9f0ea85 100644
>>> --- a/net/ipv6/ip6_tunnel.c
>>> +++ b/net/ipv6/ip6_tunnel.c
>>> @@ -1010,7 +1010,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>>>  	struct dst_entry *dst = NULL, *ndst = NULL;
>>>  	struct net_device *tdev;
>>>  	int mtu;
>>> -	unsigned int max_headroom = sizeof(struct ipv6hdr);
>>> +	unsigned int max_headroom = sizeof(struct ipv6hdr) + t->hlen;
>>>  	int err = -1;
>>>
>>>  	/* NBMA tunnel */
>>> @@ -1063,7 +1063,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>>>  				     t->parms.name);
>>>  		goto tx_err_dst_release;
>>>  	}
>>> -	mtu = dst_mtu(dst) - sizeof(*ipv6h);
>>> +	mtu = dst_mtu(dst) - sizeof(*ipv6h) - t->hlen;
>>>  	if (encap_limit >= 0) {
>>>  		max_headroom += 8;
>>>  		mtu -= 8;
>>
>> So I am pretty sure this bit here is causing the regression. Your skb
>> already has a GRE header added and it is included in skb->len. In the
>> tests just below here you are comparing skb->len to mtu, but you now
>> have the GRE header included twice so it is going to fail. Odds are
>> this should be t->encap_hlen, and not t->hlen.
>>
> Good catch! Fixing now...

Actually I think the one other case above for max_headroom probably
should be encap_hlen as well. After all we don't need to allocate
headroom for something we have already placed in the skb.

I'm still digging into the patch set. If I find anything else I will
let you know. I'm hoping to be able to test ip6ip6 hardware tunnel
offloads by the end of today.

- Alex
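The double counting discussed in this thread can be checked with a little arithmetic. The sketch below is a simplified userspace model, not the code in ip6_tnl_xmit(): the helper names are invented, the size test is reduced to "skb length greater than mtu", and the numbers (1500 byte path MTU, 4 byte GRE header, no FOU/GUE header) are assumptions picked for the example.

```c
#include <stdbool.h>

#define DEMO_IPV6_HDR_LEN 40	/* fixed IPv6 header size in bytes */

/* Room left for the packet once the headers still to be added are deducted. */
int demo_tunnel_mtu(int dst_mtu, int hdrs_still_to_add)
{
	return dst_mtu - DEMO_IPV6_HDR_LEN - hdrs_still_to_add;
}

/* Simplified stand-in for the "packet too big" test made after the mtu is computed. */
bool demo_too_big(int skb_len, int mtu)
{
	return skb_len > mtu;
}
```

Take a full-size packet: 1456 bytes of inner data plus the 4 byte GRE header that is already in the skb, so skb->len is 1460. Deducting only the headers still to come (encap_hlen, here 0) gives an mtu of 1460 and the packet passes. Deducting hlen, which counts the GRE header a second time, gives 1456 and the same packet is wrongly treated as too big.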
Re: [PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
On Mon, May 16, 2016 at 08:52:43PM +0200, Alexander Graf wrote:
> If you compile without OF_MDIO support in an RGMII configuration, we fail
> to configure the dp83867 phy today by writing garbage into its configuration
> registers.
>
> On the other hand if you do compile with OF_MDIO and the phy gets loaded via
> device tree, you have to have the properties set in the device tree, otherwise
> we fail to load the driver and don't even attach the generic phy driver to
> the interface anymore.
>
> To make things slightly more consistent, make the rgmii configuration
> properties
> optional and allow a user to omit them in their device tree.

The binding document actually says they are required. It would be good
to make the binding documentation and the code consistent.

	Andrew
Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)
On 16 May 2016 at 19:04, Dave Taht wrote:
> On Mon, May 16, 2016 at 1:14 AM, Roman Yeryomin wrote:
>> On 16 May 2016 at 01:34, Roman Yeryomin wrote:
>>> On 6 May 2016 at 22:43, Dave Taht wrote:
>>>> On Fri, May 6, 2016 at 11:56 AM, Roman Yeryomin wrote:
>>>> > On 6 May 2016 at 21:43, Roman Yeryomin wrote:
>>>> >> On 6 May 2016 at 15:47, Jesper Dangaard Brouer wrote:
>>>> >>>
>>>> >>> I've created a OpenWRT ticket[1] on this issue, as it seems that
>>>> >>> someone[2]
>>>> >>> closed Felix'es OpenWRT email account (bad choice! emails bouncing).
>>>> >>> Sounds like OpenWRT and the LEDE https://www.lede-project.org/ project
>>>> >>> is in some kind of conflict.
>>>> >>>
>>>> >>> OpenWRT ticket [1] https://dev.openwrt.org/ticket/22349
>>>> >>>
>>>> >>> [2]
>>>> >>> http://thread.gmane.org/gmane.comp.embedded.openwrt.devel/40298/focus=40335
>>>> >>
>>>> >> OK, so, after porting the patch to 4.1 openwrt kernel and playing a
>>>> >> bit with fq_codel limits I was able to get 420Mbps UDP like this:
>>>> >> tc qdisc replace dev wlan0 parent :1 fq_codel flows 16 limit 256
>>>> >
>>>> > Forgot to mention, I've reduced drop_batch_size down to 32
>>>>
>>>> 0) Not clear to me if that's the right line, there are 4 wifi queues,
>>>> and the third one is the BE queue.
>>>
>>> That was an example, sorry, should have stated that. I've applied same
>>> settings to all 4 queues.
>>>
>>>> That is too low a limit, also, for normal use. And:
>>>> for the purpose of this particular UDP test, flows 16 is ok, but not
>>>> ideal.
>>>
>>> I played with different combinations, it doesn't make any
>>> (significant) difference: 20-30Mbps, not more.
>>> What numbers would you propose?
>>>
>>>> 1) What's the tcp number (with a simultaneous ping) with this latest
>>>> patchset? (I care about tcp performance a lot more than udp floods -
>>>> surviving a udp flood yes, performance, no)
>>>
>>> During the test (both TCP and UDP) it's roughly 5ms in average, not
>>> running tests ~2ms. Actually I'm now wondering if target is working at
>>> all, because I had same result with target 80ms..
>>> So, yes, latency is good, but performance is poor.
>>>
>>>> before/after?
>>>>
>>>> tc -s qdisc show dev wlan0 during/after results?
>>>
>>> during the test:
>>>
>>> qdisc mq 0: root
>>> Sent 1600496000 bytes 1057194 pkt (dropped 1421568, overlimits 0 requeues
>>> 17)
>>> backlog 1545794b 1021p requeues 17
>>> qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>> new_flows_len 0 old_flows_len 0
>>> qdisc fq_codel 8002: parent :2 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>> new_flows_len 0 old_flows_len 0
>>> qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 1601271168 bytes 1057706 pkt (dropped 1422304, overlimits 0 requeues
>>> 17)
>>> backlog 1541252b 1018p requeues 17
>>> maxpacket 1514 drop_overlimit 1422304 new_flow_count 35 ecn_mark 0
>>> new_flows_len 0 old_flows_len 1
>>> qdisc fq_codel 8004: parent :4 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>> new_flows_len 0 old_flows_len 0
>>>
>>>
>>> after the test (60sec):
>>>
>>> qdisc mq 0: root
>>> Sent 3084996052 bytes 2037744 pkt (dropped 2770176, overlimits 0 requeues
>>> 28)
>>> backlog 0b 0p requeues 28
>>> qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>> new_flows_len 0 old_flows_len 0
>>> qdisc fq_codel 8002: parent :2 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>> new_flows_len 0 old_flows_len 0
>>> qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
>>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>> Sent 3084996052 bytes 2037744 pkt (dropped 2770176, overlimits 0 requeues
>>> 28)
>>> backlog 0b 0p requeues 28
>>> maxpacket 1514 drop_overlimit 2770176
[PATCH] i40e: Fix errors resulted while turning off TSO
On systems with 128 CPUs, turning off TSO results in errors,

i40e :03:00.0: failed to get tracking for 1 vectors for VSI 400, err=-12
i40e :03:00.0: Couldn't create FDir VSI
i40e :03:00.0: i40e_ptp_init: PTP not supported on eth0
i40e :03:00.0: couldn't add VEB, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOENT
i40e :03:00.0: rebuild of switch failed: -1, will try to set up simple PF connection
i40e :03:00.0 eth0: adding 00:10:e0:8a:24:b6 vid=0

Enabling FD_SB without checking the availability of an MSI-X vector
is the root cause. This change adds the necessary check.

Signed-off-by: Tushar Dave
---
 drivers/net/ethernet/intel/i40e/i40e.h      |    1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |    8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 68f2204..80dcb5c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -270,6 +270,7 @@ struct i40e_pf {
 #endif /* I40E_FCOE */
 	u16 num_lan_qps;           /* num lan queues this PF has set up */
 	u16 num_lan_msix;          /* num queue vectors for the base PF vsi */
+	u16 num_fdsb_msix;         /* num queue vectors for sideband Fdir */
 	int queues_left;           /* queues left unclaimed */
 	u16 alloc_rss_size;        /* allocated RSS queues */
 	u16 rss_size_max;          /* HW defined max RSS queues */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8f3b53e..9248863 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7170,7 +7170,7 @@ static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
 		vsi->alloc_queue_pairs = 1;
 		vsi->num_desc = ALIGN(I40E_FDIR_RING_COUNT,
 				      I40E_REQ_DESCRIPTOR_MULTIPLE);
-		vsi->num_q_vectors = 1;
+		vsi->num_q_vectors = pf->num_fdsb_msix;
 		break;
 
 	case I40E_VSI_VMDQ2:
@@ -7558,9 +7558,11 @@ static int i40e_init_msix(struct i40e_pf *pf)
 	/* reserve one vector for sideband flow director */
 	if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
 		if (vectors_left) {
+			pf->num_fdsb_msix = 1;
 			v_budget++;
 			vectors_left--;
 		} else {
+			pf->num_fdsb_msix = 0;
 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
 		}
 	}
@@ -8443,7 +8445,9 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
 		/* Enable filters and mark for reset */
 		if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
 			need_reset = true;
-		pf->flags |= I40E_FLAG_FD_SB_ENABLED;
+		/* enable FD_SB only if there is MSI-X vector */
+		if (pf->num_fdsb_msix > 0)
+			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
 	} else {
 		/* turn off filters, mark for reset and clear SW filter list */
 		if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
-- 
1.7.1
i40e: Errors while turning off TSO
On systems with 128 CPUs, turning off TSO results in errors.

Errors:
i40e :03:00.0: failed to get tracking for 1 vectors for VSI 400, err=-12
i40e :03:00.0: Couldn't create FDir VSI
i40e :03:00.0: i40e_ptp_init: PTP not supported on eth0
i40e :03:00.0: couldn't add VEB, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOENT
i40e :03:00.0: rebuild of switch failed: -1, will try to set up simple PF connection
i40e :03:00.0 eth0: adding 00:10:e0:8a:24:b6 vid=0

From kernel log:
i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.8-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
i40e :03:00.0: fw 4.40.35115 api 1.4 nvm 4.53 0x80001e8c 0.0.0
i40e :03:00.0: MAC address: 00:10:e0:8a:24:b6
i40e :03:00.0: i40e_ptp_init: PTP not supported on eth0
i40e :03:00.0: PCI-Express: Speed 8.0GT/s Width x8
i40e :03:00.0: Features: PF-id[0] VFs: 32 VSIs: 34 QP: 128 RX: 1BUF RSS FD_ATR VxLAN VEPA

As per the log above, feature FD_SB (sideband flow director) is not
enabled, because there are not enough MSI-X vectors available. (Device
function caps report 129 MSI-X vectors in this case. The driver reserves
1 of them for the misc interrupt and the remaining 128 for the 128 QPs,
so no vector is left for FD_SB. Therefore the driver disables FD_SB.)

However, turning off TSO invokes i40e_set_ntuple(), which enables FD_SB
and returns true, so i40e_set_features() issues a reset, i.e.
i40e_do_reset(). Later during the reset, the driver fails to find an irq
for FD_SB in the irq pile (and it won't, because no irq vector was
assigned for FD_SB). This results in the very first error,
'i40e :03:00.0: failed to get tracking for 1 vectors for VSI 400, err=-12'

I believe that before enabling FD_SB in i40e_set_ntuple(), the driver
should check whether an MSI-X vector is available for FD_SB. Sending
patch in separate email.

(FWIW, if the number of CPUs is reduced to 64, I don't see the issue
described above, because in that case only 64 of the 129 MSI-X vectors
get assigned to QPs. The remaining ones are used for features like FD_SB.)

-Tushar
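The vector arithmetic in this report can be sketched as a tiny userspace model. It is heavily simplified: the struct, both functions and the budgeting rule (one misc vector, one per CPU for queue pairs, then one for FD_SB if any is left) are assumptions for illustration, while the real i40e_init_msix() also budgets for other features.

```c
#include <stdbool.h>

struct demo_pf {
	int num_lan_msix;	/* vectors given to queue pairs */
	int num_fdsb_msix;	/* vectors reserved for sideband flow director */
	bool fd_sb_enabled;
};

/* Split the hardware vectors: misc first, then queue pairs, then FD_SB. */
void demo_init_msix(struct demo_pf *pf, int hw_vectors, int num_cpus)
{
	int left = hw_vectors - 1;	/* one vector for the misc interrupt */

	pf->num_lan_msix = left < num_cpus ? left : num_cpus;
	left -= pf->num_lan_msix;

	pf->num_fdsb_msix = left > 0 ? 1 : 0;
	pf->fd_sb_enabled = pf->num_fdsb_msix > 0;
}

/*
 * Model of switching ntuple filters on. FD_SB is only enabled when a
 * vector was reserved for it. Returns true if a reset is requested.
 */
bool demo_set_ntuple_on(struct demo_pf *pf)
{
	bool need_reset = !pf->fd_sb_enabled;

	if (pf->num_fdsb_msix > 0)
		pf->fd_sb_enabled = true;
	return need_reset;
}
```

With 129 vectors and 128 CPUs nothing is left for FD_SB, so it stays disabled even after the ntuple path runs; with 64 CPUs a vector is reserved and FD_SB is enabled from the start.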
Re: [PATCH v5 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Mon, May 16, 2016 at 12:24 PM, Alexander Duyck wrote:
> On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
>> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
>> for getting encap hlen, setting up encap on a tunnel, performing
>> encapsulation operation.
>>
>> Signed-off-by: Tom Herbert
>> ---
>>  include/net/ip6_tunnel.h  | 58 ++
>>  net/ipv4/ip_tunnel_core.c |  5 +++
>>  net/ipv6/ip6_tunnel.c     | 89 +--
>>  3 files changed, 141 insertions(+), 11 deletions(-)
>
> So a bisect is pointing to this patch as causing a regression in IPv6
> GRE throughput from 20 Gb/s to .04 Mb/s
>
> <...>
>
>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>> index e79330f..9f0ea85 100644
>> --- a/net/ipv6/ip6_tunnel.c
>> +++ b/net/ipv6/ip6_tunnel.c
>> @@ -1010,7 +1010,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>>  	struct dst_entry *dst = NULL, *ndst = NULL;
>>  	struct net_device *tdev;
>>  	int mtu;
>> -	unsigned int max_headroom = sizeof(struct ipv6hdr);
>> +	unsigned int max_headroom = sizeof(struct ipv6hdr) + t->hlen;
>>  	int err = -1;
>>
>>  	/* NBMA tunnel */
>> @@ -1063,7 +1063,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>>  				     t->parms.name);
>>  		goto tx_err_dst_release;
>>  	}
>> -	mtu = dst_mtu(dst) - sizeof(*ipv6h);
>> +	mtu = dst_mtu(dst) - sizeof(*ipv6h) - t->hlen;
>>  	if (encap_limit >= 0) {
>>  		max_headroom += 8;
>>  		mtu -= 8;
>
> So I am pretty sure this bit here is causing the regression. Your skb
> already has a GRE header added and it is included in skb->len. In the
> tests just below here you are comparing skb->len to mtu, but you now
> have the GRE header included twice so it is going to fail. Odds are
> this should be t->encap_hlen, and not t->hlen.
>
Good catch! Fixing now...

> - Alex
Re: [PATCH v5 net-next 09/14] ip6_tun: Add infrastructure for doing encapsulation
On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
> Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
> for getting encap hlen, setting up encap on a tunnel, performing
> encapsulation operation.
>
> Signed-off-by: Tom Herbert
> ---
>  include/net/ip6_tunnel.h  | 58 ++
>  net/ipv4/ip_tunnel_core.c |  5 +++
>  net/ipv6/ip6_tunnel.c     | 89 +--
>  3 files changed, 141 insertions(+), 11 deletions(-)

So a bisect is pointing to this patch as causing a regression in IPv6
GRE throughput from 20 Gb/s to .04 Mb/s

<...>

> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index e79330f..9f0ea85 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -1010,7 +1010,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>  	struct dst_entry *dst = NULL, *ndst = NULL;
>  	struct net_device *tdev;
>  	int mtu;
> -	unsigned int max_headroom = sizeof(struct ipv6hdr);
> +	unsigned int max_headroom = sizeof(struct ipv6hdr) + t->hlen;
>  	int err = -1;
>
>  	/* NBMA tunnel */
> @@ -1063,7 +1063,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
>  				     t->parms.name);
>  		goto tx_err_dst_release;
>  	}
> -	mtu = dst_mtu(dst) - sizeof(*ipv6h);
> +	mtu = dst_mtu(dst) - sizeof(*ipv6h) - t->hlen;
>  	if (encap_limit >= 0) {
>  		max_headroom += 8;
>  		mtu -= 8;

So I am pretty sure this bit here is causing the regression. Your skb
already has a GRE header added and it is included in skb->len. In the
tests just below here you are comparing skb->len to mtu, but you now
have the GRE header included twice so it is going to fail. Odds are
this should be t->encap_hlen, and not t->hlen.

- Alex
Re: [PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
On 05/16/2016 11:52 AM, Alexander Graf wrote:
> If you compile without OF_MDIO support in an RGMII configuration, we fail
> to configure the dp83867 phy today by writing garbage into its configuration
> registers.
>
> On the other hand if you do compile with OF_MDIO and the phy gets loaded via
> device tree, you have to have the properties set in the device tree, otherwise
> we fail to load the driver and don't even attach the generic phy driver to
> the interface anymore.
>
> To make things slightly more consistent, make the rgmii configuration
> properties
> optional and allow a user to omit them in their device tree.
>
> Signed-off-by: Alexander Graf
> ---
>  drivers/net/phy/dp83867.c | 31 ---
>  1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index 94cc278..1b01680 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -65,6 +65,7 @@ struct dp83867_private {
>  	int rx_id_delay;
>  	int tx_id_delay;
>  	int fifo_depth;
> +	int values_are_sane;

This could be a boolean type.

>  };
>
>  static int dp83867_ack_interrupt(struct phy_device *phydev)
> @@ -113,15 +114,30 @@ static int dp83867_of_init(struct phy_device *phydev)
>  	ret = of_property_read_u32(of_node, "ti,rx-internal-delay",
>  				   &dp83867->rx_id_delay);
>  	if (ret)
> -		return ret;
> +		goto invalid_dt;
>
>  	ret = of_property_read_u32(of_node, "ti,tx-internal-delay",
>  				   &dp83867->tx_id_delay);
>  	if (ret)
> -		return ret;
> +		goto invalid_dt;
>
> -	return of_property_read_u32(of_node, "ti,fifo-depth",
> +	ret = of_property_read_u32(of_node, "ti,fifo-depth",
>  				   &dp83867->fifo_depth);
> +	if (ret)
> +		goto invalid_dt;
> +
> +	dp83867->values_are_sane = 1;
> +
> +	return 0;
> +
> +invalid_dt:
> +	phydev_err(phydev, "missing properties in device tree");

phydev_warn() maybe? Other than that, this looks okay to me.
-- 
Florian
[PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
If you compile without OF_MDIO support in an RGMII configuration, we fail
to configure the dp83867 phy today by writing garbage into its configuration
registers.

On the other hand if you do compile with OF_MDIO and the phy gets loaded via
device tree, you have to have the properties set in the device tree, otherwise
we fail to load the driver and don't even attach the generic phy driver to
the interface anymore.

To make things slightly more consistent, make the rgmii configuration properties
optional and allow a user to omit them in their device tree.

Signed-off-by: Alexander Graf
---
 drivers/net/phy/dp83867.c | 31 ---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 94cc278..1b01680 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -65,6 +65,7 @@ struct dp83867_private {
 	int rx_id_delay;
 	int tx_id_delay;
 	int fifo_depth;
+	int values_are_sane;
 };
 
 static int dp83867_ack_interrupt(struct phy_device *phydev)
@@ -113,15 +114,30 @@ static int dp83867_of_init(struct phy_device *phydev)
 	ret = of_property_read_u32(of_node, "ti,rx-internal-delay",
 				   &dp83867->rx_id_delay);
 	if (ret)
-		return ret;
+		goto invalid_dt;
 
 	ret = of_property_read_u32(of_node, "ti,tx-internal-delay",
 				   &dp83867->tx_id_delay);
 	if (ret)
-		return ret;
+		goto invalid_dt;
 
-	return of_property_read_u32(of_node, "ti,fifo-depth",
+	ret = of_property_read_u32(of_node, "ti,fifo-depth",
 				   &dp83867->fifo_depth);
+	if (ret)
+		goto invalid_dt;
+
+	dp83867->values_are_sane = 1;
+
+	return 0;
+
+invalid_dt:
+	phydev_err(phydev, "missing properties in device tree");
+
+	/*
+	 * We can still run with a broken dt by not using any of the optional
+	 * parameters, so just don't set dp83867->values_are_sane.
+	 */
+	return 0;
 }
 #else
 static int dp83867_of_init(struct phy_device *phydev)
@@ -150,6 +166,15 @@ static int dp83867_config_init(struct phy_device *phydev)
 		dp83867 = (struct dp83867_private *)phydev->priv;
 	}
 
+	/*
+	 * With no or broken device tree, we don't have the values that we would
+	 * want to configure the phy with. In that case, cross our fingers and
+	 * assume that firmware did everything correctly for us or that we don't
+	 * need them.
+	 */
+	if (!dp83867->values_are_sane)
+		return 0;
+
 	if (phy_interface_is_rgmii(phydev)) {
 		ret = phy_write(phydev, MII_DP83867_PHYCTRL,
 			(dp83867->fifo_depth << DP83867_PHYCR_FIFO_DEPTH_SHIFT));
-- 
1.8.5.6
[PATCH v2 1/2] phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
When CONFIG_OF_MDIO is configured as module, the #define for it really is
CONFIG_OF_MDIO_MODULE, not CONFIG_OF_MDIO. So if we are compiling it as
module, the dp83867 doesn't see that OF_MDIO was selected and doesn't read
the dt rgmii parameters.

The fix is simple: Use IS_ENABLED(). It checks for both - module as well
as compiled in code.

Signed-off-by: Alexander Graf
---
 drivers/net/phy/dp83867.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 2afa61b..94cc278 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -99,7 +99,7 @@ static int dp83867_config_intr(struct phy_device *phydev)
 	return phy_write(phydev, MII_DP83867_MICR, micr_status);
 }
 
-#ifdef CONFIG_OF_MDIO
+#if IS_ENABLED(CONFIG_OF_MDIO)
 static int dp83867_of_init(struct phy_device *phydev)
 {
 	struct dp83867_private *dp83867 = phydev->priv;
-- 
1.8.5.6
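The difference between the two tests can be reproduced in userspace. The sketch below simulates a CONFIG_OF_MDIO=m build by defining only CONFIG_OF_MDIO_MODULE, and uses a cut-down imitation of the kernel's macro trick; the DEMO_* macros are stand-ins written for this example (modelled loosely on include/linux/kconfig.h), not the kernel's own definitions.

```c
/* Simulate "CONFIG_OF_MDIO=m": Kconfig defines only the _MODULE symbol. */
#define CONFIG_OF_MDIO_MODULE 1

/* Simplified imitation of the kernel's IS_ENABLED() machinery. */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND_ARG(ignored, val, ...) val
#define DEMO_IS_DEFINED(x)    DEMO_IS_DEFINED_2(x)
#define DEMO_IS_DEFINED_2(v)  DEMO_IS_DEFINED_3(DEMO_PLACEHOLDER_##v)
#define DEMO_IS_DEFINED_3(a)  DEMO_SECOND_ARG(a 1, 0)
#define DEMO_IS_BUILTIN(opt)  DEMO_IS_DEFINED(opt)
#define DEMO_IS_MODULE(opt)   DEMO_IS_DEFINED(opt##_MODULE)
#define DEMO_IS_ENABLED(opt)  (DEMO_IS_BUILTIN(opt) || DEMO_IS_MODULE(opt))

/* What "#ifdef CONFIG_OF_MDIO" sees in a =m build. */
int demo_seen_by_ifdef(void)
{
#ifdef CONFIG_OF_MDIO
	return 1;
#else
	return 0;
#endif
}

/* What an IS_ENABLED()-style test sees in the same build. */
int demo_seen_by_is_enabled(void)
{
	return DEMO_IS_ENABLED(CONFIG_OF_MDIO);
}
```

In this simulated module build the plain #ifdef test reports the option as absent, while the IS_ENABLED()-style test reports it as present, which is the behaviour the patch relies on.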
Re: [ANNOUNCE] Netdev 1.2 conference
On Tue, 17 May 2016 00:36:58 +0900 Hajime Tazaki wrote:
> Following the last successful Netdev 0.1 in Ottawa, Canada and
> 1.1 in Seville, Spain. We are happy to announce the third Netdev conference:
> Netdev 1.2 (year 1, conference 2) from 5th to 7th October 2016 in Tokyo,
> Japan (http://netdevconf.org/1.2/).

I understand that finding a free date for a conference is hard, but those dates overlap with LinuxCon Europe in Berlin. There may not be a lot of overlap in possible attendees, but perhaps there is a better date?
linux-4.6/net/kcm/kcmsock.c:1508: bad if test ?
Hello there,

[linux-4.6/net/kcm/kcmsock.c:1508]: (style) Checking if unsigned variable 'copied' is less than zero.

Source code is

    if (copied < 0) {

but

    size_t copied;

Since size_t is unsigned, the test can never be true. Suggest code rework.

Regards

David Binderman
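A minimal sketch of why the flagged test is dead code, and of one common way to rework it. The function names are hypothetical and not taken from kcmsock.c; carrying the result in a signed `ssize_t` is just one option, not necessarily the fix the kernel should adopt.

```c
#include <stddef.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

/* Hypothetical helper: returns a byte count, or a negative errno-style code. */
static ssize_t demo_copy(int fail)
{
	return fail ? -14 : 128;	/* -14 stands in for -EFAULT */
}

/* Buggy shape: the negative result is stored in an unsigned variable. */
int error_detected_unsigned(int fail)
{
	size_t copied = demo_copy(fail);

	/* Always false: a size_t can never be less than zero. */
	if (copied < 0)
		return 1;
	return 0;
}

/* Reworked shape: keep the result signed until it has been checked. */
int error_detected_signed(int fail)
{
	ssize_t copied = demo_copy(fail);

	if (copied < 0)
		return 1;
	return 0;
}
```

In the unsigned variant a failure is silently turned into a very large "copied" value, which is why static checkers report the comparison.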
Re: [PATCH v5 net-next 03/14] ipv6: Fix nexthdr for reinjection
On Mon, May 16, 2016 at 11:19 AM, Shmulik Ladkani wrote:
> Hi,
>
> On Sun, 15 May 2016 16:42:24 -0700 Tom Herbert wrote:
>> In ip6_input_finish the nexthdr protocol is retrieved from the
>> next header offset that is returned in the cb of the skb.
>> This method does not work for UDP encapsulation that may not
>> even have a concept of a nexthdr field (e.g. FOU).
>>
>> This patch checks for a final protocol (INET6_PROTO_FINAL) when a
>> protocol handler returns > 1. If the protocol is not final then
>
> If you respin due to other reasons: s/> 1/> 0/
>

Will do.

Thanks!
Tom

>> resubmission is performed on nhoff value. If the protocol is final
>> then the nexthdr is taken to be the return value.
>>
>> Signed-off-by: Tom Herbert
>
> Reviewed-by: Shmulik Ladkani
Re: [PATCH v5 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6
On Mon, May 16, 2016 at 11:28 AM, Tom Herbert wrote:
> On Mon, May 16, 2016 at 11:13 AM, Alexander Duyck wrote:
>> On Mon, May 16, 2016 at 11:07 AM, Tom Herbert wrote:
>>> On Mon, May 16, 2016 at 9:32 AM, Alexander Duyck wrote:
>>>> On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
>>>>> This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
>>>>> SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
>>>>> NETIF_F_GSO_IPXIP6. These are used to describe IP in IP
>>>>> tunnel and what the outer protocol is. The inner protocol
>>>>> can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
>>>>> SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
>>>>> are removed (these are both instances of SKB_GSO_IPXIP4).
>>>>> SKB_GSO_IPXIP6 will be used when support for GSO with IP
>>>>> encapsulation over IPv6 is added.
>>>>>
>>>>> Signed-off-by: Tom Herbert
>>>>> ---
>>>>> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 5 ++---
>>>>> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
>>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +--
>>>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 3 +--
>>>>> drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 +--
>>>>> drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 +--
>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 3 +--
>>>>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
>>>>> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +--
>>>>> include/linux/netdev_features.h | 12 ++--
>>>>> include/linux/netdevice.h | 4 ++--
>>>>> include/linux/skbuff.h | 4 ++--
>>>>> net/core/ethtool.c | 4 ++--
>>>>> net/ipv4/af_inet.c | 2 +-
>>>>> net/ipv4/ipip.c | 2 +-
>>>>> net/ipv6/ip6_offload.c | 4 ++--
>>>>> net/ipv6/sit.c | 4 ++--
>>>>> net/netfilter/ipvs/ip_vs_xmit.c | 17 +++--
>>>>> 18 files changed, 36 insertions(+), 47 deletions(-)
>>>>
>>>> It looks like you missed drivers/net/ethernet/intel/igb/netdev.c. If
>>>> you don't get it then it will break the build.
>>>>
>>>
>>> I don't see that file in netdev branch, maybe it's new?
>>
>> Nope, it has been there for a while. It got patched to support IPIP
>> and SIT tunnels in the same patch that updated igb/igb_main.c.
>>
>
> Looks like it's
>
> drivers/net/ethernet/intel/igbvf/netdev.c
>
>> I am also looking into the other patches now. It looks like something
>> broke hardware offloads again as I am only getting .9 Mb/s for IPv6
>> based GRE tunnels with your patches applied. I'm trying to bisect it
>> now.
>>
> What hardware are you using?

I'm using an Intel X710, it is an i40e based NIC.

- Alex
Re: [PATCH iproute2] ip link: Add support for kernel side filtering
On 5/16/16 12:27 PM, David Ahern wrote:
> In general older kernels do not parse the attributes appended to the get request.

Sorry, wrong wording: the attributes are parsed but ignored. I just checked an older 3.4 kernel tree, and that holds there as well as in kernels prior to the commit that added this feature.
Re: [PATCH v5 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6
On Mon, May 16, 2016 at 11:13 AM, Alexander Duyck wrote:
> On Mon, May 16, 2016 at 11:07 AM, Tom Herbert wrote:
>> On Mon, May 16, 2016 at 9:32 AM, Alexander Duyck wrote:
>>> On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
>>>> This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
>>>> SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
>>>> NETIF_F_GSO_IPXIP6. These are used to describe IP in IP
>>>> tunnel and what the outer protocol is. The inner protocol
>>>> can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
>>>> SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
>>>> are removed (these are both instances of SKB_GSO_IPXIP4).
>>>> SKB_GSO_IPXIP6 will be used when support for GSO with IP
>>>> encapsulation over IPv6 is added.
>>>>
>>>> Signed-off-by: Tom Herbert
>>>> ---
>>>> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 5 ++---
>>>> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
>>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +--
>>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 3 +--
>>>> drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 +--
>>>> drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 +--
>>>> drivers/net/ethernet/intel/igb/igb_main.c | 3 +--
>>>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
>>>> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +--
>>>> include/linux/netdev_features.h | 12 ++--
>>>> include/linux/netdevice.h | 4 ++--
>>>> include/linux/skbuff.h | 4 ++--
>>>> net/core/ethtool.c | 4 ++--
>>>> net/ipv4/af_inet.c | 2 +-
>>>> net/ipv4/ipip.c | 2 +-
>>>> net/ipv6/ip6_offload.c | 4 ++--
>>>> net/ipv6/sit.c | 4 ++--
>>>> net/netfilter/ipvs/ip_vs_xmit.c | 17 +++--
>>>> 18 files changed, 36 insertions(+), 47 deletions(-)
>>>
>>> It looks like you missed drivers/net/ethernet/intel/igb/netdev.c. If
>>> you don't get it then it will break the build.
>>>
>>
>> I don't see that file in netdev branch, maybe it's new?
>
> Nope, it has been there for a while. It got patched to support IPIP
> and SIT tunnels in the same patch that updated igb/igb_main.c.
>

Looks like it's

drivers/net/ethernet/intel/igbvf/netdev.c

> I am also looking into the other patches now. It looks like something
> broke hardware offloads again as I am only getting .9 Mb/s for IPv6
> based GRE tunnels with your patches applied. I'm trying to bisect it
> now.
>

What hardware are you using?

Tom

> - Alex
Re: [PATCH iproute2] ip link: Add support for kernel side filtering
On 5/16/16 12:19 PM, Stephen Hemminger wrote:
> On Wed, 11 May 2016 06:51:58 -0700 David Ahern wrote:
>> Kernel gained support for filtering link dumps with commit dc599f76c22b
>> ("net: Add support for filtering link dump by master device and kind").
>> Add support to ip link command. If a user passes master device or
>> kind to ip link command they are added to the link dump request message.
>>
>> Signed-off-by: David Ahern
>> ---
>> include/libnetlink.h | 6 ++
>> ip/ipaddress.c | 33 -
>> lib/libnetlink.c | 28
>> 3 files changed, 66 insertions(+), 1 deletion(-)
>
> Was this tested on older kernels? Don't want to add something that
> breaks when run on old kernels that are in stable distros.

Yes. Not really far back but older 4.x kernels. In general older kernels do not parse the attributes appended to the get request. This is very similar to the neigh filter added by b8c753245bad3f13a03b105b724ff406d278c753.
Re: [PATCH] phy dp83867: depend on CONFIG_OF_MDIO
Alex

On 05/16/2016 12:57 PM, Alexander Graf wrote:
> Hi Dan,
>
> On 16.05.16 15:38, Dan Murphy wrote:
>> Alexander
>>
>> On 05/16/2016 06:28 AM, Alexander Graf wrote:
>>> The DP83867 phy driver doesn't actually work when CONFIG_OF_MDIO isn't
>>> enabled.
>>> It simply passes the device tree test, but leaves all internal configuration
>>> initialized at 0. Then it configures the phy with those values and renders a
>>> previously working configuration useless.
>>>
>>> This patch makes sure that we only build the DP83867 phy code when
>>> CONFIG_OF_MDIO is set, to not run into that problem.
>>>
>>> Signed-off-by: Alexander Graf
>>> ---
>>> drivers/net/phy/Kconfig | 1 +
>>> drivers/net/phy/dp83867.c | 7 ---
>>> 2 files changed, 1 insertion(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
>>> index 6dad9a9..4265ad5 100644
>>> --- a/drivers/net/phy/Kconfig
>>> +++ b/drivers/net/phy/Kconfig
>>> @@ -148,6 +148,7 @@ config DP83848_PHY
>>>
>>>  config DP83867_PHY
>>>  	tristate "Drivers for Texas Instruments DP83867 Gigabit PHY"
>>> +	depends on OF_MDIO
>>>  	---help---
>>>  	  Currently supports the DP83867 PHY.
>>>
>>> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
>>> index 2afa61b..ff867ba 100644
>>> --- a/drivers/net/phy/dp83867.c
>>> +++ b/drivers/net/phy/dp83867.c
>>> @@ -99,7 +99,6 @@ static int dp83867_config_intr(struct phy_device *phydev)
>>>  	return phy_write(phydev, MII_DP83867_MICR, micr_status);
>>>  }
>>>
>>> -#ifdef CONFIG_OF_MDIO
>>>  static int dp83867_of_init(struct phy_device *phydev)
>>>  {
>>>  	struct dp83867_private *dp83867 = phydev->priv;
>>> @@ -123,12 +122,6 @@ static int dp83867_of_init(struct phy_device *phydev)
>>>  	return of_property_read_u32(of_node, "ti,fifo-depth",
>>>  				   &dp83867->fifo_depth);
>>>  }
>>> -#else
>>> -static int dp83867_of_init(struct phy_device *phydev)
>>> -{
>>> -	return 0;
>>> -}
>>> -#endif /* CONFIG_OF_MDIO */
>>>
>>>  static int dp83867_config_init(struct phy_device *phydev)
>>>  {
>> I don't think we want this to depend solely on OF_MDIO.
>>
>> The #else case should probably be coded to look at platform data, if
>> it exists. I don't have any boards that still used platform data to test this
>> out so I did not feel comfortable adding code I could not test.
> Since there was no code to look at platform data, those boards would be
> broken just as well today, no? So at the end of the day, this change
> should be no regression for them.

As Andrew pointed out, if you are not using RGMII you don't need internal delay or fifo_depth, so making the driver dependent on OF_MDIO does not make sense.

The DP83867 RGMII tx and rx delays and fifo should really be changed to optional parameters and only programmed if set.

Dan

>
> Alex

-- 
--
Dan Murphy
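The suggestion above (optional parameters, programmed only if set) could look roughly like the following userspace sketch. Nothing here is kernel API: `demo_read_u32()`, `struct demo_cfg`, `struct demo_prop` and the `DEMO_UNSET` sentinel are hypothetical stand-ins for `of_property_read_u32()` and the driver's private data, chosen only to show the per-property shape of the idea.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_UNSET (-1)	/* hypothetical sentinel: property absent */

/* Hypothetical flat property list standing in for a device tree node. */
struct demo_prop {
	const char *name;
	uint32_t value;
};

struct demo_cfg {
	int rx_id_delay;
	int tx_id_delay;
	int fifo_depth;
};

/* Stand-in for of_property_read_u32(): 0 on success, -1 if not found. */
static int demo_read_u32(const struct demo_prop *props, size_t n,
			 const char *name, uint32_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(props[i].name, name) == 0) {
			*out = props[i].value;
			return 0;
		}
	}
	return -1;
}

/* Read one optional property; report DEMO_UNSET if it is missing. */
static int demo_read_optional(const struct demo_prop *props, size_t n,
			      const char *name)
{
	uint32_t val;

	return demo_read_u32(props, n, name, &val) ? DEMO_UNSET : (int)val;
}

/* Each property is parsed independently; a missing one is not an error. */
void demo_parse(const struct demo_prop *props, size_t n, struct demo_cfg *cfg)
{
	cfg->rx_id_delay = demo_read_optional(props, n, "ti,rx-internal-delay");
	cfg->tx_id_delay = demo_read_optional(props, n, "ti,tx-internal-delay");
	cfg->fifo_depth = demo_read_optional(props, n, "ti,fifo-depth");
}

/* Count how many fields a config step would actually have to program. */
int demo_fields_to_program(const struct demo_cfg *cfg)
{
	return (cfg->rx_id_delay != DEMO_UNSET) +
	       (cfg->tx_id_delay != DEMO_UNSET) +
	       (cfg->fifo_depth != DEMO_UNSET);
}
```

Compared with a single all-or-nothing flag, tracking each value separately lets a board supply only the properties it cares about, at the cost of one extra check per register field.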
Re: [PATCH iproute2 -next] ingress, clsact: don't add TCA_OPTIONS to nl msg
On Sun, 15 May 2016 18:36:03 +0200 Daniel Borkmann wrote:
> In ingress and clsact qdisc TCA_OPTIONS are ignored, since it's
> parameterless. In tc, we add an empty addattr_l(... TCA_OPTIONS,
> NULL, 0) to the netlink message nevertheless. This has the
> side effect that when someone tries a 'tc qdisc replace' and
> already an existing such qdisc is present, tc fails with
> EINVAL here.
>
> Reason is that in the kernel, this invokes qdisc_change() when
> such requested qdisc is already present. When TCA_OPTIONS are
> passed to modify parameters, it looks whether qdisc implements
> .change() callback, and if not present (like in both cases here)
> it returns with error. Rather than adding an empty stub to the
> kernel that ignores TCA_OPTIONS again, just don't add TCA_OPTIONS
> to the netlink message in the first place.
>
> Before:
>
> # tc qdisc replace dev foo clsact    # first try
> # tc qdisc replace dev foo clsact    # second one
> RTNETLINK answers: Invalid argument
>
> After:
>
> # tc qdisc replace dev foo clsact
> # tc qdisc replace dev foo clsact
> # tc qdisc replace dev foo clsact
>
> Signed-off-by: Daniel Borkmann
> ---
> tc/q_clsact.c | 1 -
> tc/q_ingress.c | 1 -
> 2 files changed, 2 deletions(-)

Applied to net-next
Re: [iproute2 net-next repost 1/2] devlink: implement shared buffer support
On Sat, 14 May 2016 15:21:01 +0200 Jiri Pirko wrote:
> From: Jiri Pirko
>
> Implement kernel devlink shared buffer interface. Introduce new object
> "sb" and allow to browse the shared buffer parameters and also change
> configuration.
>
> Signed-off-by: Jiri Pirko
> ---
> devlink/devlink.c | 653 +-
> 1 file changed, 652 insertions(+), 1 deletion(-)

Both applied to net-next
Re: [PATCH iproute2] ip link: Add support for kernel side filtering
On Wed, 11 May 2016 06:51:58 -0700 David Ahern wrote:
> Kernel gained support for filtering link dumps with commit dc599f76c22b
> ("net: Add support for filtering link dump by master device and kind").
> Add support to ip link command. If a user passes master device or
> kind to ip link command they are added to the link dump request message.
>
> Signed-off-by: David Ahern
> ---
> include/libnetlink.h | 6 ++
> ip/ipaddress.c | 33 -
> lib/libnetlink.c | 28
> 3 files changed, 66 insertions(+), 1 deletion(-)
>

Was this tested on older kernels? Don't want to add something that breaks when run on old kernels that are in stable distros.
Re: [PATCH v5 net-next 03/14] ipv6: Fix nexthdr for reinjection
Hi,

On Sun, 15 May 2016 16:42:24 -0700 Tom Herbert wrote:
> In ip6_input_finish the nexthdr protocol is retrieved from the
> next header offset that is returned in the cb of the skb.
> This method does not work for UDP encapsulation that may not
> even have a concept of a nexthdr field (e.g. FOU).
>
> This patch checks for a final protocol (INET6_PROTO_FINAL) when a
> protocol handler returns > 1. If the protocol is not final then

If you respin due to other reasons: s/> 1/> 0/

> resubmission is performed on nhoff value. If the protocol is final
> then the nexthdr is taken to be the return value.
>
> Signed-off-by: Tom Herbert

Reviewed-by: Shmulik Ladkani
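The rule described in the commit message, with the suggested `> 0` wording, can be sketched in plain C. This only illustrates the decision and is not the code in ip6_input_finish(); `demo_next_proto()`, its parameters and `DEMO_DONE` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DONE (-1)	/* hypothetical: no resubmission requested */

/*
 * ret:            value returned by the protocol handler
 * proto_is_final: handler is flagged as a final protocol (INET6_PROTO_FINAL)
 * nexthdr_at_off: nexthdr byte found at the offset recorded for the packet
 *
 * Returns the protocol number to resubmit with, or DEMO_DONE.
 */
int demo_next_proto(int ret, bool proto_is_final, uint8_t nexthdr_at_off)
{
	if (ret <= 0)
		return DEMO_DONE;	/* handler did not ask for resubmission */

	if (proto_is_final)
		return ret;	/* e.g. UDP encapsulation: ret is the next protocol */

	return nexthdr_at_off;	/* not final: use the nexthdr value at nhoff */
}
```

For example, a final handler returning 47 would resubmit the packet as protocol 47, while a non-final handler returning any positive value falls back to the nexthdr byte at the recorded offset.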
Re: [iproute2 PATCH 1/1] tc fix ife late binding
On Sun, 8 May 2016 11:28:49 -0400 Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim
>
> following late action binding didn't work:
>
> sudo tc actions add action ife encode \
> type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1
>
> sudo tc filter add dev lo parent : protocol ip prio 2 u32 \
> match ip src 127.0.0.2/32 flowid 1:2 action ife index 1
>
> Signed-off-by: Jamal Hadi Salim

Ok, applied all the ife patches (for 4.6)
Re: iwlwifi: mvm: add reorder buffer per queue
I can't even describe how much I hate the concept of the reorder buffer in general. Ordering is the endpoint's problem. Someday, after we get fq_codeled, short queues again, I'll be able to show why.

On Mon, May 16, 2016 at 4:41 AM, Luca Coelho wrote:
> On Fri, 2016-05-13 at 11:54 +0300, Dan Carpenter wrote:
>> Hello Sara Sharon,
>>
>> The patch b915c10174fb: "iwlwifi: mvm: add reorder buffer per queue"
>> from Mar 23, 2016, leads to the following static checker warnings:
>>
>> drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:912
>> iwl_mvm_rx_mpdu_mq()
>> error: potential NULL dereference 'sta'.
>>
>> drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:912
>> iwl_mvm_rx_mpdu_mq()
>> error: we previously assumed 'sta' could be null (see line 796)
>
> Thanks for the analysis and report, Dan!
>
> I have queued a fix for this through our internal tree.
>
> --
> Cheers,
> Luca.

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org
Re: [PATCH v5 net-next 02/14] net: define gso types for IPx over IPv4 and IPv6
On Mon, May 16, 2016 at 11:07 AM, Tom Herbert wrote:
> On Mon, May 16, 2016 at 9:32 AM, Alexander Duyck wrote:
>> On Sun, May 15, 2016 at 4:42 PM, Tom Herbert wrote:
>>> This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
>>> SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
>>> NETIF_F_GSO_IPXIP6. These are used to describe IP in IP
>>> tunnel and what the outer protocol is. The inner protocol
>>> can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
>>> SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
>>> are removed (these are both instances of SKB_GSO_IPXIP4).
>>> SKB_GSO_IPXIP6 will be used when support for GSO with IP
>>> encapsulation over IPv6 is added.
>>>
>>> Signed-off-by: Tom Herbert
>>> ---
>>> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 5 ++---
>>> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++--
>>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 +--
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 3 +--
>>> drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 +--
>>> drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 +--
>>> drivers/net/ethernet/intel/igb/igb_main.c | 3 +--
>>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
>>> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +--
>>> include/linux/netdev_features.h | 12 ++--
>>> include/linux/netdevice.h | 4 ++--
>>> include/linux/skbuff.h | 4 ++--
>>> net/core/ethtool.c | 4 ++--
>>> net/ipv4/af_inet.c | 2 +-
>>> net/ipv4/ipip.c | 2 +-
>>> net/ipv6/ip6_offload.c | 4 ++--
>>> net/ipv6/sit.c | 4 ++--
>>> net/netfilter/ipvs/ip_vs_xmit.c | 17 +++--
>>> 18 files changed, 36 insertions(+), 47 deletions(-)
>>
>> It looks like you missed drivers/net/ethernet/intel/igb/netdev.c. If
>> you don't get it then it will break the build.
>>
>
> I don't see that file in netdev branch, maybe it's new?

Nope, it has been there for a while. It got patched to support IPIP and SIT tunnels in the same patch that updated igb/igb_main.c.

I am also looking into the other patches now. It looks like something broke hardware offloads again as I am only getting .9 Mb/s for IPv6 based GRE tunnels with your patches applied. I'm trying to bisect it now.

- Alex
Re: pull-request: wireless-drivers-next 2016-05-13
On Mon, 2016-05-16 at 17:08 +0300, Kalle Valo wrote:
> Kalle Valo writes:
>
> > Kalle Valo writes:
> >
> > > The following changes since commit
> > > ede00a5ceb4d903a8c137a52bb77d574baaef8bd:
> > >
> > >   Merge tag 'wireless-drivers-next-for-davem-2016-05-02' of
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next (2016-05-03 00:35:16 -0400)
> > >
> > > are available in the git repository at:
> > >
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2016-05-13
> >
> > Please don't pull this yet, there might be something wrong now with
> > merges and need to check that first.
>
> Ok, like discussed in thread "linux-next: manual merge of the
> wireless-drivers-next tree with the net-next tree" there seems to be a
> problem on net-next in function iwl_mvm_set_tx_cmd(). Here is how I
> propose to fix this.
>
> When pulling the tag above you should get a conflict like this:
>
> diff --cc drivers/net/wireless/intel/iwlwifi/mvm/tx.c
> index 880210917a6f,779bafcbc9a1..
> --- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
> +++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
> @@@ -294,7 -295,7 +294,11 @@@ void iwl_mvm_set_tx_cmd(struct iwl_mvm
>         tx_cmd->tx_flags = cpu_to_le32(tx_flags);
>         /* Total # bytes to be transmitted */
>         tx_cmd->len = cpu_to_le16((u16)skb->len +
> ++<<<<<<< HEAD
> +                                 (uintptr_t)info->driver_data[0]);
> ++=======
> +                                 (uintptr_t)skb_info->driver_data[0]);
> ++>>>>>>> master
>         tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
>         tx_cmd->sta_id = sta_id;
>
> Pick the latter with skb_info and then add skb_info to the beginning of
> the same function. So the function should be:
>
> void iwl_mvm_set_tx_cmd(struct iwl_mvm *mvm, struct sk_buff *skb,
>                         struct iwl_tx_cmd *tx_cmd,
>                         struct ieee80211_tx_info *info, u8 sta_id)
> {
>         struct ieee80211_tx_info *skb_info = IEEE80211_SKB_CB(skb);
>         struct ieee80211_hdr *hdr = (void *)skb->data;
>         __le16 fc = hdr->frame_control;
>         u32 tx_flags = le32_to_cpu(tx_cmd->tx_flags);
>         u32 len = skb->len + FCS_LEN;
>         u8 ac;
>
> [...]
>
>         tx_cmd->tx_flags = cpu_to_le32(tx_flags);
>         /* Total # bytes to be transmitted */
>         tx_cmd->len = cpu_to_le16((u16)skb->len +
>                                   (uintptr_t)skb_info->driver_data[0]);
>         tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
>         tx_cmd->sta_id = sta_id;
>
> Sorry about the hassle and please let me know if you have any
> problems.
> Adding Luca and Emmanuel just in case I missed something.

ACK. This looks correct.

I just diffed the iwlwifi-next.git tree (at commit a525d0eab17d -- which is where I merge iwlwifi-fixes into iwlwifi-next) with net-next.git master and the difference [1] is exactly what you proposed to fix.

[1] http://pastebin.coelho.fi/1b6907cdb9a25413.txt

--
Cheers,
Luca.