[PATCH] net: support compat 64-bit time in {s,g}etsockopt

2018-04-25 Thread Lance Richardson
For the x32 ABI, struct timeval has two 64-bit fields. However
the kernel currently interprets the user-space values used for
the SO_RCVTIMEO and SO_SNDTIMEO socket options as having a pair
of 32-bit fields.

When the seconds portion of the requested timeout is less than 2**32,
the seconds portion of the effective timeout is correct but the
microseconds portion is zero.  When the seconds portion of the
requested timeout is zero and the microseconds portion is non-zero,
the kernel interprets the timeout as zero (never timeout).

Fix by using 64-bit time for SO_RCVTIMEO/SO_SNDTIMEO as required
for the ABI.

The code included below demonstrates the problem.

Results before patch:
$ gcc -m64 -Wall -O2 -o socktmo socktmo.c && ./socktmo
recv time: 2.008181 seconds
send time: 2.015985 seconds

$ gcc -m32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
recv time: 2.016763 seconds
send time: 2.016062 seconds

$ gcc -mx32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
recv time: 1.007239 seconds
send time: 1.023890 seconds

Results after patch:
$ gcc -m64 -O2 -Wall -o socktmo socktmo.c && ./socktmo
recv time: 2.010062 seconds
send time: 2.015836 seconds

$ gcc -m32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
recv time: 2.013974 seconds
send time: 2.015981 seconds

$ gcc -mx32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
recv time: 2.030257 seconds
send time: 2.013383 seconds

 #include 
 #include 
 #include 
 #include 
 #include 

 void checkrc(char *str, int rc)
 {
 if (rc >= 0)
 return;

 perror(str);
 exit(1);
 }

 static char buf[1024];
 int main(int argc, char **argv)
 {
 int rc;
 int socks[2];
 struct timeval tv;
 struct timeval start, end, delta;

 rc = socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
 checkrc("socketpair", rc);

 /* set timeout to 1.99 seconds */
 tv.tv_sec = 1;
 tv.tv_usec = 99;
 rc = setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, , sizeof tv);
 rc = setsockopt(socks[0], SOL_SOCKET, SO_SNDTIMEO, , sizeof tv);
 checkrc("setsockopt", rc);

 /* measure actual receive timeout */
 gettimeofday(, NULL);
 rc = recv(socks[0], buf, sizeof buf, 0);
 gettimeofday(, NULL);
 timersub(, , );

 printf("recv time: %ld.%06ld seconds\n",
(long)delta.tv_sec, (long)delta.tv_usec);

 /* fill send buffer */
 do {
 rc = send(socks[0], buf, sizeof buf, 0);
 } while (rc > 0);

 /* measure actual send timeout */
 gettimeofday(, NULL);
 rc = send(socks[0], buf, sizeof buf, 0);
 gettimeofday(, NULL);
 timersub(, , );

 printf("send time: %ld.%06ld seconds\n",
(long)delta.tv_sec, (long)delta.tv_usec);
 exit(0);
 }

Fixes: 515c7af85ed9 ("x32: Use compat shims for {g,s}etsockopt")
Reported-by: Gopal RajagopalSai <gopals...@gmail.com>
Signed-off-by: Lance Richardson <lance.richardson@gmail.com>
---
 net/compat.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/compat.c b/net/compat.c
index 5ae7437d3853..7242cce5631b 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -377,7 +377,8 @@ static int compat_sock_setsockopt(struct socket *sock, int 
level, int optname,
optname == SO_ATTACH_REUSEPORT_CBPF)
return do_set_attach_filter(sock, level, optname,
optval, optlen);
-   if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO)
+   if (!COMPAT_USE_64BIT_TIME &&
+   (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO))
return do_set_sock_timeout(sock, level, optname, optval, 
optlen);
 
return sock_setsockopt(sock, level, optname, optval, optlen);
@@ -448,7 +449,8 @@ static int do_get_sock_timeout(struct socket *sock, int 
level, int optname,
 static int compat_sock_getsockopt(struct socket *sock, int level, int optname,
char __user *optval, int __user *optlen)
 {
-   if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO)
+   if (!COMPAT_USE_64BIT_TIME &&
+   (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO))
return do_get_sock_timeout(sock, level, optname, optval, 
optlen);
return sock_getsockopt(sock, level, optname, optval, optlen);
 }
-- 
2.11.0



Re: [iproute PATCH 21/51] lib/libnetlink: Don't pass NULL parameter to memcpy()

2017-08-18 Thread Lance Richardson
> From: "Phil Sutter" <p...@nwl.cc>
> To: "Stephen Hemminger" <step...@networkplumber.org>
> Cc: netdev@vger.kernel.org
> Sent: Tuesday, August 15, 2017 12:42:55 PM
> Subject: Re: [iproute PATCH 21/51] lib/libnetlink: Don't pass NULL parameter 
> to memcpy()
> 
> On Tue, Aug 15, 2017 at 08:15:55AM -0700, Stephen Hemminger wrote:
> > On Sat, 12 Aug 2017 14:04:40 +0200
> > Phil Sutter <p...@nwl.cc> wrote:
> > 
> > > Both addattr_l() and rta_addattr_l() may be called with NULL data
> > > pointer and 0 alen parameters. Avoid calling memcpy() in that case.
> > > 
> > > Signed-off-by: Phil Sutter <p...@nwl.cc>
> > 
> > What are you fixing. memcpy(dest, NULL, 0) should be harmless NOP
> 
> Yes, if that turns into a NOP this patch is not needed.
> 
> Thanks, Phil
> 

It is a NOP in this case, but it is also "undefined behavior" and can lead
to the compiler assuming that dest != NULL, which would be problematic
if dest were dereferenced later in the code (it isn't in this case, but
might be in general).

A small example with current gcc:

foo.c:
#include 

extern void foo(char *, size_t);

int main(int argc, char **argv)
{
char x[128];

foo(x, sizeof x);
foo(NULL, 0);

return 0;
}

bar.c:
#include 
#include 

void foo(char *ptr, size_t len)
{
memset(ptr, 0, len);

if (ptr)
printf("ptr is non-null: %p\n", ptr);
}

Compile the code:

$ gcc -o foobar -O2 foo.c bar.c

Execute it (note second line of output, which might be surprising):

$ ./foobar
ptr is non-null: 0x7ffdc47daef0
ptr is non-null: (nil)


Regards,

Lance Richardson


Re: Faster TCP keepalive

2017-06-26 Thread Lance Richardson
> From: "Stephen Suryaputra Lin" 
> To: netdev@vger.kernel.org
> Sent: Friday, 23 June, 2017 3:58:17 PM
> Subject: Faster TCP keepalive
> 
> Greetings,
> 
> I'm writing this to probe if there has been thoughts or efforts in
> allowing sub-second TCP keep alive interval? One application is for TCP
> connections between IP hosts connected by an internal backplane where a
> faster detection is a necessity and the increased traffic can be
> accommodated.
> 
> Suggestions on other ways to quickly tearing down TCP connections to a
> rebooted host in the application above are welcomed.
> 

For quickly tearing down TCP connections to a host that has been rebooted,
avoiding e.g. waiting for TCP max retransmit count to be reached), you
could use SOCK_DESTROY (as e.g. iproute2's "ss -K [ FILTER ]" does).

Regards,

  Lance

> Thank you,
> 
> Stephen.
> 


Re: [PATCH net] vxlan: eliminate cached dst leak

2017-06-01 Thread Lance Richardson
> From: "David Miller" <da...@davemloft.net>
> To: lrich...@redhat.com
> Cc: netdev@vger.kernel.org, pab...@redhat.com
> Sent: Thursday, 1 June, 2017 2:46:48 PM
> Subject: Re: [PATCH net] vxlan: eliminate cached dst leak
> 
> From: Lance Richardson <lrich...@redhat.com>
> Date: Mon, 29 May 2017 13:25:57 -0400
> 
> > After commit 0c1d70af924b ("net: use dst_cache for vxlan device"),
> > cached dst entries could be leaked when more than one remote was
> > present for a given vxlan_fdb entry, causing subsequent netns
> > operations to block indefinitely and "unregister_netdevice: waiting
> > for lo to become free." messages to appear in the kernel log.
> > 
> > Fix by properly releasing cached dst and freeing resources in this
> > case.
> > 
> > Fixes: commit 0c1d70af924b ("net: use dst_cache for vxlan device")
> 
> In the future please do not put that "commit " string there, it's not
> needed.

Oops, sorry about that, I know I've made that mistake before (both times
because a checkpatch.pl warning in the body made me think I needed to
change the Fixes: tag as well).  Now I'm tempted to roll a checkpatch.pl
improvement...

> 
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> 
> Applied and queued up for -stable, thank you.
> 


Re: [PATCH net] vxlan: eliminate cached dst leak

2017-05-31 Thread Lance Richardson
> From: "Lance Richardson" <lrich...@redhat.com>
> To: netdev@vger.kernel.org, pab...@redhat.com
> Sent: Monday, 29 May, 2017 1:25:57 PM
> Subject: [PATCH net] vxlan: eliminate cached dst leak
> 
> After commit 0c1d70af924b ("net: use dst_cache for vxlan device"),
> cached dst entries could be leaked when more than one remote was
> present for a given vxlan_fdb entry, causing subsequent netns
> operations to block indefinitely and "unregister_netdevice: waiting
> for lo to become free." messages to appear in the kernel log.
> 
> Fix by properly releasing cached dst and freeing resources in this
> case.
> 
> Fixes: commit 0c1d70af924b ("net: use dst_cache for vxlan device")
> Signed-off-by: Lance Richardson <lrich...@redhat.com>
> ---

This problem was originally debugged and the patch tested in an OpenStack
(devstack) test environment. Here's a small(-ish) reproducer script that
was cooked up after posting:

ip netns add ns0
ip netns add ns1
ip netns add ns2

ip link add p0 type veth peer name p1
ip link add p2 type veth peer name p3
ip link add p4 type veth peer name p5

ip link add name br0 type bridge
ip link set br0 up

ip link set p0 master br0 up
ip link set p1 netns ns0

ip link set p2 master br0 up
ip link set p3 netns ns1

ip link set p4 master br0 up
ip link set p5 netns ns2

ip netns exec ns0 ip addr add "1.1.1.1/24" dev p1
ip netns exec ns0 ip link set dev p1 up

ip netns exec ns1 ip addr add "1.1.1.2/24" dev p3
ip netns exec ns1 ip link set dev p3 up

ip netns exec ns2 ip addr add "1.1.1.3/24" dev p5
ip netns exec ns2 ip link set dev p5 up

ip netns exec ns0 ip link add vxlan0 type vxlan dstport 4789 id 10 dev p1
ip netns exec ns0 ip addr add "4.1.1.1/24" dev vxlan0
ip netns exec ns0 ip link set dev vxlan0 up mtu 1450

ip netns exec ns1 ip link add vxlan1 type vxlan dstport 4789 id 10 dev p3
ip netns exec ns1 ip addr add "4.1.1.2/24" dev vxlan1
ip netns exec ns1 ip link set dev vxlan1 up mtu 1450

ip netns exec ns2 ip link add vxlan2 type vxlan dstport 4789 id 10 dev p5
ip netns exec ns2 ip addr add "4.1.1.3/24" dev vxlan2
ip netns exec ns2 ip link set dev vxlan2 up mtu 1450

# Create a vxlan default fdb entry with two remotes in the list
ip netns exec ns0 bridge fdb append to 00:00:00:00:00:00 dst 1.1.1.2 dev vxlan0
ip netns exec ns0 bridge fdb append to 00:00:00:00:00:00 dst 1.1.1.3 dev vxlan0

# Forward some packets to populate dst cache for default fdb
ip netns exec ns0 ping -c 2 4.1.1.2
ip netns exec ns0 ping -c 2 4.1.1.3

# delete one of the entries in the fdb remotes list to trigger the bug
ip netns exec ns0 bridge fdb del to 00:00:00:00:00:00 dst 1.1.1.3 dev vxlan0

ip netns del ns2
ip netns del ns1
ip netns del ns0

# If bug is triggered, kernel messages similar to this should be logged:
#
#kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
#
# Netns commands like "ip netns add ns3" will hang indefinitely.


[PATCH net] vxlan: eliminate cached dst leak

2017-05-29 Thread Lance Richardson
After commit 0c1d70af924b ("net: use dst_cache for vxlan device"),
cached dst entries could be leaked when more than one remote was
present for a given vxlan_fdb entry, causing subsequent netns
operations to block indefinitely and "unregister_netdevice: waiting
for lo to become free." messages to appear in the kernel log.

Fix by properly releasing cached dst and freeing resources in this
case.

Fixes: commit 0c1d70af924b ("net: use dst_cache for vxlan device")
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 drivers/net/vxlan.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 328b471..5c1d69e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -740,6 +740,22 @@ static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, 
struct vxlan_fdb *f)
call_rcu(>rcu, vxlan_fdb_free);
 }
 
+static void vxlan_dst_free(struct rcu_head *head)
+{
+   struct vxlan_rdst *rd = container_of(head, struct vxlan_rdst, rcu);
+
+   dst_cache_destroy(>dst_cache);
+   kfree(rd);
+}
+
+static void vxlan_fdb_dst_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f,
+ struct vxlan_rdst *rd)
+{
+   list_del_rcu(>list);
+   vxlan_fdb_notify(vxlan, f, rd, RTM_DELNEIGH);
+   call_rcu(>rcu, vxlan_dst_free);
+}
+
 static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
   union vxlan_addr *ip, __be16 *port, __be32 *src_vni,
   __be32 *vni, u32 *ifindex)
@@ -864,9 +880,7 @@ static int __vxlan_fdb_delete(struct vxlan_dev *vxlan,
 * otherwise destroy the fdb entry
 */
if (rd && !list_is_singular(>remotes)) {
-   list_del_rcu(>list);
-   vxlan_fdb_notify(vxlan, f, rd, RTM_DELNEIGH);
-   kfree_rcu(rd, rcu);
+   vxlan_fdb_dst_destroy(vxlan, f, rd);
goto out;
}
 
-- 
2.7.4



Re: [PATCH net-next] geneve: fix incorrect setting of UDP checksum flag

2017-04-28 Thread Lance Richardson
> From: "Girish Moodalbail" <girish.moodalb...@oracle.com>
> To: da...@davemloft.net
> Cc: netdev@vger.kernel.org, pshe...@ovn.org
> Sent: Thursday, 27 April, 2017 5:11:53 PM
> Subject: [PATCH net-next] geneve: fix incorrect setting of UDP checksum flag
> 
> Creating a geneve link with 'udpcsum' set results in a creation of link
> for which UDP checksum will NOT be computed on outbound packets, as can
> be seen below.
> 
> 11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
> geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum
> 
> Similarly, creating a link with 'noudpcsum' set results in a creation
> of link for which UDP checksum will be computed on outbound packets.
> 
> Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.")
> Signed-off-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> ---
>  drivers/net/geneve.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 7074b40..dec5d56 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1244,7 +1244,7 @@ static int geneve_newlink(struct net *net, struct
> net_device *dev,
>   metadata = true;
>  
>   if (data[IFLA_GENEVE_UDP_CSUM] &&
> - !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
> + nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
>   info.key.tun_flags |= TUNNEL_CSUM;
>  
>   if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
> --
> 1.8.3.1
> 
> 

Verified issue on 4.10.10 kernel. Note that this doesn't impact
lightweight geneve tunnels (e.g. as used by openvswitch).

Acked-by: Lance Richardson <lrich...@redhat.com>


Re: [PATCH net 3/4] qed: Fix possible system hang in the dcbnl-getdcbx() path.

2017-04-19 Thread Lance Richardson
> From: "Sudarsana Reddy Kalluru" 
> To: da...@davemloft.net
> Cc: netdev@vger.kernel.org, "Yuval Mintz" 
> Sent: Wednesday, 19 April, 2017 6:19:54 AM
> Subject: [PATCH net 3/4] qed: Fix possible system hang in the dcbnl-getdcbx() 
> path.
> 
> qed_dcbnl_get_dcbx() API uses kmalloc in GFT_KERNEL mode. The API gets
> invoked in the interrupt context by qed_dcbnl_getdcbx callback. Need
> to invoke this kmalloc in atomic mode.
> 
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Yuval Mintz 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> index ff058a3..8f0783a 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> @@ -1264,7 +1264,7 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct
> qed_hwfn *hwfn,
>  {
>   struct qed_dcbx_get *dcbx_info;
>  
> - dcbx_info = kzalloc(sizeof(*dcbx_info), GFP_KERNEL);
> + dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);

You are changing a kzalloc to kmalloc, was that intentional?

>   if (!dcbx_info)
>   return NULL;
>  
> --
> 1.8.3.1
> 
> 


Re: [RFC] [net]openvswitch: Clear the ct flow key for the recirculated packet

2017-03-16 Thread Lance Richardson
> From: "Numan Siddique" 
> To: netdev@vger.kernel.org, "ovs dev" 
> Cc: "Joe Stringer" , "Andy Zhou" , ja...@ovn.org
> Sent: Thursday, March 16, 2017 8:25:06 AM
> Subject: [RFC] [net]openvswitch: Clear the ct flow key for the recirculated 
> packet
> 
> It is possible that the ct flow key information would have
> gone stale for the packets received from the userspace due to
> clone or ct_clear actions.
> 
> In the case of OVN, it adds ping responder flows, which modifies
> the original icmp4 request packet to a reply packet. It uses the
> OVS actions - clone and ct_clear. When the reply packet hits the
> "ovs_ct_execute" function, and since the ct flow key info is not
> cleared, the connection tracker doesn't set the state to
> ESTABLISHED state.
> 
> Note: This patch is marked as RFC, as I am not sure if this is the correct
> place to address this issue or it should be addressed in ovs-vswitchd
> to set the OVS_KEY_ATTR_CT_STATE and other related attributes
> properly for ct_clear action.
> 
> Signed-off-by: Numan Siddique 
> ---

Hi Numan,

With this patch applied I'm consistently seeing failures for two of the
kernel datapath unit tests (via "make check-kernel"):

 16: conntrack - force commitFAILED 
(system-traffic.at:692)
 54: conntrack - SNAT with ct_mark change on reply   FAILED 
(system-traffic.at:2446)



[PATCH net-next] csum: eliminate sparse warning in remcsum_unadjust()

2017-01-18 Thread Lance Richardson
Cast second parameter of csum_sub() from __sum16 to __wsum.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 include/net/checksum.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 35d0fab..aef2b2b 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -179,7 +179,7 @@ static inline __wsum remcsum_adjust(void *ptr, __wsum csum,
 
 static inline void remcsum_unadjust(__sum16 *psum, __wsum delta)
 {
-   *psum = csum_fold(csum_sub(delta, *psum));
+   *psum = csum_fold(csum_sub(delta, (__force __wsum)*psum));
 }
 
 #endif
-- 
2.5.5



[PATCH net-next] vxlan: preserve type of dst_port parm for encap_bypass_if_local()

2017-01-18 Thread Lance Richardson
Eliminate sparse warning by maintaining type of dst_port
as __be16.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ca7196c..19b1653 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1951,7 +1951,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, 
struct vxlan_dev *src_vxlan,
 
 static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev,
 struct vxlan_dev *vxlan, union vxlan_addr 
*daddr,
-__be32 dst_port, __be32 vni, struct dst_entry 
*dst,
+__be16 dst_port, __be32 vni, struct dst_entry 
*dst,
 u32 rt_flags)
 {
 #if IS_ENABLED(CONFIG_IPV6)
-- 
2.5.5



[PATCH net] vxlan: fix byte order of vxlan-gpe port number

2017-01-16 Thread Lance Richardson
vxlan->cfg.dst_port is in network byte order, so an htons()
is needed here. Also reduced comment length to stay closer
to 80 column width (still slightly over, however).

Fixes: e1e5314de08b ("vxlan: implement GPE")
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ca7196c..8a79cfc 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2890,7 +2890,7 @@ static int vxlan_dev_configure(struct net *src_net, 
struct net_device *dev,
memcpy(>cfg, conf, sizeof(*conf));
if (!vxlan->cfg.dst_port) {
if (conf->flags & VXLAN_F_GPE)
-   vxlan->cfg.dst_port = 4790; /* IANA assigned VXLAN-GPE 
port */
+   vxlan->cfg.dst_port = htons(4790); /* IANA VXLAN-GPE 
port */
else
vxlan->cfg.dst_port = default_port;
}
-- 
2.5.5



[PATCH net-next] bridge: sparse fixes in br_ip6_multicast_alloc_query()

2017-01-16 Thread Lance Richardson
Changed type of csum field in struct igmpv3_query from __be16 to
__sum16 to eliminate type warning, made same change in struct
igmpv3_report for consistency.

Fixed up an ntohs() where htons() should have been used instead.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 include/uapi/linux/igmp.h | 4 ++--
 net/bridge/br_multicast.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/igmp.h b/include/uapi/linux/igmp.h
index ccbb32a..a97f9a7 100644
--- a/include/uapi/linux/igmp.h
+++ b/include/uapi/linux/igmp.h
@@ -53,7 +53,7 @@ struct igmpv3_grec {
 struct igmpv3_report {
__u8 type;
__u8 resv1;
-   __be16 csum;
+   __sum16 csum;
__be16 resv2;
__be16 ngrec;
struct igmpv3_grec grec[0];
@@ -62,7 +62,7 @@ struct igmpv3_report {
 struct igmpv3_query {
__u8 type;
__u8 code;
-   __be16 csum;
+   __sum16 csum;
__be32 group;
 #if defined(__LITTLE_ENDIAN_BITFIELD)
__u8 qrv:3,
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index b30e77e..f6634612 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -540,7 +540,7 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct 
net_bridge *br,
break;
case 2:
mld2q = (struct mld2_query *)icmp6_hdr(skb);
-   mld2q->mld2q_mrc = ntohs((u16)jiffies_to_msecs(interval));
+   mld2q->mld2q_mrc = htons((u16)jiffies_to_msecs(interval));
mld2q->mld2q_type = ICMPV6_MGM_QUERY;
mld2q->mld2q_code = 0;
mld2q->mld2q_cksum = 0;
-- 
2.5.5



[PATCH net] openvswitch: maintain correct checksum state in conntrack actions

2017-01-12 Thread Lance Richardson
When executing conntrack actions on skbuffs with checksum mode
CHECKSUM_COMPLETE, the checksum must be updated to account for
header pushes and pulls. Otherwise we get "hw csum failure"
logs similar to this (ICMP packet received on geneve tunnel
via ixgbe NIC):

[  405.740065] genev_sys_6081: hw csum failure
[  405.740106] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G  I 
4.10.0-rc3+ #1
[  405.740108] Call Trace:
[  405.740110]  
[  405.740113]  dump_stack+0x63/0x87
[  405.740116]  netdev_rx_csum_fault+0x3a/0x40
[  405.740118]  __skb_checksum_complete+0xcf/0xe0
[  405.740120]  nf_ip_checksum+0xc8/0xf0
[  405.740124]  icmp_error+0x1de/0x351 [nf_conntrack_ipv4]
[  405.740132]  nf_conntrack_in+0xe1/0x550 [nf_conntrack]
[  405.740137]  ? find_bucket.isra.2+0x62/0x70 [openvswitch]
[  405.740143]  __ovs_ct_lookup+0x95/0x980 [openvswitch]
[  405.740145]  ? netif_rx_internal+0x44/0x110
[  405.740149]  ovs_ct_execute+0x147/0x4b0 [openvswitch]
[  405.740153]  do_execute_actions+0x22e/0xa70 [openvswitch]
[  405.740157]  ovs_execute_actions+0x40/0x120 [openvswitch]
[  405.740161]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
[  405.740166]  ovs_vport_receive+0x73/0xd0 [openvswitch]
[  405.740168]  ? udp_rcv+0x1a/0x20
[  405.740170]  ? ip_local_deliver_finish+0x93/0x1e0
[  405.740172]  ? ip_local_deliver+0x6f/0xe0
[  405.740174]  ? ip_rcv_finish+0x3a0/0x3a0
[  405.740176]  ? ip_rcv_finish+0xdb/0x3a0
[  405.740177]  ? ip_rcv+0x2a7/0x400
[  405.740180]  ? __netif_receive_skb_core+0x970/0xa00
[  405.740185]  netdev_frame_hook+0xd3/0x160 [openvswitch]
[  405.740187]  __netif_receive_skb_core+0x1dc/0xa00
[  405.740194]  ? ixgbe_clean_rx_irq+0x46d/0xa20 [ixgbe]
[  405.740197]  __netif_receive_skb+0x18/0x60
[  405.740199]  netif_receive_skb_internal+0x40/0xb0
[  405.740201]  napi_gro_receive+0xcd/0x120
[  405.740204]  gro_cell_poll+0x57/0x80 [geneve]
[  405.740206]  net_rx_action+0x260/0x3c0
[  405.740209]  __do_softirq+0xc9/0x28c
[  405.740211]  irq_exit+0xd9/0xf0
[  405.740213]  do_IRQ+0x51/0xd0
[  405.740215]  common_interrupt+0x93/0x93

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/openvswitch/conntrack.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 6b78bab..54253ea 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -514,7 +514,7 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct 
nf_conn *ct,
int hooknum, nh_off, err = NF_ACCEPT;
 
nh_off = skb_network_offset(skb);
-   skb_pull(skb, nh_off);
+   skb_pull_rcsum(skb, nh_off);
 
/* See HOOK2MANIP(). */
if (maniptype == NF_NAT_MANIP_SRC)
@@ -579,6 +579,7 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct 
nf_conn *ct,
err = nf_nat_packet(ct, ctinfo, hooknum, skb);
 push:
skb_push(skb, nh_off);
+   skb_postpush_rcsum(skb, skb->data, nh_off);
 
return err;
 }
@@ -886,7 +887,7 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
 
/* The conntrack module expects to be working at L3. */
nh_ofs = skb_network_offset(skb);
-   skb_pull(skb, nh_ofs);
+   skb_pull_rcsum(skb, nh_ofs);
 
if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
err = handle_fragments(net, key, info->zone.id, skb);
@@ -900,6 +901,7 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
err = ovs_ct_lookup(net, key, info, skb);
 
skb_push(skb, nh_ofs);
+   skb_postpush_rcsum(skb, skb->data, nh_ofs);
if (err)
kfree_skb(skb);
return err;
-- 
1.8.3.1



Re: Large performance regression with 6in4 tunnel (sit)

2016-11-28 Thread Lance Richardson
> From: "Lance Richardson" <lrich...@redhat.com>
> To: "Stephen Rothwell" <s...@canb.auug.org.au>
> Cc: "Sven-Haegar Koch" <hae...@sdinet.de>, "Eli Cooper" <elicoo...@gmx.com>, 
> netdev@vger.kernel.org, "Eric Dumazet"
> <eric.duma...@gmail.com>
> Sent: Monday, November 28, 2016 12:54:07 PM
> Subject: Re: Large performance regression with 6in4 tunnel (sit)
> 
> > From: "Stephen Rothwell" <s...@canb.auug.org.au>
> > To: "Sven-Haegar Koch" <hae...@sdinet.de>
> > Cc: "Eli Cooper" <elicoo...@gmx.com>, netdev@vger.kernel.org, "Eric
> > Dumazet" <eric.duma...@gmail.com>
> > Sent: Saturday, November 26, 2016 10:23:40 PM
> > Subject: Re: Large performance regression with 6in4 tunnel (sit)
> > 
> > Hi Sven-Haegar,
> > 
> > On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch
> > <hae...@sdinet.de>
> > wrote:
> > >
> > > Somehow this problem description really reminds me of a report on
> > > netdev a bit ago, which the following patch fixed:
> > > 
> > > commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
> > > Author: Lance Richardson <lrich...@redhat.com>
> > > Date:   Wed Nov 2 16:36:17 2016 -0400
> > > 
> > > ipv4: allow local fragmentation in ip_finish_output_gso()
> > > 
> > > Some configurations (e.g. geneve interface with default
> > > MTU of 1500 over an ethernet interface with 1500 MTU) result
> > > in the transmission of packets that exceed the configured MTU.
> > > While this should be considered to be a "bad" configuration,
> > > it is still allowed and should not result in the sending
> > > of packets that exceed the configured MTU.
> > > 
> > > Could this be related?
> > > 
> > > I suppose it would be difficult to test this patch on this machine?
> > 
> > The kernel I am running on is based on 4.7.8, so the above patch
> > doesn't come close to applying. Most fo what it is reverting was
> > introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> > bit to inet_skb_parm.flags") in v4.8-rc1.
> > 
> > --
> > Cheers,
> > Stephen Rothwell
> > 
> 
> This should be equivalent for 4.7.x:
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4bd4921..8a253e2 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -224,8 +224,7 @@ static int ip_finish_output_gso(struct net *net, struct
> sock *sk,
> int ret = 0;
>  
> /* common case: locally created skb or seglen is <= mtu */
> -   if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> - skb_gso_network_seglen(skb) <= mtu)
> +   if (skb_gso_network_seglen(skb) <= mtu)
> return ip_finish_output2(net, sk, skb);
>  
> /* Slowpath -  GSO segment length is exceeding the dst MTU.
> 

BTW, I do think this would be worth trying. For the geneve case, I
measured on the order of a 10X-100X performance hit without this
patch, traces were similar to what you describe (too-large gso packets
were dropped, corresponding TCP segments were retransmitted later via
a non-gso code path).

Regards,

   Lance


Re: Large performance regression with 6in4 tunnel (sit)

2016-11-28 Thread Lance Richardson
> From: "Stephen Rothwell" <s...@canb.auug.org.au>
> To: "Sven-Haegar Koch" <hae...@sdinet.de>
> Cc: "Eli Cooper" <elicoo...@gmx.com>, netdev@vger.kernel.org, "Eric Dumazet" 
> <eric.duma...@gmail.com>
> Sent: Saturday, November 26, 2016 10:23:40 PM
> Subject: Re: Large performance regression with 6in4 tunnel (sit)
> 
> Hi Sven-Haegar,
> 
> On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch <hae...@sdinet.de>
> wrote:
> >
> > Somehow this problem description really reminds me of a report on
> > netdev a bit ago, which the following patch fixed:
> > 
> > commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
> > Author: Lance Richardson <lrich...@redhat.com>
> > Date:   Wed Nov 2 16:36:17 2016 -0400
> > 
> > ipv4: allow local fragmentation in ip_finish_output_gso()
> > 
> > Some configurations (e.g. geneve interface with default
> > MTU of 1500 over an ethernet interface with 1500 MTU) result
> > in the transmission of packets that exceed the configured MTU.
> > While this should be considered to be a "bad" configuration,
> > it is still allowed and should not result in the sending
> > of packets that exceed the configured MTU.
> > 
> > Could this be related?
> > 
> > I suppose it would be difficult to test this patch on this machine?
> 
> The kernel I am running on is based on 4.7.8, so the above patch
> doesn't come close to applying. Most fo what it is reverting was
> introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> bit to inet_skb_parm.flags") in v4.8-rc1.
> 
> --
> Cheers,
> Stephen Rothwell
> 

This should be equivalent for 4.7.x:

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4bd4921..8a253e2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -224,8 +224,7 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
int ret = 0;
 
/* common case: locally created skb or seglen is <= mtu */
-   if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
- skb_gso_network_seglen(skb) <= mtu)
+   if (skb_gso_network_seglen(skb) <= mtu)
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length is exceeding the dst MTU.


[PATCH net v2] ipv4: update comment to document GSO fragmentation cases.

2016-11-09 Thread Lance Richardson
This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
v2: corrected commit ID (v1 used local commit ID by mistake)

 net/ipv4/ip_output.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4971401..c2dae40 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -244,12 +244,18 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
-   /* Slowpath -  GSO segment length is exceeding the dst MTU.
+   /* Slowpath -  GSO segment length exceeds the egress MTU.
 *
-* This can happen in two cases:
-* 1) TCP GRO packet, DF bit not set
-* 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
-* from host network stack.
+* This can happen in several cases:
+*  - Forwarding of a TCP GRO skb, when DF flag is not set.
+*  - Forwarding of an skb that arrived on a virtualization interface
+*(virtio-net/vhost/tap) with TSO/GSO size set by other network
+*stack.
+*  - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
+*interface with a smaller MTU.
+*  - Arriving GRO skb (or GSO skb in a virtualized environment) that is
+*bridged to a NETIF_F_TSO tunnel stacked over an interface with an
+*insufficent MTU.
 */
features = netif_skb_features(skb);
BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
-- 
2.5.5



[PATCH net] ipv4: update comment to document GSO fragmentation cases.

2016-11-04 Thread Lance Richardson
This is a follow-up to commit eb96202f1e34 ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/ip_output.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4971401..c2dae40 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -244,12 +244,18 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
-   /* Slowpath -  GSO segment length is exceeding the dst MTU.
+   /* Slowpath -  GSO segment length exceeds the egress MTU.
 *
-* This can happen in two cases:
-* 1) TCP GRO packet, DF bit not set
-* 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
-* from host network stack.
+* This can happen in several cases:
+*  - Forwarding of a TCP GRO skb, when DF flag is not set.
+*  - Forwarding of an skb that arrived on a virtualization interface
+*(virtio-net/vhost/tap) with TSO/GSO size set by other network
+*stack.
+*  - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
+*interface with a smaller MTU.
+*  - Arriving GRO skb (or GSO skb in a virtualized environment) that is
+*bridged to a NETIF_F_TSO tunnel stacked over an interface with an
+*insufficent MTU.
 */
features = netif_skb_features(skb);
BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
-- 
2.5.5



Re: [PATCH net v3] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-04 Thread Lance Richardson
> From: "Shmulik Ladkani" <shmulik.ladk...@gmail.com>
> To: "Lance Richardson" <lrich...@redhat.com>
> Cc: netdev@vger.kernel.org, f...@strlen.de, jtl...@redhat.com, 
> han...@stressinduktion.org
> Sent: Friday, November 4, 2016 5:24:09 AM
> Subject: Re: [PATCH net v3] ipv4: allow local fragmentation in 
> ip_finish_output_gso()
> 
> Hi,
> 
> On Thu, 3 Nov 2016 09:06:27 -0400 (EDT) Lance Richardson
> <lrich...@redhat.com> wrote:
> > I'm not sure what could be added that would help, was there something
> > specific you had in mind?
> 
> How about something like this (preliminary, feel free to massage):
> 
> @@ -248,10 +248,16 @@ static int ip_finish_output_gso(struct net *net, struct
> sock *sk,
>  
>   /* Slowpath -  GSO segment length is exceeding the dst MTU.
>*
> -  * This can happen in two cases:
> -  * 1) TCP GRO packet, DF bit not set
> -  * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
> -  * from host network stack.
> +  * This can happen in several cases:
> +  *  - Forwarding of TCP GRO packet, DF bit not set
> +  *  - Forwarding of skb arrived in a virtualization environment (from
> +  *virtio-net/vhost/tap) with TSO/GSO size set by other's network
> +  *stack
> +  *  - Local GSO skb xmitted on an NETIF_F_TSO tunnel stacked over an
> +  *interface with a smaller mtu
> +  *  - Arriving GRO skb (or GSO skb in a virtualized env) that gets L2
> +  *bridged to a NETIF_F_TSO tunnel stacked over an interface with an
> +  *insufficent mtu
>*/
>   features = netif_skb_features(skb);
>   BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
> 

Thanks, that looks good to me. I can send a follow-up patch with this change,
if you like (there seems to be agreement that the original patch is OK).

   Lance


Re: [PATCH net v3] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-04 Thread Lance Richardson
> From: "Shmulik Ladkani" <shmulik.ladk...@gmail.com>
> To: "Hannes Frederic Sowa" <han...@stressinduktion.org>
> Cc: "Lance Richardson" <lrich...@redhat.com>, f...@strlen.de, 
> netdev@vger.kernel.org, jtl...@redhat.com
> Sent: Friday, November 4, 2016 5:40:14 AM
> Subject: Re: [PATCH net v3] ipv4: allow local fragmentation in 
> ip_finish_output_gso()
> 
> On Thu, 3 Nov 2016 22:34:34 +0100 Hannes Frederic Sowa
> <han...@stressinduktion.org> wrote:
> > Correct, but we should maybe redefine the code a bit. From my
> > understanding we can now create an ICMP storm in case every fragment gets.
> 
> Yes, you are right.
> 
> Each segment gets into ip_fragment, and due to outer DF being set,
> ICMP_FRAG_NEEDED is sent per segment.
> 
> BTW, suppose GRO is off, and sender actually did send a burst of
> (non-gso) packets with outer DF set, and each was tunnel encapsulated,
> resulting in oversized frames.
> 
> Would'nt the stack just send the ICMP_FRAG_NEEDED per encapsulated
> frame?
> 
> If so, then the GRO behaviour is aligned, and there's nothing to fix.
> 

Agree.

> Best,
> Shmulik
> 


Re: [PATCH net v3] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-03 Thread Lance Richardson
> From: "Shmulik Ladkani" <shmulik.ladk...@gmail.com>
> To: "Lance Richardson" <lrich...@redhat.com>, f...@strlen.de, 
> han...@stressinduktion.org
> Cc: netdev@vger.kernel.org, jtl...@redhat.com
> Sent: Thursday, November 3, 2016 4:27:51 PM
> Subject: Re: [PATCH net v3] ipv4: allow local fragmentation in 
> ip_finish_output_gso()
> 
> Hi Hannes, Lance,
> 
> On Wed,  2 Nov 2016 16:36:17 -0400 Lance Richardson <lrich...@redhat.com>
> wrote:
> >  
> > -   if (skb_iif && !(df & htons(IP_DF))) {
> > -   /* Arrived from an ingress interface, got encapsulated, with
> > -* fragmentation of encapulating frames allowed.
> > -* If skb is gso, the resulting encapsulated network segments
> > -* may exceed dst mtu.
> > -* Allow IP Fragmentation of segments.
> > -*/
> > -   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
> > -   }
> 
> Thinking this over, I'm concerned of this change.
> 
> Few months back, we discussed this and got to the conclusion that in the
> "ingress,tunnel,egress" scenario, segments are allowed to be
> fragmented if the original inner ip packet does NOT have the DF.
> 
> See
>   https://patchwork.ozlabs.org/patch/657132/
>   https://patchwork.ozlabs.org/patch/661219/
> 
> I think you expressed that those tunneled skbs already having DF set
> should go through pmtu discovery.
> 
> Suggested patch unconditionally calls skb_gso_validate_mtu().
> 
> Thus we're changing behavior for "ingress,tunnel,egress" scenario of
> the tunneled packets having DF set in the inner iph.
> 
> WDYT?
> 

I'm still digesting the patchwork history, but it seems to me:

   1) If we call skb_gso_validate_mtu() and it returns true, 
ip_finish_output2() will
  be called, just as before, so nothing changes.

   2) If we were to avoid calling skb_gso_validate_mtu() when it would have 
returned
  false and call ip_finish_output2() without performing fragmentation, we
  would transmit (or attempt to transmit) a packet that exceeds the 
configured
  MTU.

   3) If we do call skb_gso_validate_mtu() and it returns false, 
ip_finish_output_gso()
  will call ip_fragment() to perform needed fragmentation. Whether 
fragmentation
  actually occurs at this point depends on the value of the DF flag in the
  IP header (and perhaps skb->ignore_df, frag_max_size, etc.).

Is the issue you're pointing out about cases in which the inner IP header has 
DF set
but the tunnel header does not?

Thanks,

   Lance


Re: [PATCH net v3] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-03 Thread Lance Richardson
> From: "Shmulik Ladkani" <shmulik.ladk...@gmail.com>
> To: "Lance Richardson" <lrich...@redhat.com>
> Cc: netdev@vger.kernel.org, f...@strlen.de, jtl...@redhat.com, 
> han...@stressinduktion.org
> Sent: Thursday, November 3, 2016 3:42:39 AM
> Subject: Re: [PATCH net v3] ipv4: allow local fragmentation in 
> ip_finish_output_gso()
> 
> On Wed,  2 Nov 2016 16:36:17 -0400
> Lance Richardson <lrich...@redhat.com> wrote:
> 
> > -   /* common case: fragmentation of segments is not allowed,
> > -* or seglen is <= mtu
> > +   /* common case: seglen is <= mtu
> >  */
> > -   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
> > - skb_gso_validate_mtu(skb, mtu))
> > +   if (skb_gso_validate_mtu(skb, mtu))
> > return ip_finish_output2(net, sk, skb);
> 
> OK.
> Resembles https://patchwork.ozlabs.org/patch/644724/ ;)
> 
> > -   if (skb_iif && !(df & htons(IP_DF))) {
> > -   /* Arrived from an ingress interface, got encapsulated, with
> > -* fragmentation of encapulating frames allowed.
> > -* If skb is gso, the resulting encapsulated network segments
> > -* may exceed dst mtu.
> > -* Allow IP Fragmentation of segments.
> > -*/
> > -   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
> > -   }
> 
> The only potential issue I see removing this, is that the KNOWLEDGE of
> the forwarding/tunnel-bridging usecases (that require a
> skb_gso_validate_mtu check) is lost.
> 
> Meaning, if one decides (as claimed in the past) that locally generated
> traffic must NOT go through fragmentation validation, then no one
> remembers those other valid usecases (which are very specific and not
> trivial to imagine) that MUST go through it; Therefore it can easily get
> broken.
> 

Hi Shmulik,

It is my understanding that the test to avoid fragmentation validation
for locally originated packets was a performance optimization which was
based on the premise that the IP stack will never produce locally originated
packets with an MTU mismatch. Tunneling encapsulation breaks this premise.
For correctness (honoring configured MTU), it seems we cannot avoid calling
skb_gso_validate_mtu().

> Maybe we can elaborate in the comment above the call to
> skb_gso_validate_mtu in ip_finish_output_gso?

I'm not sure what could be added that would help, was there something
specific you had in mind?

Thanks for reviewing this.

Regards,

   Lance
> 
> Best,
> Shmulik
> 


[PATCH net v3] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Lance Richardson
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtl...@redhat.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 v2: IPSKB_FRAG_SEGS is no longer useful, remove it.
 v3: Eliminate unused variable warning.
 include/net/ip.h  |  3 +--
 net/ipv4/ip_forward.c |  2 +-
 net/ipv4/ip_output.c  |  6 ++
 net/ipv4/ip_tunnel_core.c | 11 ---
 net/ipv4/ipmr.c   |  2 +-
 5 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 5413883..d3a1078 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -47,8 +47,7 @@ struct inet_skb_parm {
 #define IPSKB_REROUTED BIT(4)
 #define IPSKB_DOREDIRECT   BIT(5)
 #define IPSKB_FRAG_PMTUBIT(6)
-#define IPSKB_FRAG_SEGSBIT(7)
-#define IPSKB_L3SLAVE  BIT(8)
+#define IPSKB_L3SLAVE  BIT(7)
 
u16 frag_max_size;
 };
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 8b4ffd2..9f0a7b9 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -117,7 +117,7 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_uses_gateway)
goto sr_failed;
 
-   IPCB(skb)->flags |= IPSKB_FORWARDED | IPSKB_FRAG_SEGS;
+   IPCB(skb)->flags |= IPSKB_FORWARDED;
mtu = ip_dst_mtu_maybe_forward(>dst, true);
if (ip_exceeds_mtu(skb, mtu)) {
IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..4971401 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -239,11 +239,9 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
struct sk_buff *segs;
int ret = 0;
 
-   /* common case: fragmentation of segments is not allowed,
-* or seglen is <= mtu
+   /* common case: seglen is <= mtu
 */
-   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
- skb_gso_validate_mtu(skb, mtu))
+   if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length is exceeding the dst MTU.
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 777bc18..fed3d29 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -63,7 +63,6 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
int pkt_len = skb->len - skb_inner_network_offset(skb);
struct net *net = dev_net(rt->dst.dev);
struct net_device *dev = skb->dev;
-   int skb_iif = skb->skb_iif;
struct iphdr *iph;
int err;
 
@@ -73,16 +72,6 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, 
struct sk_buff *skb,
skb_dst_set(skb, >dst);
memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
-   if (skb_iif && !(df & htons(IP_DF))) {
-   /* Arrived from an ingress interface, got encapsulated, with
-* fragmentation of encapulating frames allowed.
-* If skb is gso, the resulting encapsulated network segments
-* may exceed dst mtu.
-* Allow IP Fragmentation of segments.
-*/
-   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
-   }
-
/* Push down and install the IP header. */
skb_push(skb, sizeof(struct iphdr));
skb_reset_network_header(skb);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5f006e1..27089f5 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1749,7 +1749,7 @@ static void ipmr_queue_xmit(struct net *net, struct 
mr_table *mrt,
vif->dev->stats.tx_bytes += skb->len;
}
 
-   IPCB(skb)->flags |= IPSKB_FORWARDED | IPSKB_FRAG_SEGS;
+   IPCB(skb)->flags |= IPSKB_FORWARDED;
 
/* RFC1584 teaches, that DVMRP/PIM router must deliver packets locally
 * not only before forwarding, but after forwarding on all output
-- 
2.5.5



Re: [PATCH net] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Lance Richardson


- Original Message -
> From: "Florian Westphal" <f...@strlen.de>
> To: "Lance Richardson" <lrich...@redhat.com>
> Cc: netdev@vger.kernel.org, f...@strlen.de, jtl...@redhat.com
> Sent: Wednesday, November 2, 2016 1:20:36 PM
> Subject: Re: [PATCH net] ipv4: allow local fragmentation in 
> ip_finish_output_gso()
> 
> Lance Richardson <lrich...@redhat.com> wrote:
> > Some configurations (e.g. geneve interface with default
> > MTU of 1500 over an ethernet interface with 1500 MTU) result
> > in the transmission of packets that exceed the configured MTU.
> > While this should be considered to be a "bad" configuration,
> > it is still allowed and should not result in the sending
> > of packets that exceed the configured MTU.
> > 
> > Fix by dropping the assumption in ip_finish_output_gso() that
> > locally originated gso packets will never need fragmentation.
> > Basic testing using iperf (observing CPU usage and bandwidth)
> > have shown no measurable performance impact for traffic not
> > requiring fragmentation.
> > 
> > Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the
> > stack")
> > Reported-by: Jan Tluka <jtl...@redhat.com>
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> > ---
> >  net/ipv4/ip_output.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 03e7f73..4971401 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -239,11 +239,9 @@ static int ip_finish_output_gso(struct net *net,
> > struct sock *sk,
> > struct sk_buff *segs;
> > int ret = 0;
> >  
> > -   /* common case: fragmentation of segments is not allowed,
> > -* or seglen is <= mtu
> > +   /* common case: seglen is <= mtu
> >  */
> > -   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
> > - skb_gso_validate_mtu(skb, mtu))
> > +   if (skb_gso_validate_mtu(skb, mtu))
> 
> IPSKB_FRAG_SEGS is now useless and should be removed.
> 

Thanks, Florian, I've removed IPSKB_FRAG_SEGS in v2.

   Lance


[PATCH net v2] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Lance Richardson
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtl...@redhat.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 v2: IPSKB_FRAG_SEGS is no longer useful, remove it.

 include/net/ip.h  |  3 +--
 net/ipv4/ip_forward.c |  2 +-
 net/ipv4/ip_output.c  |  6 ++
 net/ipv4/ip_tunnel_core.c | 10 --
 net/ipv4/ipmr.c   |  2 +-
 5 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 5413883..d3a1078 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -47,8 +47,7 @@ struct inet_skb_parm {
 #define IPSKB_REROUTED BIT(4)
 #define IPSKB_DOREDIRECT   BIT(5)
 #define IPSKB_FRAG_PMTUBIT(6)
-#define IPSKB_FRAG_SEGSBIT(7)
-#define IPSKB_L3SLAVE  BIT(8)
+#define IPSKB_L3SLAVE  BIT(7)
 
u16 frag_max_size;
 };
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 8b4ffd2..9f0a7b9 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -117,7 +117,7 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_uses_gateway)
goto sr_failed;
 
-   IPCB(skb)->flags |= IPSKB_FORWARDED | IPSKB_FRAG_SEGS;
+   IPCB(skb)->flags |= IPSKB_FORWARDED;
mtu = ip_dst_mtu_maybe_forward(>dst, true);
if (ip_exceeds_mtu(skb, mtu)) {
IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..4971401 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -239,11 +239,9 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
struct sk_buff *segs;
int ret = 0;
 
-   /* common case: fragmentation of segments is not allowed,
-* or seglen is <= mtu
+   /* common case: seglen is <= mtu
 */
-   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
- skb_gso_validate_mtu(skb, mtu))
+   if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length is exceeding the dst MTU.
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 777bc18..0f6995b1 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -73,16 +73,6 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, 
struct sk_buff *skb,
skb_dst_set(skb, >dst);
memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
-   if (skb_iif && !(df & htons(IP_DF))) {
-   /* Arrived from an ingress interface, got encapsulated, with
-* fragmentation of encapulating frames allowed.
-* If skb is gso, the resulting encapsulated network segments
-* may exceed dst mtu.
-* Allow IP Fragmentation of segments.
-*/
-   IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
-   }
-
/* Push down and install the IP header. */
skb_push(skb, sizeof(struct iphdr));
skb_reset_network_header(skb);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5f006e1..27089f5 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1749,7 +1749,7 @@ static void ipmr_queue_xmit(struct net *net, struct 
mr_table *mrt,
vif->dev->stats.tx_bytes += skb->len;
}
 
-   IPCB(skb)->flags |= IPSKB_FORWARDED | IPSKB_FRAG_SEGS;
+   IPCB(skb)->flags |= IPSKB_FORWARDED;
 
/* RFC1584 teaches, that DVMRP/PIM router must deliver packets locally
 * not only before forwarding, but after forwarding on all output
-- 
2.5.5



[PATCH net] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Lance Richardson
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtl...@redhat.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/ip_output.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..4971401 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -239,11 +239,9 @@ static int ip_finish_output_gso(struct net *net, struct 
sock *sk,
struct sk_buff *segs;
int ret = 0;
 
-   /* common case: fragmentation of segments is not allowed,
-* or seglen is <= mtu
+   /* common case: seglen is <= mtu
 */
-   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
- skb_gso_validate_mtu(skb, mtu))
+   if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length is exceeding the dst MTU.
-- 
2.5.5



Re: [PATCH net] vti6: flush x-netns xfrm cache when vti interface is removed

2016-10-03 Thread Lance Richardson
> From: "Nicolas Dichtel" <nicolas.dich...@6wind.com>
> To: da...@davemloft.net, "steffen klassert" <steffen.klass...@secunet.com>
> Cc: netdev@vger.kernel.org, "Nicolas Dichtel" <nicolas.dich...@6wind.com>, 
> "Lance Richardson" <lrich...@redhat.com>
> Sent: Friday, September 30, 2016 5:11:07 AM
> Subject: [PATCH net] vti6: flush x-netns xfrm cache when vti interface is 
> removed
> 
> This is the same fix than commit a5d0dc810abf ("vti: flush x-netns xfrm
> cache when vti interface is removed")
> 
> This patch fixes a refcnt problem when a x-netns vti6 interface is removed:
> unregister_netdevice: waiting for vti6_test to become free. Usage count = 1
> 
> Here is a script to reproduce the problem:
> 
> ip link set dev ntfp2 up
> ip addr add dev ntfp2 2001::1/64
> ip link add vti6_test type vti6 local 2001::1 remote 2001::2 key 1
> ip netns add secure
> ip link set vti6_test netns secure
> ip netns exec secure ip link set vti6_test up
> ip netns exec secure ip link s lo up
> ip netns exec secure ip addr add dev vti6_test 2003::1/64
> ip -6 xfrm policy add dir out tmpl src 2001::1 dst 2001::2 proto esp \
>  mode tunnel mark 1
> ip -6 xfrm policy add dir in tmpl src 2001::2 dst 2001::1 proto esp \
>  mode tunnel mark 1
> ip xfrm state add src 2001::1 dst 2001::2 proto esp spi 1 mode tunnel \
>  enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 
> 1
> ip xfrm state add src 2001::2 dst 2001::1 proto esp spi 1 mode tunnel \
>  enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 
> 1
> ip netns exec secure  ping6 -c 4 2003::2
> ip netns del secure
> 
> CC: Lance Richardson <lrich...@redhat.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dich...@6wind.com>
> ---

Looks good, corresponds closely to the ipv4 version of the fix.

Acked-by: Lance Richardson <lrich...@redhat.com>


[PATCH net-next] gre: use nla_get_be32() to extract flowinfo

2016-09-24 Thread Lance Richardson
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
extract a __be32 value instead of nla_get_u32().

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 397e1ed..4ce74f8 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1239,7 +1239,7 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
parms->encap_limit = nla_get_u8(data[IFLA_GRE_ENCAP_LIMIT]);
 
if (data[IFLA_GRE_FLOWINFO])
-   parms->flowinfo = nla_get_u32(data[IFLA_GRE_FLOWINFO]);
+   parms->flowinfo = nla_get_be32(data[IFLA_GRE_FLOWINFO]);
 
if (data[IFLA_GRE_FLAGS])
parms->flags = nla_get_u32(data[IFLA_GRE_FLAGS]);
-- 
2.5.5



Re: [PATCH net v2] ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()

2016-09-23 Thread Lance Richardson
> From: "Sergei Shtylyov" <sergei.shtyl...@cogentembedded.com>
> To: "Lance Richardson" <lrich...@redhat.com>, netdev@vger.kernel.org
> Cc: "shmulik ladkani" <shmulik.ladk...@gmail.com>, jb...@redhat.com
> Sent: Friday, September 23, 2016 4:01:15 PM
> Subject: Re: [PATCH net v2] ip6_gre: fix flowi6_proto value in 
> ip6gre_xmit_other()
> 
> Hello.
> 
> On 09/23/2016 10:50 PM, Lance Richardson wrote:
> 
> > Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
> > xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.
> >
> > Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
> > This affected output route lookup for packets sent on an ip6gretap device
> > in cases where routing was dependent on the value of flowi6_proto.
> >
> > Since the correct proto is already set in the tunnel flowi6 template via
> > commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
> > path."), simply delete the line setting the incorrect flowi6_proto value.
> >
> > Suggested-by: Jiri Benc <jb...@redhat.com>
> > Fixes: commit c12b395a4664 ("gre: Support GRE over IPv6")
> 
> That "commit" isn't needed here, this tag has a standardized format.
> Hopefully, can be fixed while applying...

Thanks for pointing that out, I mistakenly added that "commit" after
checkpatch.pl complained about not having "commit" before the hashes
in the log. Hoping it can be fixed when applying as well.

> 
> > Reviewed-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> [...]
> 
> MBR, Sergei
> 
> 


Re: [PATCH net] ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()

2016-09-23 Thread Lance Richardson
> From: "Shmulik Ladkani" <shmulik.ladk...@gmail.com>
> To: "Lance Richardson" <lrich...@redhat.com>
> Cc: netdev@vger.kernel.org
> Sent: Friday, September 23, 2016 3:00:36 PM
> Subject: Re: [PATCH net] ip6_gre: fix flowi6_proto value in 
> ip6gre_xmit_other()
> 
> On Fri, 23 Sep 2016 12:54:59 -0400 Lance Richardson <lrich...@redhat.com>
> wrote:
> > Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
> > xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.
> 
> Suggesting to add:
> 
> Up until now, 'ip6gre_xmit_other' has set flowi6_proto to a bogus value.
> This affects output route lookup upon xmit of non ipv4/ipv6 packets on a
> ip6gretap device, in cases where routing depends on flowi6_proto.
> 

Added in v2, taking some editorial license (please let me know if I mangled it
too badly).

Thanks,

   Lance

> > Since the correct proto is already set in the tunnel flowi6 template via
> > commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
> > path."), simply delete the line setting the incorrect flowi6_proto value.
> > 
> > Suggested-by: Jiri Benc <jb...@redhat.com>
> > Fixes: commit c12b395a4664 ("gre: Support GRE over IPv6")
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> 
> Reviewed-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
> 


[PATCH net v2] ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()

2016-09-23 Thread Lance Richardson
Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.

Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
This affected output route lookup for packets sent on an ip6gretap device
in cases where routing was dependent on the value of flowi6_proto.

Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.

Suggested-by: Jiri Benc <jb...@redhat.com>
Fixes: commit c12b395a4664 ("gre: Support GRE over IPv6")
Reviewed-by: Shmulik Ladkani <shmulik.ladk...@gmail.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
v2: expanded commit description as suggested by Shmulik Ladkani.

 net/ipv6/ip6_gre.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 704274c..edc3daa 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -648,7 +648,6 @@ static int ip6gre_xmit_other(struct sk_buff *skb, struct 
net_device *dev)
encap_limit = t->parms.encap_limit;
 
memcpy(, >fl.u.ip6, sizeof(fl6));
-   fl6.flowi6_proto = skb->protocol;
 
err = gre_handle_offloads(skb, !!(t->parms.o_flags & TUNNEL_CSUM));
if (err)
-- 
2.5.5



[PATCH net] ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()

2016-09-23 Thread Lance Richardson
Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.
Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.

Suggested-by: Jiri Benc <jb...@redhat.com>
Fixes: commit c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv6/ip6_gre.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 704274c..edc3daa 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -648,7 +648,6 @@ static int ip6gre_xmit_other(struct sk_buff *skb, struct 
net_device *dev)
encap_limit = t->parms.encap_limit;
 
memcpy(, >fl.u.ip6, sizeof(fl6));
-   fl6.flowi6_proto = skb->protocol;
 
err = gre_handle_offloads(skb, !!(t->parms.o_flags & TUNNEL_CSUM));
if (err)
-- 
2.5.5



Re: [PATCH v3 net-next 05/16] tcp: switch back to proper tcp_skb_cb size check in tcp_init()

2016-09-19 Thread Lance Richardson
> From: "Neal Cardwell" 
> To: "David Miller" 
> Cc: netdev@vger.kernel.org, "Eric Dumazet" , "Soheil 
> Hassas Yeganeh" , "Neal
> Cardwell" , "Yuchung Cheng" 
> Sent: Sunday, September 18, 2016 6:03:42 PM
> Subject: [PATCH v3 net-next 05/16] tcp: switch back to proper tcp_skb_cb size 
> check in tcp_init()
> 
> From: Eric Dumazet 
> 
> Revert to the tcp_skb_cb size check that tcp_init() had before commit
> b4772ef879a8 ("net: use common macro for assering skb->cb[] available
> size in protocol families"). As related commit 744d5a3e9fe2 ("net:
> move skb->dropcount to skb->cb[]") explains, the
> sock_skb_cb_check_size() mechanism was added to ensure that there is
> space for dropcount, "for protocol families using it". But TCP is not
> a protocol using dropcount, so tcp_init() doesn't need to provision
> space for dropcount in the skb->cb[], and thus we can revert to the
> older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
> more bytes of the skb->cb[] space.
> 
> Fixes: b4772ef879a8 ("net: use common macro for assering skb->cb[] available
> size in protocol families")
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Yuchung Cheng 
> ---
>  net/ipv4/tcp.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 5b0b49c..53798e1 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3244,11 +3244,12 @@ static void __init tcp_init_mem(void)
>  
>  void __init tcp_init(void)
>  {
> - unsigned long limit;
>   int max_rshare, max_wshare, cnt;
> + unsigned long limit;
> + struct sk_buff *skb;
>   unsigned int i;
>  
> - sock_skb_cb_check_size(sizeof(struct tcp_skb_cb));
> + BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
>  
The skb local variable could be avoided via:

BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > FIELD_SIZEOF(struct sk_buff, 
cb));

>   percpu_counter_init(_sockets_allocated, 0, GFP_KERNEL);
>   percpu_counter_init(_orphan_count, 0, GFP_KERNEL);
> --
> 2.8.0.rc3.226.g39d4020
> 
> 


[PATCH net-next] openvswitch: avoid deferred execution of recirc actions

2016-09-13 Thread Lance Richardson
The ovs kernel data path currently defers the execution of all
recirc actions until stack utilization is at a minimum.
This is too limiting for some packet forwarding scenarios due to
the small size of the deferred action FIFO (10 entries). For
example, broadcast traffic sent out more than 10 ports with
recirculation results in packet drops when the deferred action
FIFO becomes full, as reported here:

 http://openvswitch.org/pipermail/dev/2016-March/067672.html

Since the current recursion depth is available (it is already tracked
by the exec_actions_level pcpu variable), we can use it to determine
whether to execute recirculation actions immediately (safe when
recursion depth is low) or defer execution until more stack space is
available.

With this change, the deferred action fifo size becomes a non-issue
for currently failing scenarios because it is no longer used when
there are three or fewer recursions through ovs_execute_actions().

Suggested-by: Pravin Shelar <pshe...@ovn.org>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/openvswitch/actions.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 6eb5261..ef7cc6c 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -71,6 +71,8 @@ struct ovs_frag_data {
 static DEFINE_PER_CPU(struct ovs_frag_data, ovs_frag_data_storage);
 
 #define DEFERRED_ACTION_FIFO_SIZE 10
+#define OVS_RECURSION_LIMIT 5
+#define OVS_DEFERRED_ACTION_THRESHOLD (OVS_RECURSION_LIMIT - 2)
 struct action_fifo {
int head;
int tail;
@@ -78,7 +80,12 @@ struct action_fifo {
struct deferred_action fifo[DEFERRED_ACTION_FIFO_SIZE];
 };
 
+struct recirc_keys {
+   struct sw_flow_key key[OVS_DEFERRED_ACTION_THRESHOLD];
+};
+
 static struct action_fifo __percpu *action_fifos;
+static struct recirc_keys __percpu *recirc_keys;
 static DEFINE_PER_CPU(int, exec_actions_level);
 
 static void action_fifo_init(struct action_fifo *fifo)
@@ -1020,6 +1027,7 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
  const struct nlattr *a, int rem)
 {
struct deferred_action *da;
+   int level;
 
if (!is_flow_key_valid(key)) {
int err;
@@ -1043,6 +1051,18 @@ static int execute_recirc(struct datapath *dp, struct 
sk_buff *skb,
return 0;
}
 
+   level = this_cpu_read(exec_actions_level);
+   if (level <= OVS_DEFERRED_ACTION_THRESHOLD) {
+   struct recirc_keys *rks = this_cpu_ptr(recirc_keys);
+   struct sw_flow_key *recirc_key = >key[level - 1];
+
+   *recirc_key = *key;
+   recirc_key->recirc_id = nla_get_u32(a);
+   ovs_dp_process_packet(skb, recirc_key);
+
+   return 0;
+   }
+
da = add_deferred_actions(skb, key, NULL);
if (da) {
da->pkt_key.recirc_id = nla_get_u32(a);
@@ -1209,11 +1229,10 @@ int ovs_execute_actions(struct datapath *dp, struct 
sk_buff *skb,
const struct sw_flow_actions *acts,
struct sw_flow_key *key)
 {
-   static const int ovs_recursion_limit = 5;
int err, level;
 
level = __this_cpu_inc_return(exec_actions_level);
-   if (unlikely(level > ovs_recursion_limit)) {
+   if (unlikely(level > OVS_RECURSION_LIMIT)) {
net_crit_ratelimited("ovs: recursion limit reached on datapath 
%s, probable configuration error\n",
 ovs_dp_name(dp));
kfree_skb(skb);
@@ -1238,10 +1257,17 @@ int action_fifos_init(void)
if (!action_fifos)
return -ENOMEM;
 
+   recirc_keys = alloc_percpu(struct recirc_keys);
+   if (!recirc_keys) {
+   free_percpu(action_fifos);
+   return -ENOMEM;
+   }
+
return 0;
 }
 
 void action_fifos_exit(void)
 {
free_percpu(action_fifos);
+   free_percpu(recirc_keys);
 }
-- 
2.5.5



Re: [PATCH] net_namespace: fixed net_device reference leak

2016-09-08 Thread Lance Richardson
> From: "Jojy Varghese" 
> To: netdev@vger.kernel.org
> Cc: da...@davemloft.net
> Sent: Thursday, September 8, 2016 3:08:29 PM
> Subject: Re: [PATCH] net_namespace: fixed net_device reference leak
> 
> (Updating the patch for the case when a non-loopback dev could be
> unregistered)
> 
> Currently during namespace cleanup, if ‘dst’ subsystem is holding
> a reference to the interface in the namespace, it does not get released.
> Current code first does a ’dev_hold’ on the same device followed by a
> ‘dev_put’ on the same device resulting in a no-op.
> 
> This change fixes this leak by assigning the initial namespace’s loopback
> device to the ‘dst’ before releasing the reference to the network
> device being released.
> 
> Additional reference: https://github.com/docker/docker/issues/5618
> 
> Signed-off-by: Jojy G Varghese 
> ---

If the dst wasn't being cleaned up in the namespace that's being deleted,
it is unlikely it will be cleaned up after moving it to the init namespace.

I think it would be better to figure out why this dst is being leaked
in the first place, otherwise this is probably still a memory leak.

Is this easily reproducible with current net or net-next kernels?

Regards,

Lance


Re: [net] openvswitch: Allow deferred action fifo to expand during run time

2016-08-24 Thread Lance Richardson
> From: "David Miller" 
> To: az...@ovn.org
> Cc: d...@openvswitch.com, netdev@vger.kernel.org
> Sent: Friday, March 18, 2016 5:19:09 PM
> Subject: Re: [net] openvswitch: Allow deferred action fifo to expand during 
> run time
> 
> From: Andy Zhou 
> Date: Thu, 17 Mar 2016 21:32:13 -0700
> 
> > Current openvswitch implementation allows up to 10 recirculation actions
> > for each packet. This limit was sufficient for most use cases in the
> > past, but with more new features, such as supporting connection
> > tracking, and testing in larger scale network environment,
> > This limit may be too restrictive.
>  ...
> 
> Actions that need to recirculate that many times are extremely poorly
> designed, and will have significant performance problems.
> 
> I think the way rules are put together and processed should be redone
> before we do insane stuff like this.
> 
> There is no way I'm applying a patch like this, sorry.
> 

Apologies for coming into this thread so late, I happened on it after finding
out that this is actually an issue in some production networks.

The need to buffer so many deferred actions seems to be mostly due to having
relatively simple rules (that have, say, one or two recirculations) that get
multiplied per packet by the number of egress ports.

For example, a configuration with 11 or more OVS bond ports in balance-tcp
mode (which needs one recirculation) will exceed the deferred action fifo limit
of 10 every time a broadcast (or multicast or unknown unicast) is forwarded by
the OVS bridge because one entry will be consumed by each egress port. Since
the order in which egress ports are handled is deterministic, this means
e.g. broadcast ARP requests will only ever make it out the first 10 
bond ports in this scenario.

Note that bonding isn't necessary to have this issue, it just makes for a
relatively straightforward example.

Andy's patch certainly seems to be an improvement on this situation,
but maybe there another/better way.

Regards,

   Lance


[PATCH net] sctp: fix overrun in sctp_diag_dump_one()

2016-08-23 Thread Lance Richardson
The function sctp_diag_dump_one() currently performs a memcpy()
of 64 bytes from a 16 byte field into another 16 byte field. Fix
by using correct size, use sizeof to obtain correct size instead
of using a hard-coded constant.

Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/sctp/sctp_diag.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sctp/sctp_diag.c b/net/sctp/sctp_diag.c
index bb69153..f3508aa 100644
--- a/net/sctp/sctp_diag.c
+++ b/net/sctp/sctp_diag.c
@@ -424,11 +424,13 @@ static int sctp_diag_dump_one(struct sk_buff *in_skb,
paddr.v4.sin_family = AF_INET;
} else {
laddr.v6.sin6_port = req->id.idiag_sport;
-   memcpy(_addr, req->id.idiag_src, 64);
+   memcpy(_addr, req->id.idiag_src,
+  sizeof(laddr.v6.sin6_addr));
laddr.v6.sin6_family = AF_INET6;
 
paddr.v6.sin6_port = req->id.idiag_dport;
-   memcpy(_addr, req->id.idiag_dst, 64);
+   memcpy(_addr, req->id.idiag_dst,
+  sizeof(paddr.v6.sin6_addr));
paddr.v6.sin6_family = AF_INET6;
}
 
-- 
2.5.5



[PATCH net v2] vti: flush x-netns xfrm cache when vti interface is removed

2016-08-09 Thread Lance Richardson
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):

  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.

Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.

A more detailed description of the problem scenario (referencing commands
in the script below):

(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

  The vti_test interface is created in the init namespace. vti_tunnel_init()
  attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
  setting the tunnel net to _net.

(2) ip link set vti_test netns secure

  The vti_test interface is moved to the "secure" netns. Note that
  the associated struct ip_tunnel still has tunnel->net set to _net.

(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

  The first packet sent using the vti device causes xfrm_lookup() to be
  called as follows:

  dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

  Note that tunnel->net is the init namespace, while skb_dst(skb) references
  the vti_test interface in the "secure" namespace. The returned dst
  references an interface in the init namespace.

  Also note that the first parameter to xfrm_lookup() determines which flow
  cache is used to store the computed xfrm bundle, so after xfrm_lookup()
  returns there will be a cached bundle in the init namespace flow cache
  with a dst referencing a device in the "secure" namespace.

(4) ip netns del secure

  Kernel begins to delete the "secure" namespace.  At some point the
  vti_test interface is deleted, at which point dst_ifdown() changes
  the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
  in the "secure" namespace however).
  Since nothing has happened to cause the init namespace's flow cache
  to be garbage collected, this dst remains attached to the flow cache,
  so the kernel loops waiting for the last reference to lo to go away.


ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8

ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
   proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
   proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

ip netns del secure


Reported-by: Hangbin Liu <ha...@redhat.com>
Reported-by: Jan Tluka <jtl...@redhat.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
v2: Perform garbage collection on NETDEV_DOWN notification (v1 did this 
in uninit op handler).

 net/ipv4/ip_vti.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index a917903..cc701fa 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -557,6 +557,33 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
.get_link_net   = ip_tunnel_get_link_net,
 };
 
+static bool is_vti_tunnel(const struct net_device *dev)
+{
+   return dev->netdev_ops == _netdev_ops;
+}
+
+static int vti_device_event(struct notifier_block *unused,
+   unsigned long event, void *ptr)
+{
+   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct ip_tunnel *tunnel = netdev_priv(dev);
+
+   if (!is_vti_tunnel(dev))
+   return NOTIFY_DONE;
+
+   switch (event) {
+   case NETDEV_DOWN:
+   if (!net_eq(tunnel->net, dev_net(dev)))
+   xfrm_garbage_collect(tunnel->net);
+   break;
+   }
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block vti_notifier_block __read_mostly = {
+   .notifier_call = vti_device_event,
+};
+
 static int __init vti_init(void)
 {
const char *msg;
@@ -564,6 +591,8 @@ static int __init vti_init(void)
 
pr_info("IPv4 over IPsec tunneling driver\n");
 
+   register_netdevice_notifier(_notifier_block);
+
msg = "tunnel devi

[PATCH net] vti: flush x-netns xfrm cache when vti interface is removed

2016-08-08 Thread Lance Richardson
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):

  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.

Prevent this problem by garbage collecting the tunnel namespace flow cache
when a vti interface is deleted from a namespace that is different from
the tunnel namepace.

A more detailed description of the problem scenario (referencing commands
in the problem script) is as follows:

(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

  The vti_test interface is created in the init namespace. vti_tunnel_init()
  attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
  setting tunnel->net to _net.

(2) ip link set vti_test netns secure

  The vti_test interface is moved to the "secure" netns. Note that
  the associated struct ip_tunnel still has tunnel->net set to _net.

(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

  The first packet sent using the vti device causes xfrm_lookup() to be
  called as follows:

  dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

  Note that tunnel->net is the init namespace, while skb_dst(skb) references
  the vti_test interface in the "secure" namespace. The returned dst
  references an interface in the init namespace.

  Also note that the first parameter to xfrm_lookup() determines which flow
  cache is used to store the computed xfrm bundle, so after xfrm_lookup()
  returns there will be a cached bundle in the init namespace flow cache
  with a dst referencing a device in the "secure" namespace.

(4) ip netns del secure

  Kernel begins to delete the "secure" namespace.  At some point the
  vti_test interface is deleted, at which point dst_ifdown() changes
  the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
  in the "secure" namespace however).

  Since nothing has happened to cause the init namespace's flow cache
  to be garbage collected, this dst remains attached to the flow cache,
  so the kernel loops waiting for the last reference to lo to go away.


ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8

ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
   proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
   proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

ip netns del secure


Reported-by: Hangbin Liu <ha...@redhat.com>
Reported-by: Jan Tluka <jtl...@redhat.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/ip_vti.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index a917903..a815735 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -48,6 +48,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly;
 
 static int vti_net_id __read_mostly;
 static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_uninit(struct net_device *dev);
 
 static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi,
 int encap_type)
@@ -359,7 +360,7 @@ vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, 
int cmd)
 
 static const struct net_device_ops vti_netdev_ops = {
.ndo_init   = vti_tunnel_init,
-   .ndo_uninit = ip_tunnel_uninit,
+   .ndo_uninit = vti_tunnel_uninit,
.ndo_start_xmit = vti_tunnel_xmit,
.ndo_do_ioctl   = vti_tunnel_ioctl,
.ndo_change_mtu = ip_tunnel_change_mtu,
@@ -392,6 +393,17 @@ static int vti_tunnel_init(struct net_device *dev)
return ip_tunnel_init(dev);
 }
 
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+   struct ip_tunnel *tunnel = netdev_priv(dev);
+   struct net *net = tunnel->net;
+
+   ip_tunnel_uninit(dev);
+
+   if (!net_eq(net, dev_net(dev)))
+   xfrm_garbage_collect(net);
+}
+
 static void __net_init vti_fb_tunnel_init(struct net_device *dev)
 {
struct ip_tunnel *tunnel = netdev_priv(dev);
-- 
2.5.5



Re: [PATCHi next] veth: advertise peer link relationship for both devices

2016-06-11 Thread Lance Richardson
- Original Message -
> From: "David Miller" <da...@davemloft.net>
> To: lrich...@redhat.com
> Cc: netdev@vger.kernel.org, "nicolas dichtel" <nicolas.dich...@6wind.com>
> Sent: Saturday, June 11, 2016 6:43:40 PM
> Subject: Re: [PATCHi next] veth: advertise peer link relationship for both 
> devices
> 
> From: Lance Richardson <lrich...@redhat.com>
> Date: Fri, 10 Jun 2016 12:32:19 -0400
> 
> > Currently, when creating a veth pair, notfications to user
> > space only include link peer for one end of the veth pair:
> ># ip monitor link &
> ># ip link add dev vm1 type veth peer name vm2
> >30: vm2@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> >link/ether be:e3:b7:0e:14:52 brd ff:ff:ff:ff:ff:ff
> >31: vm1@vm2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
> >link/ether da:e6:a6:c5:42:54 brd ff:ff:ff:ff:ff:ff
> > 
> > With this change, netlink notifications are sent with complete
> > information for both interfaces of the veth pair:
> > 
> ># 3: vm2@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> >link/ether e2:94:54:8a:ac:f5 brd ff:ff:ff:ff:ff:ff
> >4: vm1@vm2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
> >link/ether b2:05:70:e0:fc:35 brd ff:ff:ff:ff:ff:ff
> >3: vm2@vm1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
> >link/ether e2:94:54:8a:ac:f5 brd ff:ff:ff:ff:ff:ff
> > 
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> 
> I don't know about this.
> 
> First of all, those notifications you get above tell you everything you
> need to know in order to figure out what both ends of the veth pair are.
> 
> In fact, I would say that the vm1@vm2 notification #31 above is the _only_
> one you absolutely need.
> 
> > @@ -466,8 +466,16 @@ static int veth_newlink(struct net *src_net, struct
> > net_device *dev,
> >  
> > priv = netdev_priv(peer);
> > rcu_assign_pointer(priv->peer, dev);
> > +
> > +   err = rtnl_configure_link(dev, NULL);
> > +   if (err < 0)
> > +   goto err_configure_dev;
> > +
> > +   rtmsg_ifinfo(RTM_NEWLINK, peer, 0, GFP_KERNEL);
> > return 0;
> >  
> > +err_configure_dev:
> > +   /* nothing to do */
> >  err_register_dev:
> > /* nothing to do */
> >  err_configure_peer:
> 
> If you're registering the peer here explicitly, this means a link configure
> somewhere else is now superfluous.
> 
> I really don't like this change at all, both from a necessity perspective as
> well as from it's implementation.
> 

I'll confess to not being super-happy with it myself, which is why I've
been sitting on this patch for some time now. A hard NAK will help justify
a "will not fix" to the reporter of this issue.

Thanks,

  Lance


[PATCHi next] veth: advertise peer link relationship for both devices

2016-06-10 Thread Lance Richardson
Currently, when creating a veth pair, notfications to user
space only include link peer for one end of the veth pair:
   # ip monitor link &
   # ip link add dev vm1 type veth peer name vm2
   30: vm2@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
   link/ether be:e3:b7:0e:14:52 brd ff:ff:ff:ff:ff:ff
   31: vm1@vm2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
   link/ether da:e6:a6:c5:42:54 brd ff:ff:ff:ff:ff:ff

With this change, netlink notifications are sent with complete
information for both interfaces of the veth pair:

   # 3: vm2@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
   link/ether e2:94:54:8a:ac:f5 brd ff:ff:ff:ff:ff:ff
   4: vm1@vm2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
   link/ether b2:05:70:e0:fc:35 brd ff:ff:ff:ff:ff:ff
   3: vm2@vm1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN
   link/ether e2:94:54:8a:ac:f5 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 drivers/net/veth.c   | 10 +-
 net/core/rtnetlink.c | 10 --
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e6..9151686 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -458,7 +458,7 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
netif_carrier_off(dev);
 
/*
-* tie the deviced together
+* tie the devices together
 */
 
priv = netdev_priv(dev);
@@ -466,8 +466,16 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
 
priv = netdev_priv(peer);
rcu_assign_pointer(priv->peer, dev);
+
+   err = rtnl_configure_link(dev, NULL);
+   if (err < 0)
+   goto err_configure_dev;
+
+   rtmsg_ifinfo(RTM_NEWLINK, peer, 0, GFP_KERNEL);
return 0;
 
+err_configure_dev:
+   /* nothing to do */
 err_register_dev:
/* nothing to do */
 err_configure_peer:
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d69c464..e0956bb 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2165,6 +2165,7 @@ static int rtnl_dellink(struct sk_buff *skb, struct 
nlmsghdr *nlh)
 int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm)
 {
unsigned int old_flags;
+   unsigned int gchanges;
int err;
 
old_flags = dev->flags;
@@ -2174,9 +2175,14 @@ int rtnl_configure_link(struct net_device *dev, const 
struct ifinfomsg *ifm)
return err;
}
 
-   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
+   if (dev->rtnl_link_state == RTNL_LINK_INITIALIZING) {
+   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
+   gchanges = ~0U;
+   } else {
+   gchanges = dev->flags ^ old_flags;
+   }
 
-   __dev_notify_flags(dev, old_flags, ~0U);
+   __dev_notify_flags(dev, old_flags, gchanges);
return 0;
 }
 EXPORT_SYMBOL(rtnl_configure_link);
-- 
2.5.5



Re: [net v3] veth: advertise peer link once both links are tied together

2016-06-10 Thread Lance Richardson
- Original Message -
> From: "Nicolas Dichtel" <nicolas.dich...@6wind.com>
> To: "Lance Richardson" <lrich...@redhat.com>, "Vincent Bernat" 
> <vinc...@bernat.im>
> Cc: "David S. Miller" <da...@davemloft.net>, "Vijay Pandurangan" 
> <vij...@vijayp.ca>, "Paolo Abeni"
> <pab...@redhat.com>, netdev@vger.kernel.org
> Sent: Friday, June 10, 2016 9:15:01 AM
> Subject: Re: [net v3] veth: advertise peer link once both links are tied 
> together
> 
> Le 08/06/2016 22:30, Lance Richardson a écrit :
> [snip]
> > I've been pondering how to fix this very problem off-and-on for a few
> > months
> > now, without having arrived at any solution that was particularly
> > satisfying.
> > 
> > The main constraints I've been trying to meet are:
> >- User-space should be informed of veth pairing for both peers.
> >- RTM_NEWLINK messages should not refer to interfaces that haven't
> >  been announced to user-space via previous RTM_NEWLINK messages.
> >- The first (and only the first) RTM_NEWLINK message for a given
> >  interface should have ifi_changes set to ~0U, subsequent RTM_NEWLINK
> >  messages should have ifi_changes set to reflect the flags that
> >  have changed.
> > 
> > This is the closest I've come to a satisfactory solution, it does meet
> > the above constraints but still seems a little unnatural to me:
> The patch looks good to me. Can you submit it formally?
> 

Will post in a bit, thanks!

  Lance


Re: [net v3] veth: advertise peer link once both links are tied together

2016-06-08 Thread Lance Richardson
- Original Message -
> From: "Nicolas Dichtel" 
> To: "Vincent Bernat" 
> Cc: "David S. Miller" , "Vijay Pandurangan" 
> , "Paolo Abeni"
> , netdev@vger.kernel.org
> Sent: Tuesday, May 31, 2016 5:17:20 AM
> Subject: Re: [net v3] veth: advertise peer link once both links are tied 
> together
> 
> Le 31/05/2016 08:29, Vincent Bernat a écrit :
> >  ❦ 30 mai 2016 18:27 CEST, Nicolas Dichtel  :
> > 
>  +
>  +rtmsg_ifinfo(RTM_NEWLINK, peer, IFF_SLAVE, GFP_KERNEL);
> >>>
> >>> Maybe ~0U would be better than hijacking IFF_SLAVE?
> >> IFF_SLAVE is wrong. It's a flag here, that will be put in the ifi_change
> >> field
> >> not an attribute number.
> > 
> > There are some use of IFF_SLAVE (in bonding/bond_main.c). But, I'll
> > update the patch nonetheless.
> Sorry, I read it too quickly, IFF_SLAVE is a flag, not an attribute.
> But this field indicates to the userland which flags has changed and this
> flag
> does not change here ;-)
> 

I've been pondering how to fix this very problem off-and-on for a few months
now, without having arrived at any solution that was particularly satisfying.

The main constraints I've been trying to meet are:
   - User-space should be informed of veth pairing for both peers.
   - RTM_NEWLINK messages should not refer to interfaces that haven't
 been announced to user-space via previous RTM_NEWLINK messages.
   - The first (and only the first) RTM_NEWLINK message for a given
 interface should have ifi_changes set to ~0U, subsequent RTM_NEWLINK
 messages should have ifi_changes set to reflect the flags that
 have changed.

This is the closest I've come to a satisfactory solution, it does meet
the above constraints but still seems a little unnatural to me:

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e6..9151686 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -466,8 +466,16 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
 
priv = netdev_priv(peer);
rcu_assign_pointer(priv->peer, dev);
+
+   err = rtnl_configure_link(dev, NULL);
+   if (err < 0)
+   goto err_configure_dev;
+
+   rtmsg_ifinfo(RTM_NEWLINK, peer, 0, GFP_KERNEL);
return 0;
 
+err_configure_dev:
+   /* nothing to do */
 err_register_dev:
/* nothing to do */
 err_configure_peer:
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d69c464..28ee417 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2165,6 +2165,7 @@ static int rtnl_dellink(struct sk_buff *skb, struct 
nlmsghdr *nlh)
 int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm)
 {
unsigned int old_flags;
+   unsigned int gchanges;
int err;
 
old_flags = dev->flags;
@@ -2174,9 +2175,13 @@ int rtnl_configure_link(struct net_device *dev, const 
struct ifinfomsg *ifm)
return err;
}
 
-   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
+   if (dev->rtnl_link_state == RTNL_LINK_INITIALIZING) {
+   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
+   gchanges = ~0U;
+   } else
+   gchanges = dev->flags ^ old_flags;
 
-   __dev_notify_flags(dev, old_flags, ~0U);
+   __dev_notify_flags(dev, old_flags, gchanges);
return 0;
 }
 EXPORT_SYMBOL(rtnl_configure_link);

Sample output:

# ip link add dev vm1 type veth peer name vm2
5: vm2@NONE:  mtu 1500 qdisc noop state DOWN 
link/ether 36:a7:b4:dc:27:3b brd ff:ff:ff:ff:ff:ff
6: vm1@vm2:  mtu 1500 qdisc noop state DOWN 
link/ether 72:78:30:f6:e6:3a brd ff:ff:ff:ff:ff:ff
5: vm2@vm1:  mtu 1500 qdisc noop state DOWN 
link/ether 36:a7:b4:dc:27:3b brd ff:ff:ff:ff:ff:ff


Re: [PATCH net 1/5] macsec: add missing NULL check after kmalloc

2016-04-19 Thread Lance Richardson
- Original Message -
> From: "Sabrina Dubroca" 
> To: netdev@vger.kernel.org
> Cc: "Hannes Frederic Sowa" , "Johannes Berg" 
> , "Dan Carpenter"
> , "Sabrina Dubroca" 
> Sent: Tuesday, April 19, 2016 1:36:38 PM
> Subject: [PATCH net 1/5] macsec: add missing NULL check after kmalloc
> 
> Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
> Reported-by: Dan Carpenter 
> Signed-off-by: Sabrina Dubroca 
> Acked-by: Hannes Frederic Sowa 
> ---
>  drivers/net/macsec.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index 84d3e5ca8817..f691030ee3df 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
> @@ -1622,8 +1622,8 @@ static int macsec_add_rxsa(struct sk_buff *skb, struct
> genl_info *info)
>   }
>  
>   rx_sa = kmalloc(sizeof(*rx_sa), GFP_KERNEL);
> - if (init_rx_sa(rx_sa, nla_data(tb_sa[MACSEC_SA_ATTR_KEY]), 
> secy->key_len,
> -secy->icv_len)) {
> + if (!rx_sa || init_rx_sa(rx_sa, nla_data(tb_sa[MACSEC_SA_ATTR_KEY]),
> +  secy->key_len, secy->icv_len)) {

Doesn't this leak rx_sa if kmalloc() succeeds but init_rx_sa fails?

>   rtnl_unlock();
>   return -ENOMEM;
>   }
> --
> 2.8.0
> 
> 


Re: [PATCH net] ipv4: initialize flowi4_flags before calling fib_lookup()

2016-03-22 Thread Lance Richardson
Apologies, that should have been [PATCH v2 net].

- Original Message -
> From: "Lance Richardson" <lrich...@redhat.com>
> To: netdev@vger.kernel.org
> Cc: d...@cumulusnetworks.com
> Sent: Tuesday, March 22, 2016 2:56:57 PM
> Subject: [PATCH net] ipv4: initialize flowi4_flags before calling fib_lookup()
> 
> Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
> before calling fib_lookup(), which means fib_table_lookup() is
> using non-deterministic data at this line:
> 
>   if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
> 
> Fix by initializing the entire fl4 structure, which will prevent
> similar issues as fields are added in the future by ensuring that
> all fields are initialized to zero unless explicitly initialized
> to another value.
> 
> Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups")
> Suggested-by: David Ahern <d...@cumulusnetworks.com>
> Signed-off-by: Lance Richardson <lrich...@redhat.com>
> ---
>  net/ipv4/fib_frontend.c | 16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 21add55..8a9246d 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -280,7 +280,6 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
>   struct in_device *in_dev;
>   struct fib_result res;
>   struct rtable *rt;
> - struct flowi4 fl4;
>   struct net *net;
>   int scope;
>  
> @@ -296,14 +295,13 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
>  
>   scope = RT_SCOPE_UNIVERSE;
>   if (!ipv4_is_zeronet(ip_hdr(skb)->saddr)) {
> - fl4.flowi4_oif = 0;
> - fl4.flowi4_iif = LOOPBACK_IFINDEX;
> - fl4.daddr = ip_hdr(skb)->saddr;
> - fl4.saddr = 0;
> - fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
> - fl4.flowi4_scope = scope;
> - fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
> - fl4.flowi4_tun_key.tun_id = 0;
> + struct flowi4 fl4 = {
> + .flowi4_iif = LOOPBACK_IFINDEX,
> + .daddr = ip_hdr(skb)->saddr,
> + .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
> + .flowi4_scope = scope,
> + .flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0,
> + };
>   if (!fib_lookup(net, , , 0))
>   return FIB_RES_PREFSRC(net, res);
>   } else {
> --
> 2.5.5
> 
> 


[PATCH net] ipv4: initialize flowi4_flags before calling fib_lookup()

2016-03-22 Thread Lance Richardson
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:

if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.

Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern <d...@cumulusnetworks.com>
Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/fib_frontend.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 21add55..8a9246d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -280,7 +280,6 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
struct in_device *in_dev;
struct fib_result res;
struct rtable *rt;
-   struct flowi4 fl4;
struct net *net;
int scope;
 
@@ -296,14 +295,13 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 
scope = RT_SCOPE_UNIVERSE;
if (!ipv4_is_zeronet(ip_hdr(skb)->saddr)) {
-   fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = LOOPBACK_IFINDEX;
-   fl4.daddr = ip_hdr(skb)->saddr;
-   fl4.saddr = 0;
-   fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
-   fl4.flowi4_scope = scope;
-   fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
-   fl4.flowi4_tun_key.tun_id = 0;
+   struct flowi4 fl4 = {
+   .flowi4_iif = LOOPBACK_IFINDEX,
+   .daddr = ip_hdr(skb)->saddr,
+   .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
+   .flowi4_scope = scope,
+   .flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0,
+   };
if (!fib_lookup(net, , , 0))
return FIB_RES_PREFSRC(net, res);
} else {
-- 
2.5.5



Re: [PATCH net] ipv4: initialize flowi4_flags before calling fib_lookup()

2016-03-22 Thread Lance Richardson
- Original Message -
> From: "David Ahern" <d...@cumulusnetworks.com>
> To: "Lance Richardson" <lrich...@redhat.com>, netdev@vger.kernel.org
> Sent: Tuesday, March 22, 2016 2:29:02 PM
> Subject: Re: [PATCH net] ipv4: initialize flowi4_flags before calling 
> fib_lookup()
> 
> On 3/22/16 9:31 AM, Lance Richardson wrote:
> > Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
> > before calling fib_lookup(), which means fib_table_lookup() is
> > using non-deterministic data at this line:
> >
> > if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
> >
> > Fix by initializing fl4.flowi4_flags to zero.
> >
> > Signed-off-by: Lance Richardson <lrich...@redhat.com>
> > ---
> >   net/ipv4/fib_frontend.c | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> > index 21add55..896844a 100644
> > --- a/net/ipv4/fib_frontend.c
> > +++ b/net/ipv4/fib_frontend.c
> > @@ -304,6 +304,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
> > fl4.flowi4_scope = scope;
> > fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
> > fl4.flowi4_tun_key.tun_id = 0;
> > +   fl4.flowi4_flags = 0;
> > if (!fib_lookup(net, , , 0))
> > return FIB_RES_PREFSRC(net, res);
> > } else {
> >
> 
> Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups")
> 
> I think a more robust solution is to move fl4 to this if case and init
> when it is declared:
> 
>   struct flowi4 fl4 = {
>   .flowi4_iif = LOOPBACK_IFINDEX,
>   .daddr = ip_hdr(skb)->saddr,
>   .flowi4_tos = RT_TOS(ip_hdr(skb)->tos),
>   .flowi4_scope = scope,
>   .flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0,
>   };
> 

Agreed... I actually debated doing something similar but opted for the
smaller delta.

v2 coming up.

Thanks for the review,

   Lance


[PATCH net] ipv4: initialize flowi4_flags before calling fib_lookup()

2016-03-22 Thread Lance Richardson
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:

if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

Fix by initializing fl4.flowi4_flags to zero.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/fib_frontend.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 21add55..896844a 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -304,6 +304,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
fl4.flowi4_scope = scope;
fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
fl4.flowi4_tun_key.tun_id = 0;
+   fl4.flowi4_flags = 0;
if (!fib_lookup(net, , , 0))
return FIB_RES_PREFSRC(net, res);
} else {
-- 
2.5.0



Fwd: [PATCH net 0/3] ipv4: fix various issues reported by sparse

2016-01-06 Thread Lance Richardson
- Forwarded Message -
> This trivial patch series addresses a number of endianness- and
> lock-related issues reported by sparse.
> 
> Lance Richardson (3):
>   ipv4: fix endianness warnings in ip_tunnel_core.c
>   ipv4: eliminate endianness warnings in ip_fib.h
>   ipv4: eliminate lock count warnings in ping.c
> 
>  include/net/ip_fib.h  |  3 ++-
>  net/ipv4/ip_tunnel_core.c | 16 
>  net/ipv4/ping.c   |  2 ++
>  3 files changed, 12 insertions(+), 9 deletions(-)
> 
> --
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Apologies, I must have fumbled the "--cc-cmd=" option to git send-email.

Resending cover letter with cc: list.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 3/3] ipv4: eliminate lock count warnings in ping.c

2016-01-06 Thread Lance Richardson
Add lock release/acquire annotations to ping_seq_start() and
ping_seq_stop() to satisfy sparse.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/ping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index e89094a..c117b21 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1063,6 +1063,7 @@ static struct sock *ping_get_idx(struct seq_file *seq, 
loff_t pos)
 }
 
 void *ping_seq_start(struct seq_file *seq, loff_t *pos, sa_family_t family)
+   __acquires(ping_table.lock)
 {
struct ping_iter_state *state = seq->private;
state->bucket = 0;
@@ -1094,6 +1095,7 @@ void *ping_seq_next(struct seq_file *seq, void *v, loff_t 
*pos)
 EXPORT_SYMBOL_GPL(ping_seq_next);
 
 void ping_seq_stop(struct seq_file *seq, void *v)
+   __releases(ping_table.lock)
 {
read_unlock_bh(_table.lock);
 }
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 1/3] ipv4: fix endianness warnings in ip_tunnel_core.c

2016-01-06 Thread Lance Richardson
Eliminate endianness mismatch warnings (reported by sparse) in this file by
using appropriate nla_put_*()/nla_get_*() calls.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 net/ipv4/ip_tunnel_core.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 6cb9009..bf70c1c 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -251,7 +251,7 @@ static int ip_tun_build_state(struct net_device *dev, 
struct nlattr *attr,
tun_info = lwt_tun_info(new_state);
 
if (tb[LWTUNNEL_IP_ID])
-   tun_info->key.tun_id = nla_get_u64(tb[LWTUNNEL_IP_ID]);
+   tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP_ID]);
 
if (tb[LWTUNNEL_IP_DST])
tun_info->key.u.ipv4.dst = nla_get_be32(tb[LWTUNNEL_IP_DST]);
@@ -266,7 +266,7 @@ static int ip_tun_build_state(struct net_device *dev, 
struct nlattr *attr,
tun_info->key.tos = nla_get_u8(tb[LWTUNNEL_IP_TOS]);
 
if (tb[LWTUNNEL_IP_FLAGS])
-   tun_info->key.tun_flags = nla_get_u16(tb[LWTUNNEL_IP_FLAGS]);
+   tun_info->key.tun_flags = nla_get_be16(tb[LWTUNNEL_IP_FLAGS]);
 
tun_info->mode = IP_TUNNEL_INFO_TX;
tun_info->options_len = 0;
@@ -281,12 +281,12 @@ static int ip_tun_fill_encap_info(struct sk_buff *skb,
 {
struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
 
-   if (nla_put_u64(skb, LWTUNNEL_IP_ID, tun_info->key.tun_id) ||
+   if (nla_put_be64(skb, LWTUNNEL_IP_ID, tun_info->key.tun_id) ||
nla_put_be32(skb, LWTUNNEL_IP_DST, tun_info->key.u.ipv4.dst) ||
nla_put_be32(skb, LWTUNNEL_IP_SRC, tun_info->key.u.ipv4.src) ||
nla_put_u8(skb, LWTUNNEL_IP_TOS, tun_info->key.tos) ||
nla_put_u8(skb, LWTUNNEL_IP_TTL, tun_info->key.ttl) ||
-   nla_put_u16(skb, LWTUNNEL_IP_FLAGS, tun_info->key.tun_flags))
+   nla_put_be16(skb, LWTUNNEL_IP_FLAGS, tun_info->key.tun_flags))
return -ENOMEM;
 
return 0;
@@ -346,7 +346,7 @@ static int ip6_tun_build_state(struct net_device *dev, 
struct nlattr *attr,
tun_info = lwt_tun_info(new_state);
 
if (tb[LWTUNNEL_IP6_ID])
-   tun_info->key.tun_id = nla_get_u64(tb[LWTUNNEL_IP6_ID]);
+   tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP6_ID]);
 
if (tb[LWTUNNEL_IP6_DST])
tun_info->key.u.ipv6.dst = 
nla_get_in6_addr(tb[LWTUNNEL_IP6_DST]);
@@ -361,7 +361,7 @@ static int ip6_tun_build_state(struct net_device *dev, 
struct nlattr *attr,
tun_info->key.tos = nla_get_u8(tb[LWTUNNEL_IP6_TC]);
 
if (tb[LWTUNNEL_IP6_FLAGS])
-   tun_info->key.tun_flags = nla_get_u16(tb[LWTUNNEL_IP6_FLAGS]);
+   tun_info->key.tun_flags = nla_get_be16(tb[LWTUNNEL_IP6_FLAGS]);
 
tun_info->mode = IP_TUNNEL_INFO_TX | IP_TUNNEL_INFO_IPV6;
tun_info->options_len = 0;
@@ -376,12 +376,12 @@ static int ip6_tun_fill_encap_info(struct sk_buff *skb,
 {
struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
 
-   if (nla_put_u64(skb, LWTUNNEL_IP6_ID, tun_info->key.tun_id) ||
+   if (nla_put_be64(skb, LWTUNNEL_IP6_ID, tun_info->key.tun_id) ||
nla_put_in6_addr(skb, LWTUNNEL_IP6_DST, _info->key.u.ipv6.dst) 
||
nla_put_in6_addr(skb, LWTUNNEL_IP6_SRC, _info->key.u.ipv6.src) 
||
nla_put_u8(skb, LWTUNNEL_IP6_HOPLIMIT, tun_info->key.tos) ||
nla_put_u8(skb, LWTUNNEL_IP6_TC, tun_info->key.ttl) ||
-   nla_put_u16(skb, LWTUNNEL_IP6_FLAGS, tun_info->key.tun_flags))
+   nla_put_be16(skb, LWTUNNEL_IP6_FLAGS, tun_info->key.tun_flags))
return -ENOMEM;
 
return 0;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 0/3] ipv4: fix various issues reported by sparse

2016-01-06 Thread Lance Richardson
This trivial patch series addresses a number of endianness- and
lock-related issues reported by sparse.

Lance Richardson (3):
  ipv4: fix endianness warnings in ip_tunnel_core.c
  ipv4: eliminate endianness warnings in ip_fib.h
  ipv4: eliminate lock count warnings in ping.c

 include/net/ip_fib.h  |  3 ++-
 net/ipv4/ip_tunnel_core.c | 16 
 net/ipv4/ping.c   |  2 ++
 3 files changed, 12 insertions(+), 9 deletions(-)

-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 2/3] ipv4: eliminate endianness warnings in ip_fib.h

2016-01-06 Thread Lance Richardson
fib_multipath_hash() computes a hash using __be32 values, force
cast these to u32 to pacify sparse.

Signed-off-by: Lance Richardson <lrich...@redhat.com>
---
 include/net/ip_fib.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9f4df68..7029527 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -325,7 +325,8 @@ extern u32 fib_multipath_secret __read_mostly;
 
 static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
 {
-   return jhash_2words(saddr, daddr, fib_multipath_secret) >> 1;
+   return jhash_2words((__force u32)saddr, (__force u32)daddr,
+   fib_multipath_secret) >> 1;
 }
 
 void fib_select_multipath(struct fib_result *res, int hash);
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html