[PATCH net] tcp: limit sk_rcvlowat by the maximum receive buffer

2018-06-08 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket which can stall progress
on the socket.

Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].

Fixes: d1361840f8c5 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
Acked-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba2..141acd92e58ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1694,6 +1694,13 @@ EXPORT_SYMBOL(tcp_peek_len);
 /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {
+	int cap;
+
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		cap = sk->sk_rcvbuf >> 1;
+	else
+		cap = sock_net(sk)->ipv4.sysctl_tcp_rmem[2] >> 1;
+	val = min(val, cap);
 	sk->sk_rcvlowat = val ? : 1;
 
 	/* Check if we need to signal EPOLLIN right now */
@@ -1702,12 +1709,7 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
 
-	/* val comes from user space and might be close to INT_MAX */
 	val <<= 1;
-	if (val < 0)
-		val = INT_MAX;
-
-	val = min(val, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
 	if (val > sk->sk_rcvbuf) {
 		sk->sk_rcvbuf = val;
 		tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
-- 
2.18.0.rc1.242.g61856ae69a-goog
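
The clamping rule described in the commit message can be modelled in user space. The following is a minimal sketch, not the kernel code: the function name and parameters are hypothetical stand-ins for sk_rcvbuf, the SOCK_RCVBUF_LOCK test and sysctl_tcp_rmem[2], and it assumes the caller has already replaced negative requests, so only non-negative values arrive here.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the cap applied in tcp_set_rcvlowat().
 * All names are illustrative, not kernel identifiers.
 *   val           - requested SO_RCVLOWAT value (assumed non-negative)
 *   rcvbuf        - current receive buffer size
 *   rcvbuf_locked - true when the user fixed the buffer size
 *   rmem_max      - upper bound for an autotuned receive buffer
 */
static int sketch_clamp_rcvlowat(int val, int rcvbuf, bool rcvbuf_locked,
				 int rmem_max)
{
	/* The cap is half of the largest buffer this socket can reach. */
	int cap = (rcvbuf_locked ? rcvbuf : rmem_max) >> 1;

	assert(cap >= 0);
	if (val > cap)
		val = cap;

	/* Mirrors "val ? : 1": a request of zero is stored as 1. */
	return val ? val : 1;
}
```

Because the result never exceeds half of the maximum buffer, doubling it afterwards cannot overflow or exceed that maximum, which is consistent with the patch dropping the INT_MAX and sysctl_tcp_rmem[2] checks after `val <<= 1`.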



Re: Qualcomm rmnet driver and qmi_wwan

2018-06-08 Thread Subash Abhinov Kasiviswanathan

This sounds like a good idea. I probably won't have any time to look at
this in the near future, though.  Sorry about that. Extremely overloaded
both at work and private right now...

But I trust that you and Daniele can work out something. Please keep me
CCed, but don't expect timely replies.



Hi Daniele

Can you try out the attached patch?
I have added a new sysfs attribute, pass_through, to be used in raw_ip mode
only. Once you attach rmnet devices on it, the rx_handler will be set up
and the packet will be processed by rmnet.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

From bccfae3707af1be671fe55ea63123438f2dc38a8 Mon Sep 17 00:00:00 2001
From: Subash Abhinov Kasiviswanathan 
Date: Fri, 8 Jun 2018 19:53:08 -0600
Subject: [PATCH] net: qmi_wwan: Add pass through mode

Pass through mode is to allow packets in MAP format to be passed
on to the stack. rmnet driver can be used to process and demultiplex
these packets. Note that pass through mode can be enabled when the
device is in raw ip mode only.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/usb/qmi_wwan.c | 72 ++
 1 file changed, 72 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 8e8b51f..f52a9be 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -59,6 +59,7 @@ struct qmi_wwan_state {
 enum qmi_wwan_flags {
 	QMI_WWAN_FLAG_RAWIP = 1 << 0,
 	QMI_WWAN_FLAG_MUX = 1 << 1,
+	QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
 };
 
 enum qmi_wwan_quirks {
@@ -425,14 +426,80 @@ static ssize_t del_mux_store(struct device *d,  struct device_attribute *attr, c
 	return ret;
 }
 
+static ssize_t pass_through_show(struct device *d,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+
+	info = (void *)&dev->data;
+	return sprintf(buf, "%c\n",
+		   info->flags & QMI_WWAN_FLAG_PASS_THROUGH ? 'Y' : 'N');
+}
+
+static ssize_t pass_through_store(struct device *d,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+	bool enable;
+	int ret;
+
+	if (strtobool(buf, &enable))
+		return -EINVAL;
+
+	info = (void *)&dev->data;
+
+	/* no change? */
+	if (enable == !!(info->flags & QMI_WWAN_FLAG_PASS_THROUGH))
+		return len;
+
+	/* pass through mode can be set for raw ip devices only */
+	if (!(info->flags & QMI_WWAN_FLAG_RAWIP))
+		return -EINVAL;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	/* we don't want to modify a running netdev */
+	if (netif_running(dev->net)) {
+		netdev_err(dev->net, "Cannot change a running device\n");
+		ret = -EBUSY;
+		goto err;
+	}
+
+	/* let other drivers deny the change */
+	ret = call_netdevice_notifiers(NETDEV_PRE_TYPE_CHANGE, dev->net);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		netdev_err(dev->net, "Type change was refused\n");
+		goto err;
+	}
+
+	if (enable)
+		info->flags |= QMI_WWAN_FLAG_PASS_THROUGH;
+	else
+		info->flags &= ~QMI_WWAN_FLAG_PASS_THROUGH;
+	qmi_wwan_netdev_setup(dev->net);
+	call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev->net);
+	ret = len;
+err:
+	rtnl_unlock();
+	return ret;
+}
+
 static DEVICE_ATTR_RW(raw_ip);
 static DEVICE_ATTR_RW(add_mux);
 static DEVICE_ATTR_RW(del_mux);
+static DEVICE_ATTR_RW(pass_through);
 
 static struct attribute *qmi_wwan_sysfs_attrs[] = {
 	&dev_attr_raw_ip.attr,
 	&dev_attr_add_mux.attr,
 	&dev_attr_del_mux.attr,
+	&dev_attr_pass_through.attr,
 	NULL,
 };
 
@@ -479,6 +546,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	if (info->flags & QMI_WWAN_FLAG_MUX)
 		return qmimux_rx_fixup(dev, skb);
 
+	if (rawip && (info->flags & QMI_WWAN_FLAG_PASS_THROUGH)) {
+		skb->protocol = htons(ETH_P_MAP);
+		return (netif_rx(skb) == NET_RX_SUCCESS);
+	}
+
 	switch (skb->data[0] & 0xf0) {
 	case 0x40:
 		proto = htons(ETH_P_IP);
-- 
1.9.1
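
The receive-path decision added by this patch can be illustrated outside the kernel. This is a simplified sketch under stated assumptions: the SKETCH_* names and sketch_classify() are hypothetical, the mux branch and the other cases of the driver's switch are omitted, and the EtherType constants are copied by value rather than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits mirroring enum qmi_wwan_flags in the patch. */
enum sketch_flags {
	SKETCH_FLAG_RAWIP        = 1 << 0,
	SKETCH_FLAG_MUX          = 1 << 1,
	SKETCH_FLAG_PASS_THROUGH = 1 << 2,
};

/* EtherType values (host byte order) for IPv4, IPv6 and Qualcomm MAP. */
#define SKETCH_ETH_P_IP   0x0800
#define SKETCH_ETH_P_IPV6 0x86DD
#define SKETCH_ETH_P_MAP  0x00F9

/* Pick a protocol for a received frame; returns 0 when unclassified. */
static uint16_t sketch_classify(unsigned int flags, const uint8_t *data,
				size_t len)
{
	assert(data != NULL || len == 0);

	/* Pass through applies only while raw IP mode is also set. */
	if ((flags & SKETCH_FLAG_RAWIP) &&
	    (flags & SKETCH_FLAG_PASS_THROUGH))
		return SKETCH_ETH_P_MAP;

	if (len == 0)
		return 0;

	/* Otherwise inspect the IP version nibble of the first byte. */
	switch (data[0] & 0xf0) {
	case 0x40:
		return SKETCH_ETH_P_IP;
	case 0x60:
		return SKETCH_ETH_P_IPV6;
	default:
		return 0;
	}
}
```

With both flags set, every frame is tagged as MAP regardless of its first byte, which is what lets a handler such as rmnet demultiplex it later.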



Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Jakub Kicinski
On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

IMHO it may be a better idea to look at the VF as acceleration for the
PV rather than at the PV as a migration vehicle for the VF.  Hence we should
continue to follow the naming of the PV, as the current implementation
does implicitly by linking to the PV's struct device.


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 15:25:59 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> > "Michael S. Tsirkin"  wrote:
>> >> >
>> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> > >object with function callbacks (see callback hell).
>> >> >> >
>> >> >> > Why just a library? It should do a common things. I think it should be a
>> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> > model.
>> >> >>
>> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> handling of renames.
>> >> >>
>> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> concerned about risk of breaking some userspace.
>> >> >>
>> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> address, and you said then "why not use existing network namespaces
>> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> compatibility?
>> >> >>
>> >> >
>> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> > After this history, there is a strong motivation not to change how kernel
>> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> > existing userspace.
>> >> >
>> >> > With virtio you can  work it out with the distro's yourself.
>> >> > There is no pre-existing semantics to deal with.
>> >> >
>> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >>
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
>
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.

I assume the PV can get that information ahead of time, before the VF
arrives? Without it, how do you match the device when you see a VF
coming with some serial number? Is it possible for the PV to get the
matching SN even earlier, during probe time? Or does it have to depend on the
presence of the vPCI bridge to generate this SN?

>
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

Yeah, this is a chicken-egg problem indeed, and that was the reason
why I supply the BDF info for PV to name the master interface.
However, the ACPI PCI slot needs to depend on the PCI bus enumeration
so that can't be predictable.  Would it make sense to only rename 

Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices

2018-06-08 Thread Subash Abhinov Kasiviswanathan

Actually, I think this is fine. RFC 7136 clarified this, and says:

==
   Thus, we can conclude that the value of the "u" bit in IIDs has no
   particular meaning.  In the case of an IID created from a MAC address
   according to RFC 4291, its value is determined by the MAC address,
   but that is all.
[...]
   Specifications of other forms of 64-bit IIDs MUST specify how all 64
   bits are set, but a generic semantic meaning for the "u" and "g" bits
   MUST NOT be defined.  However, the method of generating IIDs for
   specific link types MAY define some local significance for certain
   bits.

   In all cases, the bits in an IID have no generic semantics; in other
   words, they have opaque values.  In fact, the whole IID value MUST be
   viewed as an opaque bit string by third parties, except possibly in
   the local context.
==

That said - we already have a way to form all-random IIDs:
IN6_ADDR_GEN_MODE_RANDOM. Can't you just ensure that links of type
ARPHRD_RAWIP always use IN6_ADDR_GEN_MODE_RANDOM? That might lead to
less special-casing.


Hi Lorenzo

In v2 of this patchset, I used addrconf_ifid_ip6tnl() similar to
IP6 Tunnels / VTI6, so I didn't need that way of generating the IID.
addrconf_ifid_ip6tnl() also provides fixed IIDs during the lifetime of the
net device while IN6_ADDR_GEN_MODE_RANDOM generates different addresses.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
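
For context on the "u" bit discussed in the quoted RFC 7136 text, this is how an IID is formed from a MAC address under RFC 4291 (modified EUI-64). It is a stand-alone sketch with a hypothetical function name, not code from the patch under discussion; raw IP links have no MAC address, which is why the thread is about other ways of choosing the IID.

```c
#include <stdint.h>
#include <string.h>

/* Build a modified EUI-64 interface identifier from a 48-bit MAC
 * address, as described in RFC 4291, Appendix A.
 */
static void sketch_eui64_from_mac(const uint8_t mac[6], uint8_t iid[8])
{
	memcpy(iid, mac, 3);          /* first three MAC octets */
	iid[3] = 0xff;                /* fixed filler octets */
	iid[4] = 0xfe;
	memcpy(iid + 5, mac + 3, 3);  /* last three MAC octets */
	iid[0] ^= 0x02;               /* invert the universal/local ("u") bit */
}
```

For the MAC address 34:56:78:9a:bc:de this yields 36:56:78:ff:fe:9a:bc:de, so here the "u" bit is simply a by-product of the MAC address, as the RFC text says.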


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2018 16:44:12 -0700
Siwei Liu  wrote:

> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>  wrote:
> > On Fri, 8 Jun 2018 15:25:59 -0700
> > Siwei Liu  wrote:
> >  
> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >>  wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin"  wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
> >> >> > >The net failover should be a simple library, not a virtual
> >> >> > >object with function callbacks (see callback hell).
> >> >> >
> >> >> > Why just a library? It should do a common things. I think it should be a
> >> >> > virtual object. Looks like your patch again splits the common
> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> > model.  
> >> >>
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> >
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.
> >> >
> >> > With virtio you can  work it out with the distro's yourself.
> >> > There is no pre-existing semantics to deal with.
> >> >
> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >>
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

The host only guarantees that the PV device will be on the same network.
It does not make any PCI guarantees. The way Windows works is to find
the device based on "serial number", which is a Hyper-V-specific attribute
of PCI devices.

I considered naming off of the serial number, but that won't work for the
case where the PV device is present first and the VF arrives later. The serial
number is an attribute of the VF, not of the PV, which is there first.

Your ideas about having the PCI information of the VF form the name
of the failover device have the same problem. The PV device may
be the only one present on boot.


> > On Azure, the VF maybe removed (by host) at any time and then later
> > reattached. There is no guarantee that VF will show back up at
> > the same synthetic PCI address. It will likely have a different
> > PCI domain value.  
> 
> This is something QEMU can do and make sure the PCI address is
> consistent after migration.
> 
> -Siwei



Re: [PATCH net] udp: fix rx queue len reported by diag and proc interface

2018-06-08 Thread David Miller
From: Paolo Abeni 
Date: Fri,  8 Jun 2018 11:35:40 +0200

> After commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
> the sk_rmem_alloc field does not measure exactly anymore the
> receive queue length, because we batch the rmem release. The issue
> is really apparent only after commit 0d4a6608f68c ("udp: do rmem bulk
> free even if the rx sk queue is empty"): the user space can easily
> check for an empty socket with not-0 queue length reported by the 'ss'
> tool or the procfs interface.
> 
> We need to use a custom UDP helper to report the correct queue length,
> taking into account the forward allocation deficit.
> 
> Reported-by: trevor.fran...@46labs.com
> Fixes: 6b229cf77d68 ("UDP: add batching to udp_rmem_release()")
> Signed-off-by: Paolo Abeni 

Applied and queued up for -stable, thanks.
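
The correction described in the quoted commit message amounts to subtracting the not-yet-released batch from the charged memory. A minimal user-space sketch of that arithmetic follows; the struct and function names are hypothetical and the two fields only echo the quantities named in the commit message, so this is not the helper the patch actually adds.

```c
/* Illustrative model of the two counters involved. */
struct sketch_udp_sock {
	int rmem_alloc;       /* bytes still charged to the socket */
	int forward_deficit;  /* bytes dequeued but not yet released */
};

/* Receive queue length as it should be reported to user space:
 * charged memory minus the part whose release is merely deferred.
 */
static int sketch_udp_rqueue_len(const struct sketch_udp_sock *sk)
{
	return sk->rmem_alloc - sk->forward_deficit;
}
```

With batching, an empty socket can still have rmem_alloc equal to forward_deficit, so the raw counter reads non-zero while the corrected value is 0.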


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu  wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>>  wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can  work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having the VF on a different PCI address after migration
is essentially equal to plugging in a new NIC. Why does it have to pair with
the original PV? A separate PV device should be in place to pair with the
new VF.


> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.

This is something QEMU can do and make sure the PCI address is
consistent after migration.

-Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2018 15:25:59 -0700
Siwei Liu  wrote:

> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>  wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin"  wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> >> > >The net failover should be a simple library, not a virtual
> >> > >object with function callbacks (see callback hell).  
> >> >
> >> > Why just a library? It should do a common things. I think it should be a
> >> > virtual object. Looks like your patch again splits the common
> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> > I don't get it. We should rather focus on fixing the mess the
> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> > model.  
> >>
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> >
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.
> >
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> >
> > For the virtio, I don't see the need for IFF_HIDDEN.  
> 
> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> that flag, as well as the 1-netdev model, is to have a means to
> inherit the interface name from the VF, and to eliminate playing hacks
> around renaming devices, customizing udev rules and et al. Why
> inheriting VF's name important? To allow existing config/setup around
> VF continues to work across kernel feature upgrade. Most of network
> config files in all distros are based on interface names. Few are MAC
> address based but making lower slaves hidden would cover the rest. And
> most importantly, preserving the same level of user experience as
> using raw VF interface once getting all ndo_ops and ethtool_ops
> exposed. This is essential to realize transparent live migration that
> users dont have to learn and be aware of the undertaken.

Inheriting the VF name will fail in the migration scenario.
It is perfectly reasonable to migrate a guest to another machine where
the VF PCI address is different. And since current udev/systemd model
is to base network device name off of PCI address, the device will change
name when guest is migrated.

On Azure, the VF may be removed (by the host) at any time and then later
reattached. There is no guarantee that the VF will show back up at
the same synthetic PCI address. It will likely have a different
PCI domain value.


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
 wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>>>
>>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>>>>>
>>>>> The net failover should be a simple library, not a virtual
>>>>> object with function callbacks (see callback hell).
>>>>
>>>> Why just a library? It should do a common things. I think it should be a
>>>> virtual object. Looks like your patch again splits the common
>>>> functionality into multiple drivers. That is kind of backwards attitude.
>>>> I don't get it. We should rather focus on fixing the mess the
>>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>>> model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen?  Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3 when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboots if you have more than one VF
to pair with? On one boot it gets eth0/eth0nsby; on the next it may get
eth1/eth1nsby on the same interface.

It won't be usable by default until you add some custom udev rules.

-Siwei

>
>
>>
>> With virtio you can  work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>


Re: [PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA

2018-06-08 Thread David Miller
From: Davide Caratti 
Date: Fri,  8 Jun 2018 05:02:31 +0200

> use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
> netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
> end with '\0' character.
> 
> v2: fix errors in the commit message, thanks Hangbin Liu
> 
> Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable.
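
The hazard the quoted commit message describes is a string attribute that is shorter than the destination buffer and carries no terminating NUL. Below is a user-space model of a bounded copy that handles it; the function name and the raw pointer/length parameters are assumptions made for illustration (the kernel helper takes a netlink attribute instead), so treat it as a sketch of the behaviour rather than the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

/* Copy an attribute payload of srclen bytes into dst (dstsize bytes),
 * never reading past the payload and always NUL-terminating dst.
 * Returns the payload length without a trailing NUL, so a caller can
 * detect truncation by comparing the result against dstsize.
 */
static size_t sketch_attr_strlcpy(char *dst, size_t dstsize,
				  const char *src, size_t srclen)
{
	/* A payload may or may not already end with a NUL byte. */
	if (srclen > 0 && src[srclen - 1] == '\0')
		srclen--;

	if (dstsize > 0) {
		size_t len = (srclen >= dstsize) ? dstsize - 1 : srclen;

		memset(dst, 0, dstsize);  /* guarantees termination */
		memcpy(dst, src, len);
	}

	return srclen;
}
```

The copy length is bounded by both the payload and the destination, which is the property a plain fixed-size copy lacks.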


Re: [PATCH net] net: aquantia: fix unsigned numvecs comparison with less than zero

2018-06-08 Thread David Miller
From: Igor Russkikh 
Date: Thu,  7 Jun 2018 17:54:37 -0400

> From: Colin Ian King 
> 
> This was originally mistakenly submitted to net-next. Resubmitting to net.
> 
> The comparison of numvecs < 0 is always false because numvecs is a u32
> and hence the error return from a failed call to pci_alloc_irq_vectores
> is never detected.  Fix this by using the signed int ret to handle the
> error return and assign numvecs to err.
> 
> Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0")
> 
> Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually allocated irqs")
> Signed-off-by: Colin Ian King 
> Signed-off-by: Igor Russkikh 

Applied and queued up for -stable, thanks.
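
The class of bug fixed here can be reproduced in a few lines of user-space C. In this sketch the function names are hypothetical and alloc_result stands in for the return value of the vector allocation call, which is either a positive count or a negative errno.

```c
#include <stdint.h>

/* What happens when a negative error code lands in a u32: it wraps to
 * a huge positive number, so a "< 0" test on it can never be true.
 */
static uint32_t sketch_store_unsigned(int alloc_result)
{
	return (uint32_t)alloc_result;
}

/* The corrected pattern: check the signed value first and only then
 * assign it to the unsigned counter.
 */
static int sketch_setup_vectors(int alloc_result, uint32_t *numvecs)
{
	int ret = alloc_result;

	if (ret < 0)
		return ret;        /* error is detected and propagated */

	*numvecs = (uint32_t)ret;  /* safe: ret is known non-negative */
	return 0;
}
```

For an error value of -28, the unsigned variable would hold 4294967268, which looks like a very large vector count rather than a failure.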


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin"  wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.

I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate hacks
around renaming devices, customizing udev rules, et al. Why is
inheriting the VF's name important? To allow existing config/setup around
the VF to continue to work across a kernel feature upgrade. Most network
config files in all distros are based on interface names. A few are MAC
address based, but making lower slaves hidden would cover the rest. And
most importantly, it preserves the same level of user experience as
using the raw VF interface once all ndo_ops and ethtool_ops are
exposed. This is essential to realize transparent live migration that
users don't have to learn about or be aware of.

It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed; the 3-netdev model
would work for them. For the rest, the pre-existing semantics for them
is the single VF device model they've already dealt with. This is
no different from having Azure stick to the 2-netdev model
because of its existing user base IMHO.

-Siwei


> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter

2018-06-08 Thread Tushar Dave




On 06/08/2018 02:46 PM, Tushar Dave wrote:



On 06/08/2018 02:27 PM, Daniel Borkmann wrote:

On 06/08/2018 11:00 PM, Tushar Dave wrote:

Today socket filter only deals with linear skbs. This change allows
ebpf programs to look into non-linear skb e.g. skb frags. This will be
useful when users need to look into data which is not contained in the
linear part of skb.


Hmm, I don't think this statement is correct in its form here ... they
can handle non-linear skbs just fine.

Thanks Daniel for your reply.


Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
internally skb_header_pointer(), and that one of course walks everything
if it really has to via skb_copy_bits() (page frags _and_ frag list). And
if you need to look into mac/net headers that may otherwise not be accessible
anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
which is effectively doing the negative offset trick from ld_abs/ind more
efficient for multi-byte loads.

I'm looking into bpf_skb_load_bytes and friends.


Daniel,

While I am trying to see if I can use the existing bpf_skb_load helpers, I am
wondering whether socket filter based ebpf programs are allowed to change
packet data. In other words, can we use them to build a firewall?

Thanks.

-Tushar


Thanks.
-Tushar


Thanks,
Daniel





Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"

2018-06-08 Thread Eric Dumazet



On 06/08/2018 02:38 PM, Eric Dumazet wrote:
> 
> 
> On 06/08/2018 02:04 PM, Michal Kubecek wrote:

>>
>> However, the lockless listener was introduced in 4.4 so it's not clear
>> why reporter started encountering this after an upgrade from 4.13 to
>> 4.15.
> 
> Yes, I do not buy this at all.
> 
> If two identical SYN are received by two cpus, we should create one SYN_RECV
> and send two SYNACK.
> 
> But it is a bit hard to test this :/
> 
> I will take a look, thanks.


Oh well, this is not done as I thought, this needs a fix, I will work on this.

reqsk_queue_hash_req() calls inet_ehash_insert() without making sure that the
same 4-tuple is not already there.

Do not worry, we will keep the listener lockless :)
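
The missing check described above (insert only if the same 4-tuple is not already present) can be sketched with a toy table. This is an illustration only: toy_table and toy_insert_unique() are hypothetical names, the table is a flat array rather than the kernel's ehash, and the sketch assumes the caller already serializes lookups and inserts for the bucket.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct four_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

#define TOY_SLOTS 64

/* Toy stand-in for a hash bucket; the caller is assumed to hold its lock. */
struct toy_table {
	struct four_tuple slot[TOY_SLOTS];
	bool used[TOY_SLOTS];
};

bool tuple_equal(const struct four_tuple *a, const struct four_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/*
 * Insert key unless an equal 4-tuple already exists.
 * Returns 0 on insert, -EEXIST for a duplicate, -ENOSPC if full.
 * Lookup and insert happen in one pass, so with the bucket lock held
 * two identical SYNs cannot both create an entry.
 */
int toy_insert_unique(struct toy_table *t, const struct four_tuple *key)
{
	size_t i, free_slot = TOY_SLOTS;

	assert(t != NULL && key != NULL);

	for (i = 0; i < TOY_SLOTS; i++) {
		if (t->used[i]) {
			if (tuple_equal(&t->slot[i], key))
				return -EEXIST;
		} else if (free_slot == TOY_SLOTS) {
			free_slot = i;
		}
	}
	if (free_slot == TOY_SLOTS)
		return -ENOSPC;

	t->slot[free_slot] = *key;
	t->used[free_slot] = true;
	return 0;
}
```

In this model the second of two identical SYNs gets -EEXIST and can reuse the existing request instead of creating a second one with a different sequence number.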





Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter

2018-06-08 Thread Tushar Dave




On 06/08/2018 02:27 PM, Daniel Borkmann wrote:

On 06/08/2018 11:00 PM, Tushar Dave wrote:

Today socket filter only deals with linear skbs. This change allows
ebpf programs to look into non-linear skb e.g. skb frags. This will be
useful when users need to look into data which is not contained in the
linear part of skb.


Hmm, I don't think this statement is correct in its form here ... they
can handle non-linear skbs just fine.

Thanks Daniel for your reply.


Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
internally skb_header_pointer(), and that one of course walks everything
if it really has to via skb_copy_bits() (page frags _and_ frag list). And
if you need to look into mac/net headers that may otherwise not be accessible
anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
which is effectively doing the negative offset trick from ld_abs/ind more
efficient for multi-byte loads.

I'm looking into bpf_skb_load_bytes and friends.

Thanks.
-Tushar


Thanks,
Daniel



Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"

2018-06-08 Thread Eric Dumazet



On 06/08/2018 02:04 PM, Michal Kubecek wrote:
> On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199995
>>
>> Bug ID: 199995
>>Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
>> "brodcast"
>>
>> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04
>> (Kernel 4.15.0) I suffer from randomly generated TCP RST packets sent
>> (presumably) by the Kernel on a bonding device that uses bonding mode
>> "broadcast" with 2 physical NICs.
>>
>> With tcpdump/wireshark I can see that the kernel randomly sends TCP-RST
>> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
>> This only happens if the kernel receives the initial SYN packet on both
>> physical NICs (and therefore seeing it twice), before the connection is
>> established by sending SYN/ACK.
>> It's not happening in 100% of all cases and only if the system can use two
>> or more CPU cores/threads. With only one CPU available to the system, this
>> behaviour is not reproducible.
> 
> I have seen a similar report earlier from one of our customers running
> SLE12 SP2 (kernel 4.4). The problem is that if a duplicated SYN packet is
> received on both slaves, these two copies can be processed by the
> lockless listener simultaneously on different CPUs and each can reply with
> a SYNACK with a different sequence number, which results in a reset.
> 
> I tried to think of a way to prevent this race without losing the
> performance gain of lockless listener but couldn't come up with anything.
> Eventually, I managed to persuade the customer that this setup (where
> each packet is received twice under normal circumstances) is not what
> broadcast mode was designed for (based on the description in
> Documentation/networking/bonding.txt).
> 
> However, the lockless listener was introduced in 4.4 so it's not clear
> why reporter started encountering this after an upgrade from 4.13 to
> 4.15.

Yes, I do not buy this at all.

If two identical SYN are received by two cpus, we should create one SYN_RECV
and send two SYNACK.

But it is a bit hard to test this :/

I will take a look, thanks.




Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter

2018-06-08 Thread Daniel Borkmann
On 06/08/2018 11:00 PM, Tushar Dave wrote:
> Today socket filter only deals with linear skbs. This change allows
> ebpf programs to look into non-linear skb e.g. skb frags. This will be
> useful when users need to look into data which is not contained in the
> linear part of skb.

Hmm, I don't think this statement is correct in its form here ... they
can handle non-linear skbs just fine.

Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
internally skb_header_pointer(), and that one of course walks everything
if it really has to via skb_copy_bits() (page frags _and_ frag list). And
if you need to look into mac/net headers that may otherwise not be accessible
anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
which is effectively doing the negative offset trick from ld_abs/ind more
efficient for multi-byte loads.

Thanks,
Daniel
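
The point above, that a byte-range load can walk the linear area and then the page frags, can be illustrated with a small userspace model. This is a sketch, not the kernel's skb_copy_bits(): toy_skb, toy_frag and toy_load_bytes() are hypothetical names, the frag list is not modelled, and unlike the real helper the output buffer is not cleared on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_frag {
	const unsigned char *data;
	size_t len;
};

/* Toy packet: a linear head followed by an array of frags. */
struct toy_skb {
	const unsigned char *head;
	size_t headlen;
	const struct toy_frag *frags;
	size_t nr_frags;
};

/*
 * Copy len bytes starting at offset into to, crossing from the linear
 * part into the frags as needed.
 * Returns 0 on success, -1 if the range runs past the end of the packet.
 */
int toy_load_bytes(const struct toy_skb *skb, size_t offset,
		   void *to, size_t len)
{
	unsigned char *out = to;
	size_t i;

	assert(skb != NULL && to != NULL);

	if (offset < skb->headlen) {
		/* Some or all of the range sits in the linear area. */
		size_t n = skb->headlen - offset;

		if (n > len)
			n = len;
		memcpy(out, skb->head + offset, n);
		out += n;
		len -= n;
		offset = 0;
	} else {
		/* Make offset relative to the start of the frag area. */
		offset -= skb->headlen;
	}

	for (i = 0; i < skb->nr_frags && len > 0; i++) {
		const struct toy_frag *f = &skb->frags[i];
		size_t n;

		if (offset >= f->len) {
			offset -= f->len;
			continue;
		}
		n = f->len - offset;
		if (n > len)
			n = len;
		memcpy(out, f->data + offset, n);
		out += n;
		len -= n;
		offset = 0;
	}
	return len == 0 ? 0 : -1;
}
```

The caller never needs to know where the linear part ends, which is the property that makes a load-bytes style helper usable on non-linear packets.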


Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"

2018-06-08 Thread Michal Kubecek
On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199995
> 
> Bug ID: 199995
>Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
> "brodcast"
> 
> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04
> (Kernel 4.15.0) I suffer from randomly generated TCP RST packets sent
> (presumably) by the Kernel on a bonding device that uses bonding mode
> "broadcast" with 2 physical NICs.
> 
> With tcpdump/wireshark I can see that the kernel randomly sends TCP-RST
> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
> This only happens if the kernel receives the initial SYN packet on both
> physical NICs (and therefore seeing it twice), before the connection is
> established by sending SYN/ACK.
> It's not happening in 100% of all cases and only if the system can use two
> or more CPU cores/threads. With only one CPU available to the system, this
> behaviour is not reproducible.

I have seen a similar report earlier from one of our customers running
SLE12 SP2 (kernel 4.4). The problem is that if a duplicated SYN packet is
received on both slaves, these two copies can be processed by the
lockless listener simultaneously on different CPUs and each can reply with
a SYNACK with a different sequence number, which results in a reset.

I tried to think of a way to prevent this race without losing the
performance gain of lockless listener but couldn't come up with anything.
Eventually, I managed to persuade the customer that this setup (where
each packet is received twice under normal circumstances) is not what
broadcast mode was designed for (based on the description in
Documentation/networking/bonding.txt).

However, the lockless listener was introduced in 4.4 so it's not clear
why reporter started encountering this after an upgrade from 4.13 to
4.15.

Michal Kubecek


[RFC PATCH 3/3] rds: invoke sk filter attached to rds socket

2018-06-08 Thread Tushar Dave
RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deals with skb (e.g. struct sk_buff as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma. For that reason this patch invokes socket filter only for
rds socket with tcp transport e.g. rds_tcp.

note:
BTW, we don't want rds-core to be polluted by module-specific data
structures e.g. we included tcp.h to retrieve rds_tcp specific
structures. For non-RFC version we will add a way to get transport
specific indirections to get the skb.

Signed-off-by: Tushar Dave 
Reviewed-by: Shannon Nelson 
Reviewed-by: Sowmini Varadhan 
---
 net/rds/recv.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..3be9628 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -39,6 +39,7 @@
 #include 
 
 #include "rds.h"
+#include "tcp.h"
 
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
  __be32 saddr)
@@ -369,6 +370,22 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	if (rs->rs_transport->t_type == RDS_TRANS_TCP) {
+		struct sk_buff *skb;
+		struct sk_filter *filter = sk->sk_filter;
+		struct rds_tcp_incoming *tinc;
+
+		tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+		skb = tinc->ti_skb_list.next;
+		rcu_read_lock();
+		filter = rcu_dereference(sk->sk_filter);
+		if (filter) {
+			bpf_compute_data_pointers(skb);
+			bpf_prog_run_save_cb(filter->prog, skb);
+		}
+		rcu_read_unlock();
+	}
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
-- 
1.8.3.1



[RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter

2018-06-08 Thread Tushar Dave
Today socket filter only deals with linear skbs. This change allows
ebpf programs to look into non-linear skb e.g. skb frags. This will be
useful when users need to look into data which is not contained in the
linear part of skb.

Signed-off-by: Tushar Dave 
Reviewed-by: Shannon Nelson 
Reviewed-by: Sowmini Varadhan 
---
 include/linux/filter.h|  2 ++
 include/uapi/linux/bpf.h  | 10 ++-
 net/core/filter.c | 44 +--
 tools/include/uapi/linux/bpf.h| 10 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 5 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9dbcb9d..603b8bf 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -500,6 +500,7 @@ struct sk_filter {
 
 struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb;
+   u8 index;
void *data_meta;
void *data_end;
 };
@@ -534,6 +535,7 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
BUILD_BUG_ON(sizeof(*cb) > FIELD_SIZEOF(struct sk_buff, cb));
cb->data_meta = skb->data - skb_metadata_len(skb);
cb->data_end  = skb->data + skb_headlen(skb);
+   cb->index = 0;
 }
 
 static inline u8 *bpf_skb_cb(struct sk_buff *skb)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333..5fe9668 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1902,6 +1902,13 @@ struct bpf_stack_build_id {
  * egress otherwise). This is the only flag supported for now.
  * Return
  * **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_next_skb_frag(struct sk_buff *skb)
+ * Description
+ * This helper allows users to look into non-linear part of skb
+ * e.g. skb frags.
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1976,7 +1983,8 @@ struct bpf_stack_build_id {
FN(fib_lookup), \
FN(sock_hash_update),   \
FN(msg_redirect_hash),  \
-   FN(sk_redirect_hash),
+   FN(sk_redirect_hash),   \
+   FN(next_skb_frag),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 51ea7dd..fd8e90f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3752,6 +3752,38 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_next_skb_frag, struct sk_buff *, skb)
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+	const skb_frag_t *frag;
+
+	if (skb->data_len == 0)
+		return -ENODATA;
+
+	if (cb->index == (u8)skb_shinfo(skb)->nr_frags)
+		return -ENODATA;
+
+	/* get the frag start and end address into data_meta and data_end
+	 * respectively so eBPF program can look into skb frag
+	 */
+	frag = &skb_shinfo(skb)->frags[cb->index];
+	cb->data_meta = page_address(skb_frag_page(frag)) +
+			frag->page_offset;
+	cb->data_end = cb->data_meta + skb_frag_size(frag);
+
+	/* update frag index */
+	cb->index++;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_next_skb_frag_proto = {
+	.func		= bpf_next_skb_frag,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -4415,6 +4447,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		return &bpf_get_socket_cookie_proto;
 	case BPF_FUNC_get_socket_uid:
 		return &bpf_get_socket_uid_proto;
+	case BPF_FUNC_next_skb_frag:
+		return &bpf_next_skb_frag_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -4698,10 +4732,16 @@ static bool sk_filter_is_valid_access(int off, int size,
  struct bpf_insn_access_aux *info)
 {
switch (off) {
-   case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data):
-   case bpf_ctx_range(struct __sk_buff, data_meta):
+   info->reg_type = PTR_TO_PACKET;
+   break;
case bpf_ctx_range(struct __sk_buff, data_end):
+   info->reg_type = PTR_TO_PACKET_END;
+   break;
+   case bpf_ctx_range(struct __sk_buff, data_meta):
+   info->reg_type = PTR_TO_PACKET;
+   break;
+   case bpf_ctx_range(struct __sk_buff, tc_classid):

[RFC PATCH 2/3] samples/bpf: add sample RDS program

2018-06-08 Thread Tushar Dave
When run in server mode, the sample RDS program opens a PF_RDS socket and
attaches an ebpf program to the RDS socket, which then uses the
bpf_next_skb_frag helper along with bpf tail calls to inspect skb linear and
non-linear data.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Run server:
[root@lab71 bpf]# ./rds_skb -s 192.168.3.71
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 52287
payload contains:30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41
42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59
5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67 68 69 6a 6b ...
server listening on 192.168.3.71

Run client:
[root@lab70 bpf]# ./rds_skb -s 192.168.3.71 -c 192.168.3.70
transport tcp
client bound to address: 192.168.3.71 port 47437
client sending 8192 byte message from 192.168.3.71 to 192.168.3.70 on
port 47437

bpf program output:
[root@lab71]# cat /sys/kernel/debug/tracing/trace_pipe
  -0 [000] ..s. 218923.839673: 0: 30 31 32
  -0 [000] ..s. 218923.839682: 0: 33 34 35
  -0 [000] ..s. 218923.845133: 0: be bf c0
  -0 [000] ..s. 218923.845135: 0: c1 c2 c3
  -0 [000] ..s. 218923.850581: 0: be bf c0
  -0 [000] ..s. 218923.850582: 0: c1 c2 c3
  -0 [000] ..s. 218923.850582: 0: no more skb frag

Note: changing the MTU to 9000 helps ensure that RDS gets skbs with
fragments.

Signed-off-by: Tushar Dave 
Reviewed-by: Shannon Nelson 
Reviewed-by: Sowmini Varadhan 
---
 samples/bpf/Makefile   |   3 +
 samples/bpf/rds_skb_kern.c |  87 +
 samples/bpf/rds_skb_user.c | 311 +
 3 files changed, 401 insertions(+)
 create mode 100644 samples/bpf/rds_skb_kern.c
 create mode 100644 samples/bpf/rds_skb_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 62a99ab..a05c3b2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -51,6 +51,7 @@ hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
+hostprogs-y += rds_skb
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -105,6 +106,7 @@ cpustat-objs := bpf_load.o cpustat_user.o
 xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
+rds_skb-objs := bpf_load.o rds_skb_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -160,6 +162,7 @@ always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
+always += rds_skb_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_skb_kern.c b/samples/bpf/rds_skb_kern.c
new file mode 100644
index 0000000..c8832d4
--- /dev/null
+++ b/samples/bpf/rds_skb_kern.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+
+#define PROG(F) SEC("socket/"__stringify(F)) int bpf_func_##F
+
+#define bpf_printk(fmt, ...)   \
+({ \
+   char ____fmt[] = fmt;   \
+   bpf_trace_printk(____fmt, sizeof(____fmt),  \
+   ##__VA_ARGS__); \
+})
+
+
+struct bpf_map_def SEC("maps") jmp_table = {
+   .type = BPF_MAP_TYPE_PROG_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(u32),
+   .max_entries = 2,
+};
+
+#define FRAG 1
+
+static inline void dump_skb(struct __sk_buff *skb)
+{
+   void *data = (void *)(long) skb->data_meta;
+   void *data_end = (void *)(long) skb->data_end;
+   unsigned char *d;
+
+   if (data + 6 > data_end)
+   return;
+
+   d = (unsigned char *)data;
+   bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+   bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+   return;
+}
+
+static void populate_skb_frags(struct __sk_buff *skb)
+{
+   int ret;
+
+   ret = bpf_next_skb_frag(skb);
+   if (ret == -ENODATA) {
+   bpf_printk("no more skb frag\n");
+   return;
+   }
+
+   bpf_tail_call(skb, &jmp_table, 1);
+}
+
+/* walk skb frag */
+
+PROG(FRAG)(struct __sk_buff *skb)
+{
+   dump_skb(skb);
+   populate_skb_frags(skb);
+   return 0;
+}
+
+SEC("socket/0")
+int main_prog(struct __sk_buff *skb)
+{
+   void *data = (void *)(long) skb->data;
+   void *data_end = (void *)(long) skb->data_end;
+   int ret;
+   unsigned char *d;
+
+   if (data + 6 > data_end) {
+   bpf_printk("out\n");
+   return 0;
+   }
+
+   d = (unsigned char *)data;
+   bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+   bpf_printk("%x %x %x\n", d[3], d[4], 

[RFC PATCH 0/3] BPF socket filter to deal with skb frags

2018-06-08 Thread Tushar Dave
This RFC allows bpf socket filter programs to look into the complete skb,
i.e. both the linear and non-linear parts of the skb. (patch1)

For a proof of concept I'm using an RDS sample program that uses a bpf socket
filter and inspects skb packet data from the linear and non-linear parts, e.g.
skb frags. (patch 2 and 3)

I'm sharing this RFC to get some feedback on direction.

Details:
patch1 adds a new bpf helper function and the needed infrastructure so that
a socket (sk) filter based eBPF program can retrieve the non-linear part of
an skb (e.g. skb frags), unlike the current socket filter, which only deals
with linear skbs. This patch adds very basic functionality and for now allows
socket filter programs to only read packet data (from the linear and
non-linear parts of the skb). The idea behind this patch is to add an eBPF
helper that allows a socket filter based ebpf program to walk through the
skb frags using bpf tail calls. This way an ebpf program can do deep packet
inspection (i.e. it can look into headers as well as payload).

patch2 adds a sample ebpf socket filter program that uses an rds socket. The
sample program opens an rds socket, attaches an ebpf program to the rds socket
and uses the bpf helper added in patch 1 to look into the skb. For a test, the
current ebpf program only prints the first few bytes from skb->data and the skb
frags.

patch3 allows rds_recv_incoming to invoke bpf socket filter program if
any program is attached to rds socket.
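
The frag walk that patch1 describes (each helper call moves a data window to the next frag and returns -ENODATA when none are left) can be modelled in userspace. This is a sketch under assumptions, not the helper itself: walk_ctx, walk_frag and toy_next_frag() are hypothetical names, and the loop stands in for what the sample program does with tail calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct walk_frag {
	const unsigned char *data;
	size_t len;
};

/* Toy context: a cursor plus the window the program may read. */
struct walk_ctx {
	const struct walk_frag *frags;
	size_t nr_frags;
	size_t index;			/* next frag to expose */
	const unsigned char *data;	/* start of current window */
	const unsigned char *data_end;	/* end of current window */
};

/* Point the window at the next frag, or report that none are left. */
int toy_next_frag(struct walk_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->index == ctx->nr_frags)
		return -ENODATA;

	ctx->data = ctx->frags[ctx->index].data;
	ctx->data_end = ctx->data + ctx->frags[ctx->index].len;
	ctx->index++;
	return 0;
}

/* Visit every remaining frag and add up the bytes made visible. */
size_t toy_count_bytes(struct walk_ctx *ctx)
{
	size_t total = 0;

	while (toy_next_frag(ctx) == 0)
		total += (size_t)(ctx->data_end - ctx->data);
	return total;
}
```

Because the cursor lives in the context, a caller that stops early can resume the walk later from the same position.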


FYI, I'm also working on a follow-up patchset that deals with *struct
scatterlist* to allow RDS filtering for IB/RDMA use cases that do not
have an sk_buff.

Thanks.
-Tushar

Tushar Dave (3):
  ebpf: add next_skb_frag bpf helper for sk filter
  samples/bpf: add sample RDS program
  rds: invoke sk filter attached to rds socket

 include/linux/filter.h|   2 +
 include/uapi/linux/bpf.h  |  10 +-
 net/core/filter.c |  44 -
 net/rds/recv.c|  17 ++
 samples/bpf/Makefile  |   3 +
 samples/bpf/rds_skb_kern.c|  87 +
 samples/bpf/rds_skb_user.c| 311 ++
 tools/include/uapi/linux/bpf.h|  10 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 9 files changed, 482 insertions(+), 4 deletions(-)
 create mode 100644 samples/bpf/rds_skb_kern.c
 create mode 100644 samples/bpf/rds_skb_user.c

-- 
1.8.3.1



Re: netdevice notifier and device private data

2018-06-08 Thread Michael Richardson

Alexander Aring  wrote:
Alex> I already see code outside who changed tun netdevice to the
Alex> ARPHRD_6LOWPAN type and I suppose they running into this
Alex> issue.  (Btw: I don't know why somebody wants to changed that
Alex> type to ARPHRD_6LOWPAN on tun).

so that they can have the kernel do 6lowpan processing, emitting 6lowPAN
packets into userspace to be transferred into a radio via some proprietary
interface (including, for instance, SLIP over USB cable to a Contiki or
OpenWSN stack set up to act as radio only)

-- 
]   Never tell me the odds! | ipv6 mesh networks [ 
]   Michael Richardson, Sandelman Software Works| network architect  [ 
] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[ 





Re: netdevice notifier and device private data

2018-06-08 Thread Alexander Aring
Hi Stephen,

On Fri, Jun 08, 2018 at 11:14:57AM -0700, Stephen Hemminger wrote:
...
> 
> notifiers are always called with RTNL mutex held
> and dev->type should not change unless RTNL is held.

thanks for your answer. I am not talking about any race between notifiers
and a dev->type change.

I am saying that dev->type was already changed and an upcoming notifier ends
in undefined behaviour when it dereferences dev->priv. I have some notifier
which maps dev->type to a cast of dev->priv to a specific structure. This
structure is not there in tap/tun devices if they changed to "my" dev->type
and the notifier occurs.

- Alex
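
The hazard described above, trusting a type field that can be rewritten at runtime before casting the private area, can be shown with a toy device model. This is one possible guard and not a statement of what the 6lowpan code does: toy_netdev, toy_ops and lowpan_priv_checked() are hypothetical, and the type constant is an illustrative value. The assumption is that the creating driver installs an ops pointer that a later type change does not alter.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ops {
	int id;
};

struct toy_netdev {
	unsigned short type;		/* may be changed at runtime */
	const struct toy_ops *ops;	/* set once by the creating driver */
	void *priv;			/* layout depends on the creator */
};

#define TOY_TYPE_LOWPAN 825	/* illustrative value only */

struct lowpan_priv {
	int magic;
};

const struct toy_ops lowpan_ops = { 1 };
const struct toy_ops tun_ops = { 2 };

/*
 * Hand out the private area only when the device was really created by
 * "our" driver. Matching on type alone is not enough if another driver
 * lets userspace switch its device to the same type.
 */
struct lowpan_priv *lowpan_priv_checked(const struct toy_netdev *dev)
{
	assert(dev != NULL);

	if (dev->type != TOY_TYPE_LOWPAN)
		return NULL;
	if (dev->ops != &lowpan_ops)
		return NULL;
	return dev->priv;
}
```

A notifier built this way ignores a tun/tap device that merely carries the same type value instead of misinterpreting its private data.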


Re: Qualcomm rmnet driver and qmi_wwan

2018-06-08 Thread Bjørn Mork
Subash Abhinov Kasiviswanathan  writes:

>> I followed Dan's advice and prepared a very basic test patch
>> (attached) for testing it through ip link.
>>
>> Basically things seem to be properly working with qmicli, but I needed
>> to modify a bit qmi_wwan, so I'm adding Bjørn that maybe can help.
>>
>> Bjørn,
>>
>> I'm trying to add support to rmnet in qmi_wwan: I had to modify the
>> code as in the attached test patch, but I'm not sure it is the right
>> way.
>>
>> This is done under the assumption that the rmnet device would be the
>> only one to register an rx handler to qmi_wwan, but it is probably
>> wrong.
>>
>> Basically I'm wondering if there is a more correct way to understand
>> if an rmnet device is linked to the real qmi_wwan device.
>>
>> Thanks,
>> Daniele
>
>
> Hi Daniele / Bjørn
>
> Is it possible to define a pass through mode in qmi_wwan. This is to
> ensure that all packets in MAP format are passed through instead of
> processing in qmi_wwan layer. The pass through mode would just call
> netif_receive_skb() on all these packets.
>
> That would allow all the packets to be intercepted by the rx_handler
> attached by rmnet which would subsequently de-multiplex and process
> the packets.

This sounds like a good idea. I probably won't have any time to look at
this in the near future, though.  Sorry about that. Extremely overloaded
both at work and private right now...

But I trust that you and Daniele can work out something. Please keep me
CCed, but don't expect timely replies.


Bjørn


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Michael S. Tsirkin
On Fri, Jun 08, 2018 at 11:30:08AM -0700, Stephen Hemminger wrote:
>   * what about nested KVM on Hyper-V? Would it make sense to
> have a way to pass subset of VF queues to guest?

No as long as hyper-v doesn't have a vIOMMU.

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Thu, 7 Jun 2018 20:22:15 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 18:41:31 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > > vfio-pci on the netdevs it sees?
> > > > 
> > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel
> > > > > based.
> > > > > The DPDK support uses:
> > > > >  * Mellanox MLX5 which uses the InfiniBand hooks to do DMA directly to
> > > > >userspace. This means VF netdev device must exist and be visible.
> > > > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > > > >path packets to userspace.
> > > > >  * An autodiscovery mechanism to set all this up that relies on
> > > > >2 device model and sysfs.
> > > 
> > > Could you describe what does it look for exactly? What will break if
> > > instead of MLX5 being a child of the PV, it's a child of the failover
> > > device?  
> > 
> > So in DPDK there is an internal four device model:
> > 1. failsafe is like failover in your model
> > 2. TAP is used like netvsc in kernel
> > 3. MLX5 is the VF
> > 4. vdev_netvsc is a pseudo device whose only reason to exist
> >is to glue everything together.
> > 
> > Digging deeper inside...
> > 
> > Vdev_netvsc does:
> > >* driver is started in a convoluted way off device arguments
> >* probe routine for driver runs
> >   - scans list of kernel interfaces in sysfs
> >   - matches those using VMBUS   
> 
> Could you tell a bit more what does this step entail?

Quick code high/low lights.


ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, 1, name,
kvargs, specified, );
static int
vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
  const struct ether_addr *eth_addr,
  va_list ap), int is_netvsc, ...)
{
struct if_nameindex *iface = if_nameindex();


for (i = 0; iface[i].if_name; ++i) {

is_netvsc_ret = vdev_netvsc_iface_is_netvsc(&iface[i]) ? 1 : 0;
if (is_netvsc ^ is_netvsc_ret)
continue;

strlcpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
}

memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
   RTE_DIM(eth_addr.addr_bytes));

ret = func(&iface[i], &eth_addr, ap);  << func is vdev_netvsc_netvsc_probe


static int
vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
 const struct ether_addr *eth_addr,
 va_list ap)
{

/* Routed NetVSC should not be probed. */
if (vdev_netvsc_has_route(iface, AF_INET) ||
vdev_netvsc_has_route(iface, AF_INET6)) {
if (!specified)
return 0;
DRV_LOG(WARNING, "probably using routed NetVSC interface \"%s\""
" (index %u)", iface->if_name, iface->if_index);
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
...
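
The probe flow quoted above (walk the interfaces, skip netvsc devices that carry a route, then pair each remaining netvsc device with the PCI device that has the same MAC) can be condensed into a toy matcher. This is a sketch over an in-memory table, not the DPDK driver: toy_iface and toy_find_vf() are hypothetical names, and the route and MAC data would in reality come from the system rather than from a static array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_NETVSC, TOY_PCI_VF };

struct toy_iface {
	const char *name;
	enum toy_kind kind;
	unsigned char mac[6];
	bool has_route;		/* routed netvsc devices are skipped */
};

/*
 * For the netvsc interface at netvsc_idx, return the index of the PCI
 * VF with the same MAC address, or -1 if the interface is not an
 * unrouted netvsc device or has no matching VF.
 */
int toy_find_vf(const struct toy_iface *ifs, size_t n, size_t netvsc_idx)
{
	size_t i;

	assert(ifs != NULL && netvsc_idx < n);

	if (ifs[netvsc_idx].kind != TOY_NETVSC || ifs[netvsc_idx].has_route)
		return -1;

	for (i = 0; i < n; i++) {
		if (ifs[i].kind == TOY_PCI_VF &&
		    memcmp(ifs[i].mac, ifs[netvsc_idx].mac, 6) == 0)
			return (int)i;
	}
	return -1;
}
```

The pairing depends only on both devices being visible with the same MAC, which is why the discussion keeps returning to what is exposed in sysfs.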


> 
> >   - skip netvsc devices that have an IPV4 route
> >* scan for PCI devices that have same MAC address as kernel netvsc
> >  devices discovered in previous step
> >* add these interfaces to arguments to failsafe
> > 
> > Then failsafe configures based on arguments on device
> > 
> > The code works but is specific to the Azure hardware model, and exposes lots
> > of things to application that it should not have to care about.
> > 
> > If you try and walk through this code in DPDK, you will see why I have
> > developed a dislike for high levels of indirection.
> > 
> > 
> >  
> 
> Thanks that was helpful!  I'll try to poke at it next week.  Just from
> the description it seems the kernel is merely used to locate the MAC
> address through sysfs and that for this DPDK code to keep working the
> hidden device must be hidden from it in sysfs - is that a fair summary?

What is the point of the 3 device model? What value does it have
to userspace? How would userspace use each of the three devices?
Going back to the 3 device model really doesn't make sense to me if
there is no visible benefit.

Some other considerations:
   * there is ongoing development to support RDMA failover as
 well in netvsc.

   * there is a new driver which implements the VMBUS protocol
 in userspace for DPDK. This gets rid of several layers and
 removes any special scanning code. The vmbus device is
 unbound from netvsc and bound to UIO device.  Then the user
 space DPDK driver manages all the host signalling events
 including VF discovery. It is really 2 device model done
 all in userspace. The kernel device is still needed when
 

Re: netdevice notifier and device private data

2018-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2018 13:34:55 -0400
Alexander Aring  wrote:

> Hey netdev community,
> 
> I am trying to solve an issue which Eric Dumazet pointed out to me with
> commit ca0edb131bdf ("ieee802154: 6lowpan: fix possible NULL deref in
> lowpan_device_event()").
> 
> The issue is that dev->type can be changed during runtime. We don't have
> any problems with the netdevice notifier which Eric Dumazet fixed. I am
> bothered by another netdevice notifier which is broken because of the same
> tun/tap feature, and I don't have any dev->$SUBSYSTEM_DEV_POINTER to check
> if this is my netdevice type.
> 
> This netdevice notifier will access the dev->priv area which is only
> available for the dev->type which was allocated and initialized with the
> right dev->priv room. If a tap/tun netdevice changes its dev->type I
> might have an illegal read of netdev->priv and I can't confirm that it
> has the data which I cast to it. The reason for that is that tap/tun
> netdevices don't run my netdevice init.
> 
> I have already seen code out there which changes a tun netdevice to the
> ARPHRD_6LOWPAN type and I suppose it is running into this issue.
> (Btw: I don't know why somebody wants to change that type to
> ARPHRD_6LOWPAN on tun).
> 
> My question is:
> 
> How do we deal with that? Is it forbidden to access dev->priv from a
> global netdevice notifier which only checks for dev->type?
> 
> I could solve it like Eric Dumazet and introduce a special
> dev->$SUBSYSTEM_DEV_POINTER and check whether it is set. At least tun/tap
> will not set these pointers, so I can be sure the netdevice went
> through my init function. This seems to me the best solution right now and
> I think I will go for it.
> 
> I assumed before that the data of dev->priv is bound to dev->type.
> This tun/tap feature will break at least my handling and I am not sure
> if there are other users which use dev->priv in a netdevice notifier
> and don't check dev->$SUBSYSTEM_DEV_POINTER if they have one.
> 
> Thanks to everybody in advance for helping to solve this issue.
> 
> - Alex

Notifiers are always called with the RTNL mutex held,
and dev->type should not change unless RTNL is held.


Re: [PATCH bpf] bpf: implement dummy fops for bpf objects

2018-06-08 Thread Alexei Starovoitov
On Fri, Jun 08, 2018 at 06:10:34PM +0200, Daniel Borkmann wrote:
> syzkaller was able to trigger the following warning in
> do_dentry_open():
> 
>   WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778
>   Kernel panic - not syncing: panic_on_warn set ...
> 
>   CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   Call Trace:
>   [...]
>vfs_open+0x139/0x230 fs/open.c:908
>do_last fs/namei.c:3370 [inline]
>path_openat+0x1717/0x4dc0 fs/namei.c:3511
>do_filp_open+0x249/0x350 fs/namei.c:3545
>do_sys_open+0x56f/0x740 fs/open.c:1101
>__do_sys_openat fs/open.c:1128 [inline]
>__se_sys_openat fs/open.c:1122 [inline]
>__x64_sys_openat+0x9d/0x100 fs/open.c:1122
>do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Problem was that prog and map inodes in bpf fs did not
> implement a dummy file open operation that would return an
> error. The patch in do_dentry_open() checks whether f_ops
> are present and if not bails out with an error. While this
> may be fine, we really shouldn't be throwing a warning
> though. Thus follow the model similar to bad_file_ops and
> reject the request unconditionally with -EIO.
> 
> Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
> Reported-by: syzbot+2e7fcab0f56fdbb33...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 

Applied, Thanks


[PATCH v2 1/1] iproute2: Add support for a few routing protocols

2018-06-08 Thread Donald Sharp
Add support for:

BGP
ISIS
OSPF
RIP
EIGRP

Routing protocols to iproute2.

Signed-off-by: Donald Sharp 
---
v2: Update to latest version of code.
 etc/iproute2/rt_protos | 5 +
 lib/rt_names.c | 5 +
 2 files changed, 10 insertions(+)

diff --git a/etc/iproute2/rt_protos b/etc/iproute2/rt_protos
index 2a9ee01b..b3a0ec8f 100644
--- a/etc/iproute2/rt_protos
+++ b/etc/iproute2/rt_protos
@@ -16,3 +16,8 @@
 15 ntk
 16  dhcp
 42 babel
+186	bgp
+187	isis
+188	ospf
+189	rip
+192	eigrp
diff --git a/lib/rt_names.c b/lib/rt_names.c
index a02db35e..66d5f2f0 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -134,6 +134,11 @@ static char *rtnl_rtprot_tab[256] = {
[RTPROT_XORP] = "xorp",
[RTPROT_NTK]  = "ntk",
[RTPROT_DHCP] = "dhcp",
+   [RTPROT_BGP]  = "bgp",
+   [RTPROT_ISIS] = "isis",
+   [RTPROT_OSPF] = "ospf",
+   [RTPROT_RIP]  = "rip",
+   [RTPROT_EIGRP]= "eigrp",
 };
 
 
-- 
2.14.4



[PATCH v2 0/1] Addition of new routing protocols for iproute2

2018-06-08 Thread Donald Sharp
The linux kernel recently accepted some new RTPROT values for some
fairly standard routing protocols.  This commit brings in support
for iproute2 to handle these new values.

v2 - Update to latest version of master which has rtnetlink.h code and drop
 of work already done.

Donald Sharp (1):
  iproute2: Add support for a few routing protocols

 etc/iproute2/rt_protos | 5 +
 lib/rt_names.c | 5 +
 2 files changed, 10 insertions(+)

-- 
2.14.4



netdevice notifier and device private data

2018-06-08 Thread Alexander Aring
Hey netdev community,

I am trying to solve an issue which Eric Dumazet pointed out to me with
commit ca0edb131bdf ("ieee802154: 6lowpan: fix possible NULL deref in
lowpan_device_event()").

The issue is that dev->type can be changed during runtime. We don't have
any problems with the netdevice notifier which Eric Dumazet fixed. I am
bothered by another netdevice notifier which is broken because of the same
tun/tap feature, and I don't have any dev->$SUBSYSTEM_DEV_POINTER to check
if this is my netdevice type.

This netdevice notifier will access the dev->priv area which is only
available for the dev->type which was allocated and initialized with the
right dev->priv room. If a tap/tun netdevice changes its dev->type I
might have an illegal read of netdev->priv and I can't confirm that it
has the data which I cast to it. The reason for that is that tap/tun
netdevices don't run my netdevice init.

I have already seen code out there which changes a tun netdevice to the
ARPHRD_6LOWPAN type and I suppose it is running into this issue.
(Btw: I don't know why somebody wants to change that type to
ARPHRD_6LOWPAN on tun).

My question is:

How do we deal with that? Is it forbidden to access dev->priv from a
global netdevice notifier which only checks for dev->type?

I could solve it like Eric Dumazet and introduce a special
dev->$SUBSYSTEM_DEV_POINTER and check whether it is set. At least tun/tap
will not set these pointers, so I can be sure the netdevice went
through my init function. This seems to me the best solution right now and
I think I will go for it.
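
The check proposed above can be sketched in plain userspace C. This is only
an illustration under stated assumptions: struct fake_netdev, struct
lowpan_priv, the lowpan_ptr field and is_my_netdev() are hypothetical
stand-ins, not kernel structures or APIs. Only the numeric value of
ARPHRD_6LOWPAN is taken from the kernel headers.

```c
#include <stddef.h>

/* Value of ARPHRD_6LOWPAN as defined in the kernel's if_arp.h. */
#define ARPHRD_6LOWPAN 825

/* Hypothetical private data; stands in for whatever dev->priv holds. */
struct lowpan_priv {
	int magic;
};

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct fake_netdev {
	unsigned short type;            /* may be rewritten at runtime by tun/tap */
	struct lowpan_priv *lowpan_ptr; /* set only by the subsystem's own init */
};

/*
 * Returns 1 only when the device was initialized by the subsystem itself.
 * Matching on type alone is not enough, because another driver may have
 * changed type without ever allocating the private area.
 */
int is_my_netdev(const struct fake_netdev *dev)
{
	if (dev->type != ARPHRD_6LOWPAN)
		return 0;
	return dev->lowpan_ptr != NULL;
}
```

A notifier built this way would return early whenever is_my_netdev() yields
0, and would touch the private data only after both conditions hold.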

I assumed before that the data of dev->priv is bound to dev->type.
This tun/tap feature will break at least my handling and I am not sure
if there are other users which use dev->priv in a netdevice notifier
and don't check dev->$SUBSYSTEM_DEV_POINTER if they have one.

Thanks to everybody in advance for helping to solve this issue.

- Alex


Re: [PATCH 1/2] iproute2: Add support for a few routing protocols

2018-06-08 Thread Stephen Hemminger
On Fri,  8 Jun 2018 08:46:37 -0400
Donald Sharp  wrote:

> Add support for:
> 
> BGP
> ISIS
> OSPF
> RIP
> EIGRP
> 
> Routing protocols to iproute2.
> 
> Signed-off-by: Donald Sharp 
> ---
>  etc/iproute2/rt_protos| 5 +
>  include/linux/rtnetlink.h | 5 +
>  lib/rt_names.c| 5 +
>  3 files changed, 15 insertions(+)
> 

I just merged iproute2-next into iproute2 and rtnetlink.h is now up to date.
Please rebase your patches.



Re: Qualcomm rmnet driver and qmi_wwan

2018-06-08 Thread Subash Abhinov Kasiviswanathan

I followed Dan's advice and prepared a very basic test patch
(attached) for testing it through ip link.

Basically things seem to be properly working with qmicli, but I needed
to modify qmi_wwan a bit, so I'm adding Bjørn, who may be able to help.

Bjørn,

I'm trying to add support to rmnet in qmi_wwan: I had to modify the
code as in the attached test patch, but I'm not sure it is the right
way.

This is done under the assumption that the rmnet device would be the
only one to register an rx handler to qmi_wwan, but it is probably
wrong.

Basically I'm wondering if there is a more correct way to understand
if an rmnet device is linked to the real qmi_wwan device.

Thanks,
Daniele



Hi Daniele / Bjørn

Is it possible to define a pass-through mode in qmi_wwan? This is to
ensure that all packets in MAP format are passed through instead of
being processed in the qmi_wwan layer. The pass-through mode would just
call netif_receive_skb() on all these packets.

That would allow all the packets to be intercepted by the rx_handler
attached by rmnet which would subsequently de-multiplex and process
the packets.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH iproute2 v2 0/2] display netns name instead of nsid

2018-06-08 Thread Stephen Hemminger
On Tue,  5 Jun 2018 15:08:29 +0200
Nicolas Dichtel  wrote:

> After these patches, the iproute2 name of netns is displayed instead of
> the nsid. It's easier to read/understand.
> 
> v1 -> v2:
>  - open netns socket and init netns map only when needed
> 
>  ip/ip_common.h |  3 +++
>  ip/ipaddress.c | 20 +++-
>  ip/iplink.c| 18 --
>  ip/ipnetns.c   | 22 --
>  4 files changed, 54 insertions(+), 9 deletions(-)
>  
> Comments are welcomed, 
> Regards, 
> Nicolas

Applied


Re: [PATCH 2/2] iproute2: Remove leftover gated RT_PROT defines

2018-06-08 Thread Stephen Hemminger
On Fri,  8 Jun 2018 08:46:38 -0400
Donald Sharp  wrote:

> These values are not being used nor maintained, so remove.
> 
> Signed-off-by: Donald Sharp 
> ---
>  etc/iproute2/rt_protos | 13 -
>  1 file changed, 13 deletions(-)
> 
> diff --git a/etc/iproute2/rt_protos b/etc/iproute2/rt_protos
> index 3ffe8a6c..a965ad16 100644
> --- a/etc/iproute2/rt_protos
> +++ b/etc/iproute2/rt_protos
> @@ -21,16 +21,3 @@
>  188 ospf
>  189 rip
>  192 eigrp
> -
> -#
> -#Used by me for gated
> -#
> -254  gated/aggr
> -253  gated/bgp
> -252  gated/ospf
> -251  gated/ospfase
> -250  gated/rip
> -249  gated/static
> -248  gated/conn
> -247  gated/inet
> -246  gated/default

I already dropped these


Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"

2018-06-08 Thread Stephen Hemminger



Begin forwarded message:

Date: Fri, 08 Jun 2018 16:06:40 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"


https://bugzilla.kernel.org/show_bug.cgi?id=199995

            Bug ID: 199995
           Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
                    "brodcast"
           Product: Networking
           Version: 2.5
    Kernel Version: since 4.15.0
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
          Assignee: step...@networkplumber.org
          Reporter: l.ben...@portunity.de
        Regression: No

Created attachment 276401
  --> https://bugzilla.kernel.org/attachment.cgi?id=276401&action=edit
TCP Dump

Hi,

after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04 (Kernel
4.15.0) I suffer from randomly generated TCP RST packets sent (presumably) by
the kernel on a bonding device that uses bonding mode "broadcast" with 2
physical NICs.

With tcpdump/Wireshark I can see that the kernel randomly sends TCP RST
packets after the SYN/ACK/ACK packet is received (see attached PCAP).
This only happens if the kernel receives the initial SYN packet on both
physical NICs (and therefore sees it twice) before the connection is
established by sending SYN/ACK.
It does not happen in 100% of cases, and only if the system can use two or
more CPU cores/threads. With only one CPU available to the system, this
behaviour is not reproducible.


I can reproduce this on multiple physical servers with 2 bonded Intel NICs
connected over 2 separate switches and with virtual machines on a KVM host
using 2 dedicated host bridges.
This also happens with a freshly installed Ubuntu 18.04 and Fedora 28 (kernel
4.16), so I decided to compile and boot with kernel 4.17.0 on Ubuntu, getting
the same result.
Only disabling/blocking the second network connection or reducing the number of
CPU cores of the VM to one solves the problem, so I think this could be a
race condition on systems with more than one CPU core and thread.

For my tests I used a very basic Ubuntu 18.04 (x86-64) running the xinetd
tcp-echo service (port 7/TCP).
On the client I used the netcat-traditional package with the following command:

  while true; do echo $(date) | nc.traditional -q 1 ECHO-SERVER 7; sleep 0.1 ; done


This gives the following output:

---
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
Fr 8. Jun 09:12:43 UTC 2018
(UNKNOWN) [192.168.86.101] 7 (echo) : Connection reset by peer
(UNKNOWN) [192.168.86.101] 7 (echo) : Connection reset by peer
(UNKNOWN) [192.168.86.101] 7 (echo) : Connection reset by peer
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
Fr 8. Jun 09:12:44 UTC 2018
(UNKNOWN) [192.168.86.101] 7 (echo) : Connection reset by peer
(UNKNOWN) [192.168.86.101] 7 (echo) : Connection reset by peer
Fr 8. Jun 09:12:44 UTC 2018
---

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: [PATCH net] kcm: fix races on sk_receive_queue

2018-06-08 Thread Paolo Abeni
On Fri, 2018-06-08 at 10:53 -0400, David Miller wrote:
> From: Paolo Abeni 
> Date: Wed,  6 Jun 2018 15:16:29 +0200
> 
> > @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> >  
> >   lock_sock(sk);
> >  
> > - skb = kcm_wait_data(sk, flags, timeo, );
> > + skb = kcm_wait_data(sk, flags, peek, timeo, );
> >   if (!skb)
> >   goto out;
> >  
> 
> Because kcm_wait_data() potentially unlinks now, you will have to kfree the
> SKB in the error paths, for example if skb_copy_datagram_msg() fails.
> 
> Otherwise we have an SKB leak.

Right. But now I fear the fix should be different: if we drop the skb
on skb_copy_datagram_msg() error, that will cause a behavior change. I
need to think more for a proper fix.

Thank you for the feedback.

Paolo



[PATCH net] KEYS: DNS: fix parsing multiple options

2018-06-08 Thread Eric Biggers
From: Eric Biggers 

My recent fix for dns_resolver_preparse() printing very long strings was
incomplete, as shown by syzbot which still managed to hit the
WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:

precision 50001 too large
WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0

The bug this time isn't just a printing bug, but also a logical error
when multiple options ("#"-separated strings) are given in the key
payload.  Specifically, when separating an option string into name and
value, if there is no value then the name is incorrectly considered to
end at the end of the key payload, rather than the end of the current
option.  This bypasses validation of the option length, and also means
that specifying multiple options is broken -- which presumably has gone
unnoticed as there is currently only one valid option anyway.

Fix it by correctly calculating the length of the option name.
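
The name/value split described above can be illustrated with a small
standalone sketch. It is not the kernel function: option_name_len() and its
parameters are hypothetical names chosen for this example, and only the
memchr() fallback mirrors the idea of the fix, namely that a missing '='
must end the name at the end of the current option rather than at the end
of the payload.

```c
#include <stddef.h>
#include <string.h>

/*
 * Length of the option name within [opt, next_opt).
 * If the option contains no '=', the name runs to the end of this
 * option (next_opt), never to the end of the whole payload.
 */
size_t option_name_len(const char *opt, const char *next_opt)
{
	size_t opt_len = (size_t)(next_opt - opt);
	const char *eq = memchr(opt, '=', opt_len);

	if (!eq)
		eq = next_opt; /* no value: name is the whole option */
	return (size_t)(eq - opt);
}
```

With a payload such as "A#name=1", the first option "A" has a name length of
1 and the second option has a name length of 4, regardless of how much data
follows in the payload.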

Reproducer:

perl -e 'print "#A#", "\x00" x 5' | keyctl padd dns_resolver desc @s

Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers 
---
 net/dns_resolver/dns_key.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
index 40c851693f77e..d448823d4d2ed 100644
--- a/net/dns_resolver/dns_key.c
+++ b/net/dns_resolver/dns_key.c
@@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
return -EINVAL;
}
 
-   eq = memchr(opt, '=', opt_len) ?: end;
+   eq = memchr(opt, '=', opt_len) ?: next_opt;
opt_nlen = eq - opt;
eq++;
opt_vlen = next_opt - eq; /* will be -1 if no value */
-- 
2.18.0.rc1.242.g61856ae69a-goog



[PATCH bpf] bpf: implement dummy fops for bpf objects

2018-06-08 Thread Daniel Borkmann
syzkaller was able to trigger the following warning in
do_dentry_open():

  WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
  [...]
   vfs_open+0x139/0x230 fs/open.c:908
   do_last fs/namei.c:3370 [inline]
   path_openat+0x1717/0x4dc0 fs/namei.c:3511
   do_filp_open+0x249/0x350 fs/namei.c:3545
   do_sys_open+0x56f/0x740 fs/open.c:1101
   __do_sys_openat fs/open.c:1128 [inline]
   __se_sys_openat fs/open.c:1122 [inline]
   __x64_sys_openat+0x9d/0x100 fs/open.c:1122
   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Problem was that prog and map inodes in bpf fs did not
implement a dummy file open operation that would return an
error. The patch in do_dentry_open() checks whether f_ops
are present and if not bails out with an error. While this
may be fine, we really shouldn't be throwing a warning
though. Thus follow the model similar to bad_file_ops and
reject the request unconditionally with -EIO.
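
As a rough userspace illustration of the approach described above, the
sketch below installs an operations table whose open hook always fails,
instead of leaving the table pointer NULL. struct mini_fops,
obj_open_reject(), obj_fops and try_open() are hypothetical names for this
example only, and the -ENODEV branch merely stands in for the "missing
operations" path that triggered the warning.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature of a file-operations table. */
struct mini_fops {
	int (*open)(void *inode, void *file);
};

/* Dummy open hook: always rejects, so the object is never opened as a file. */
int obj_open_reject(void *inode, void *file)
{
	(void)inode;
	(void)file;
	return -EIO;
}

/* Table installed for objects that must not be opened. */
const struct mini_fops obj_fops = { .open = obj_open_reject };

/*
 * Caller that treats a missing table as an unexpected condition and an
 * explicit error from the hook as the normal rejection path.
 */
int try_open(const struct mini_fops *fops)
{
	if (!fops || !fops->open)
		return -ENODEV; /* stands in for the path that warned */
	return fops->open(NULL, NULL);
}
```

The design point is that the rejection becomes an ordinary, explicit error
return rather than something the caller has to infer from a NULL pointer.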

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Reported-by: syzbot+2e7fcab0f56fdbb33...@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/inode.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index ed13645..76efe9a 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -295,6 +295,15 @@ static const struct file_operations bpffs_map_fops = {
.release= bpffs_map_release,
 };
 
+static int bpffs_obj_open(struct inode *inode, struct file *file)
+{
+   return -EIO;
+}
+
+static const struct file_operations bpffs_obj_fops = {
+   .open   = bpffs_obj_open,
+};
+
 static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw,
 const struct inode_operations *iops,
 const struct file_operations *fops)
@@ -314,7 +323,8 @@ static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw,
 
 static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg)
 {
-   return bpf_mkobj_ops(dentry, mode, arg, &bpf_prog_iops, NULL);
+   return bpf_mkobj_ops(dentry, mode, arg, &bpf_prog_iops,
+&bpffs_obj_fops);
 }
 
 static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
@@ -322,7 +332,7 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
struct bpf_map *map = arg;
 
return bpf_mkobj_ops(dentry, mode, arg, &bpf_map_iops,
-map->btf ? &bpffs_map_fops : NULL);
+map->btf ? &bpffs_map_fops : &bpffs_obj_fops);
 }
 
 static struct dentry *
-- 
2.9.5



Re: [PATCH] net: phy: Add TJA1100 BroadR-Reach PHY driver.

2018-06-08 Thread Andrew Lunn
On Fri, Jun 08, 2018 at 05:45:32PM +0300, Kirill Kranke wrote:
> Current generic PHY driver does not work with TJA1100 BroadR-REACH PHY
> properly. TJA1100 does not have any standard ability enabled at MII_BMSR
> register. Instead it has BroadR-REACH ability at MII_ESTATUS enabled, which
> is not handled by generic driver yet. Therefore generic driver is unable to
> guess required link speed, duplex etc. Device is started up with 10Mbps
> halfduplex which is incorrect.
> 
> BroadR-REACH able flag is not specified in IEEE802.3-2015. Which is why I
> did not add BroadR-REACH able flag support at generic driver. Once
> BroadR-REACH able flag gets into IEEE802.3 it should be reasonable to
> support it in the generic PHY driver.

Hi Kirill

Thanks for making the changes.

It is normal to put 'v2' after PATCH in the subject line. Also, make a
brief list of changes since the previous version, after a line with
---. They will get removed when the patch is committed, but help
reviewers see what has changed.

For network patches, you should also include which tree these patches
are for. net-next in this case. See the networking FAQ.

> 
> Signed-off-by: Kirill Kranke 
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 343989f..7014eb7 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -422,6 +422,14 @@ config TERANETICS_PHY
>   ---help---
> Currently supports the Teranetics TN2020
>  
> +config TJA1100_PHY
> + tristate "NXP TJA1100 PHY"

Please call this NXP_TJA1100_PHY. Putting the vendor first is the
general pattern. There are a few TI drivers which ignore this, but
others follow it. This also means moving it up so it comes after
NATIONAL_PHY.

> + help
> +   Support of NXP TJA1100 BroadR-REACH ethernet PHY.
> +   Generic driver is not suitable for TJA1100 PHY while the PHY does not
> +   advertise any standard IEEE capabilities. It uses BroadR-REACH able
> +   flag instead. This driver configures capabilities of the PHY properly.
>

Does 100BASE-T1/clause 96 define a way to identify a PHY which
implements this? I'm just wondering if we can do this in the generic
code, for devices which correctly implement the standard?

> +
>  config VITESSE_PHY
>   tristate "Vitesse PHYs"
>   ---help---
> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
> index 5805c0b..4d2a69d 100644
> --- a/drivers/net/phy/Makefile
> +++ b/drivers/net/phy/Makefile
> @@ -83,5 +83,6 @@ obj-$(CONFIG_ROCKCHIP_PHY)  += rockchip.o
>  obj-$(CONFIG_SMSC_PHY)   += smsc.o
>  obj-$(CONFIG_STE10XP)+= ste10Xp.o
>  obj-$(CONFIG_TERANETICS_PHY) += teranetics.o
> +obj-$(CONFIG_TJA1100_PHY)+= tja1100.o
>  obj-$(CONFIG_VITESSE_PHY)+= vitesse.o
>  obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
> diff --git a/drivers/net/phy/tja1100.c b/drivers/net/phy/tja1100.c
> new file mode 100644
> index 000..cddf4d7
> --- /dev/null
> +++ b/drivers/net/phy/tja1100.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* tja1100.c: TJA1100 BroadR-REACH PHY driver.
> + *
> + * Copyright (c) 2017 Kirill Kranke 
> + * Author: Kirill Kranke 
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +static int tja1100_phy_config_init(struct phy_device *phydev)
> +{
> + phydev->autoneg = AUTONEG_DISABLE;
> + phydev->speed = SPEED_100;
> + phydev->duplex = DUPLEX_FULL;
> +
> + return 0;
> +}
> +
> +static int tja1100_phy_config_aneg(struct phy_device *phydev)
> +{
> + if (phydev->autoneg == AUTONEG_ENABLE) {
> + phydev_err(phydev, "autonegotiation is not supported\n");
> + return -EINVAL;
> + }
> +
> + if (phydev->speed != SPEED_100 || phydev->duplex != DUPLEX_FULL) {
> + phydev_err(phydev, "only 100MBps Full Duplex allowed\n");
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static struct phy_driver tja1100_phy_driver[] = {
> + {
> + .phy_id = 0x0180dc48,
> + .phy_id_mask = 0xfff0,
> + .name = "NXP TJA1100",
> +
> + /* TJA1100 has only 100BASE-BroadR-REACH ability specified
> +  * at MII_ESTATUS register. Standard modes are not
> +  * supported. Therefore BroadR-REACH allow only 100Mbps
> +  * full duplex without autoneg.
> +  */
> + .features = SUPPORTED_100baseT_Full | SUPPORTED_MII,

This is the second T1 driver we have had recently. It might make sense to add a
PHY_T1_FEATURES macro to include/linux/phy.h

Don't you also want SUPPORTED_TP?

Andrew


RE: [PATCH ethtool 2/6] ethtool: fix RING_VF assignment

2018-06-08 Thread Keller, Jacob E
> -Original Message-
> From: Ivan Vecera [mailto:c...@cera.cz]
> Sent: Friday, June 08, 2018 2:20 AM
> To: linvi...@tuxdriver.com
> Cc: netdev@vger.kernel.org; Keller, Jacob E 
> Subject: [PATCH ethtool 2/6] ethtool: fix RING_VF assignment
> 
> Fixes: 36ee712 ("ethtool: support queue and VF fields for rxclass filters")
> Cc: Jacob Keller 
> Signed-off-by: Ivan Vecera 
> ---
>  rxclass.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/rxclass.c b/rxclass.c
> index ce4b382..42d122d 100644
> --- a/rxclass.c
> +++ b/rxclass.c
> @@ -1066,7 +1066,7 @@ static int rxclass_get_val(char *str, unsigned char *p, u32 *flags,
>   val++;
> 
>   *(u64 *)&p[opt->offset] &= ~ETHTOOL_RX_FLOW_SPEC_RING_VF;
> - *(u64 *)&p[opt->offset] = (u64)val << ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> + *(u64 *)&p[opt->offset] |= (u64)val << ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
>   break;
>   }

Hah. Good catch.

Thanks,
Jake

>   case OPT_RING_QUEUE: {
> --
> 2.16.4



[bpf PATCH v2 0/2] bpf, sockmap IPv6/TCP state fixes

2018-06-08 Thread John Fastabend
ULPs are only valid with TCP in ESTABLISHED states. Sockmap was not
following this rule so add a fix to only allow ESTABLISHED states to
be added from the userspace side. On the BPF side we continue to allow
adding sockets to maps from sock_ops events, but only events that are
triggered when entering the ESTABLISHED state. This blocks users from
adding sockets to maps that will not be in the correct TCP state.

Also we stomped on the tcpv6_prot pointer, overwriting it with
tcp_prot. This was discovered by syzbot (thanks!) and not found by
selftests because we only have local tests in selftests, so even with
ipv6 selftests we did not trigger the splat.

Will follow up with IPv6 tests for selftests regardless; it seems like
a miss to not have any IPv6 selftests.

Also these need to go to stable. There will be a small conflict on
the second patch where we add the check to the sockhash update function,
which did not exist until recently.

---

John Fastabend (2):
  bpf: sockmap, fix crash when ipv6 sock is added
  bpf: sockmap only allow ESTABLISHED sock state


 kernel/bpf/sockmap.c |   56 ++
 1 file changed, 52 insertions(+), 4 deletions(-)

--
Signature


[bpf PATCH v2 2/2] bpf: sockmap only allow ESTABLISHED sock state

2018-06-08 Thread John Fastabend
Per the note in the TLS ULP (which is actually a generic statement
regarding ULPs)

 /* The TLS ulp is currently supported only for TCP sockets
  * in ESTABLISHED state.
  * Supporting sockets in LISTEN state will require us
  * to modify the accept implementation to clone rather then
  * share the ulp context.
  */

After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'.

Reported-by: Eric Dumazet 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index fa9b7f3..4921fb7 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1956,8 +1956,12 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
if (skops.sk->sk_type != SOCK_STREAM ||
-   skops.sk->sk_protocol != IPPROTO_TCP) {
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
fput(socket->file);
return -EOPNOTSUPP;
}
@@ -2318,6 +2322,16 @@ static int sock_hash_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
+   if (skops.sk->sk_type != SOCK_STREAM ||
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
+   fput(socket->file);
+   return -EOPNOTSUPP;
+   }
+
err = sock_hash_ctx_update_elem(&skops, map, key, flags);
fput(socket->file);
return err;
@@ -2403,10 +2417,23 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
.map_delete_elem = sock_hash_delete_elem,
 };
 
+static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)
+{
+   return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+  ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
+
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state. This checks that the sock ops triggering the update is
+* one indicating we are (or will be soon) in an ESTABLISHED state.
+*/
+   if (!bpf_is_valid_sock(bpf_sock))
+   return -EOPNOTSUPP;
return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2425,6 +2452,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (!bpf_is_valid_sock(bpf_sock))
+   return -EOPNOTSUPP;
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 



[bpf PATCH v2 1/2] bpf: sockmap, fix crash when ipv6 sock is added

2018-06-08 Thread John Fastabend
This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used. Further, only allow ESTABLISHED connections to join the
map per note in TLS ULP,

   /* The TLS ulp is currently supported only for TCP sockets
* in ESTABLISHED state.
* Supporting sockets in LISTEN state will require us
* to modify the accept implementation to clone rather then
* share the ulp context.
*/

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.
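
The per-family selection described above can be sketched outside the kernel
as follows. Everything here is a hypothetical stand-in: struct mini_proto,
the base_v4/base_v6 tables and install_hook() are illustration names, not
kernel symbols. The sketch only shows the idea of copying the table that
matches the socket's address family before overriding a hook, so an IPv6
socket never ends up with IPv4 callbacks.

```c
#include <sys/socket.h>

/* Hypothetical miniature of a protocol operations table. */
struct mini_proto {
	const char *name;
	int (*close)(void);
};

int v4_close(void) { return 4; }
int v6_close(void) { return 6; }
int hooked_close(void) { return 0; }

/* Stand-ins for the original IPv4 and IPv6 tables. */
const struct mini_proto base_v4 = { "tcp", v4_close };
const struct mini_proto base_v6 = { "tcpv6", v6_close };

/* One hooked copy per family, so the two never overwrite each other. */
struct mini_proto hooked_v4, hooked_v6;

/* Copy the table matching the family, then override only the close hook. */
const struct mini_proto *install_hook(int family)
{
	struct mini_proto *dst;

	if (family == AF_INET6) {
		hooked_v6 = base_v6;
		dst = &hooked_v6;
	} else {
		hooked_v4 = base_v4;
		dst = &hooked_v4;
	}
	dst->close = hooked_close;
	return dst;
}
```

Keeping two separate hooked tables is the point of the sketch: every field
that is not overridden still comes from the table of the right family.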

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f...@syzkaller.appspotmail.com
Signed-off-by: John Fastabend 
Signed-off-by: Wei Wang 
Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/sockmap.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..fa9b7f3 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -140,6 +141,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -162,6 +164,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 }
 
 static struct proto tcp_bpf_proto;
+static struct proto tcpv6_bpf_proto;
+
 static int bpf_tcp_init(struct sock *sk)
 {
struct smap_psock *psock;
@@ -181,14 +185,30 @@ static int bpf_tcp_init(struct sock *sk)
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
+   if (sk->sk_family == AF_INET6) {
+   tcpv6_bpf_proto = *sk->sk_prot;
+   tcpv6_bpf_proto.close = bpf_tcp_close;
+   } else {
+   tcp_bpf_proto = *sk->sk_prot;
+   tcp_bpf_proto.close = bpf_tcp_close;
+   }
+
if (psock->bpf_tx_msg) {
+   tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
+   tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
+   tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
+   tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
}
 
-   sk->sk_prot = &tcp_bpf_proto;
+   if (sk->sk_family == AF_INET6)
+   sk->sk_prot = &tcpv6_bpf_proto;
+   else
+   sk->sk_prot = &tcp_bpf_proto;
+
rcu_read_unlock();
return 0;
 }
@@ -1111,8 +1131,6 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
 
 static int bpf_tcp_ulp_register(void)
 {
-   tcp_bpf_proto = tcp_prot;
-   tcp_bpf_proto.close = bpf_tcp_close;
/* Once BPF TX ULP is registered it is never unregistered. It
 * will be in the ULP list for the lifetime of the system. Doing
 * duplicate registers is not a problem.
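
As a rough illustration of the idea in this patch (keep one overridden ops template per address family and point the socket at the one matching its family), here is a minimal userspace sketch. All type and function names below (proto_ops_sketch, sock_sketch, sketch_init and so on) are hypothetical stand-ins, not the kernel's struct proto or struct sock.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's address families. */
enum family_sketch { FAM_INET = 2, FAM_INET6 = 10 };

/* Hypothetical, heavily reduced version of a protocol ops table. */
struct proto_ops_sketch {
	const char *name;
	void (*close)(void);
};

struct sock_sketch {
	enum family_sketch family;
	const struct proto_ops_sketch *prot;
};

static void base_close(void) { }
static void bpf_close(void) { }

static const struct proto_ops_sketch tcp_base   = { "TCP",   base_close };
static const struct proto_ops_sketch tcpv6_base = { "TCPv6", base_close };

/* One overridden copy per address family, mirroring the patch. */
static struct proto_ops_sketch tcp_bpf;
static struct proto_ops_sketch tcpv6_bpf;

/* Copy the socket's current ops into the template for its family,
 * override close(), then point the socket at that template.
 */
static void sketch_init(struct sock_sketch *sk)
{
	struct proto_ops_sketch *tmpl =
		(sk->family == FAM_INET6) ? &tcpv6_bpf : &tcp_bpf;

	*tmpl = *sk->prot;
	tmpl->close = bpf_close;
	sk->prot = tmpl;
}
```

The point of the sketch is only that an IPv6 socket never ends up with ops copied from the IPv4 table; everything else about the real sockmap code is omitted.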



Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Eric Dumazet



On 06/08/2018 07:10 AM, Ben Greear wrote:
> Maybe whoever put this code together can take a stab at it.
> 

This was one of the motivations for the Fixes: tag request.

By doing a git blame, you can find which commit(s) added this code,
and thus CC the author, who might not follow netdev@ closely.



Re: [PATCH net] kcm: fix races on sk_receive_queue

2018-06-08 Thread David Miller
From: Paolo Abeni 
Date: Wed,  6 Jun 2018 15:16:29 +0200

> @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct 
> msghdr *msg,
>  
>   lock_sock(sk);
>  
> - skb = kcm_wait_data(sk, flags, timeo, &err);
> + skb = kcm_wait_data(sk, flags, peek, timeo, &err);
>   if (!skb)
>   goto out;
>  

Because kcm_wait_data() potentially unlinks now, you will have to kfree the
SKB in the error paths, for example if skb_copy_datagram_msg() fails.

Otherwise we have an SKB leak.

Yeah, it's kind of ugly that kcm_recvmsg() is going to become a pile of
conditional operations based upon the peek boolean. :-/
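
To make the ownership point concrete, here is a small userspace sketch, not the KCM code itself: once a buffer has been unlinked from the queue, the receive path owns it and has to free it on every exit, including a failed copy. The names buf_sketch, recv_sketch and live_bufs are made up for this illustration, and malloc/free stand in for SKB allocation.

```c
#include <stdlib.h>
#include <string.h>

struct buf_sketch {
	size_t len;
	char data[32];
};

static int live_bufs;	/* buffers currently allocated, for leak checking */

static struct buf_sketch *buf_alloc(const char *s)
{
	struct buf_sketch *b = malloc(sizeof(*b));
	size_t n = strlen(s);

	if (!b)
		return NULL;
	if (n > sizeof(b->data))
		n = sizeof(b->data);
	memcpy(b->data, s, n);
	b->len = n;
	live_bufs++;
	return b;
}

static void buf_free(struct buf_sketch *b)
{
	if (b) {
		free(b);
		live_bufs--;
	}
}

/* Fails when the destination is too small, like a failed copy to user. */
static int copy_out(const struct buf_sketch *b, char *dst, size_t dst_len)
{
	if (b->len > dst_len)
		return -1;
	memcpy(dst, b->data, b->len);
	return 0;
}

/* If !peek, the buffer was unlinked and is owned by this function, so it
 * is freed on both the success and the error path. With peek, the buffer
 * is still owned by the queue and must not be freed here.
 */
static int recv_sketch(struct buf_sketch *b, int peek, char *dst,
		       size_t dst_len)
{
	int ret = copy_out(b, dst, dst_len);

	if (!ret)
		ret = (int)b->len;
	if (!peek)
		buf_free(b);
	return ret;
}
```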



[PATCH] net: phy: Add TJA1100 BroadR-Reach PHY driver.

2018-06-08 Thread Kirill Kranke
Current generic PHY driver does not work with TJA1100 BroadR-REACH PHY
properly. TJA1100 does not have any standard ability enabled at MII_BMSR
register. Instead it has BroadR-REACH ability at MII_ESTATUS enabled, which
is not handled by generic driver yet. Therefore generic driver is unable to
guess required link speed, duplex etc. Device is started up with 10Mbps
halfduplex which is incorrect.

BroadR-REACH able flag is not specified in IEEE802.3-2015. Which is why I
did not add BroadR-REACH able flag support at generic driver. Once
BroadR-REACH able flag gets into IEEE802.3 it should be reasonable to
support it in the generic PHY driver.

Signed-off-by: Kirill Kranke 

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 343989f..7014eb7 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -422,6 +422,14 @@ config TERANETICS_PHY
---help---
  Currently supports the Teranetics TN2020
 
+config TJA1100_PHY
+   tristate "NXP TJA1100 PHY"
+   help
+ Support of NXP TJA1100 BroadR-REACH ethernet PHY.
+ Generic driver is not suitable for TJA1100 PHY while the PHY does not
+ advertise any standard IEEE capabilities. It uses BroadR-REACH able
+ flag instead. This driver configures capabilities of the PHY properly.
+
 config VITESSE_PHY
tristate "Vitesse PHYs"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 5805c0b..4d2a69d 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -83,5 +83,6 @@ obj-$(CONFIG_ROCKCHIP_PHY)+= rockchip.o
 obj-$(CONFIG_SMSC_PHY) += smsc.o
 obj-$(CONFIG_STE10XP)  += ste10Xp.o
 obj-$(CONFIG_TERANETICS_PHY)   += teranetics.o
+obj-$(CONFIG_TJA1100_PHY)  += tja1100.o
 obj-$(CONFIG_VITESSE_PHY)  += vitesse.o
 obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
diff --git a/drivers/net/phy/tja1100.c b/drivers/net/phy/tja1100.c
new file mode 100644
index 0000000..cddf4d7
--- /dev/null
+++ b/drivers/net/phy/tja1100.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+/* tja1100.c: TJA1100 BroadR-REACH PHY driver.
+ *
+ * Copyright (c) 2017 Kirill Kranke 
+ * Author: Kirill Kranke 
+ */
+
+#include 
+#include 
+#include 
+
+static int tja1100_phy_config_init(struct phy_device *phydev)
+{
+   phydev->autoneg = AUTONEG_DISABLE;
+   phydev->speed = SPEED_100;
+   phydev->duplex = DUPLEX_FULL;
+
+   return 0;
+}
+
+static int tja1100_phy_config_aneg(struct phy_device *phydev)
+{
+   if (phydev->autoneg == AUTONEG_ENABLE) {
+   phydev_err(phydev, "autonegotiation is not supported\n");
+   return -EINVAL;
+   }
+
+   if (phydev->speed != SPEED_100 || phydev->duplex != DUPLEX_FULL) {
+   phydev_err(phydev, "only 100Mbps Full Duplex allowed\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static struct phy_driver tja1100_phy_driver[] = {
+   {
+   .phy_id = 0x0180dc48,
+   .phy_id_mask = 0xfffffff0,
+   .name = "NXP TJA1100",
+
+   /* TJA1100 has only 100BASE-BroadR-REACH ability specified
+* at MII_ESTATUS register. Standard modes are not
+* supported. Therefore BroadR-REACH allow only 100Mbps
+* full duplex without autoneg.
+*/
+   .features = SUPPORTED_100baseT_Full | SUPPORTED_MII,
+
+   .config_aneg = tja1100_phy_config_aneg,
+   .config_init = tja1100_phy_config_init,
+
+   .suspend = genphy_suspend,
+   .resume = genphy_resume,
+   }
+};
+
+module_phy_driver(tja1100_phy_driver);
+
+MODULE_DESCRIPTION("NXP TJA1100 driver");
+MODULE_AUTHOR("Kirill Kranke ");
+MODULE_LICENSE("GPL");
+
+static struct mdio_device_id __maybe_unused nxp_tbl[] = {
+   { 0x0180dc48, 0xfffffff0 },
+   {}
+};
+
+MODULE_DEVICE_TABLE(mdio, nxp_tbl);
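
The config_aneg logic in this version of the driver amounts to rejecting anything other than a forced 100 Mbps full-duplex link. A standalone sketch of that check follows; the struct and the SK_* constants are invented for the example and are not the phylib definitions.

```c
#include <errno.h>

/* Hypothetical stand-ins for phylib's autoneg and duplex constants. */
enum { SK_AUTONEG_DISABLE = 0, SK_AUTONEG_ENABLE = 1 };
enum { SK_DUPLEX_HALF = 0, SK_DUPLEX_FULL = 1 };

struct link_cfg_sketch {
	int autoneg;
	int speed_mbps;
	int duplex;
};

/* Accept only a forced 100 Mbps full-duplex configuration. */
static int fixed_link_check(const struct link_cfg_sketch *cfg)
{
	if (cfg->autoneg == SK_AUTONEG_ENABLE)
		return -EINVAL;

	if (cfg->speed_mbps != 100 || cfg->duplex != SK_DUPLEX_FULL)
		return -EINVAL;

	return 0;
}
```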


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 05:13 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM,   wrote:
>> From: Ben Greear 
>>
>> While testing an ath10k firmware that often crashed under load,
>> I was seeing kernel crashes as well.  One of them appeared to
>> be a dereference of a NULL flow object in fq_tin_dequeue.
>>
>> I have since fixed the firmware flaw, but I think it would be
>> worth adding the WARN_ON in case the problem appears again.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 003c
>> IP: ieee80211_tx_dequeue+0xfb/0xb10 [mac80211]
>
> Instead of adding WARN_ON(), you need to think about
> the locking there, it is suspicious:
>
> fq is from struct ieee80211_local:
>
> struct fq *fq = &local->fq;
>
> tin is from struct txq_info:
>
> struct fq_tin *tin = &txqi->tin;
>
> I don't know if fq and tin are supposed to be 1:1, if not there is
> a bug in the locking, because ->new_flows and ->old_flows are
> both inside tin instead of fq, but they are protected by fq->lock


Maybe whoever put this code together can take a stab at it.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-08 Thread Ben Greear




On 06/07/2018 04:59 PM, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 4:48 PM,   wrote:
>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>> index be7c0fa..cb911f0 100644
>> --- a/include/net/fq_impl.h
>> +++ b/include/net/fq_impl.h
>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>> return NULL;
>> }
>>
>> -   flow = list_first_entry(head, struct fq_flow, flowchain);
>> +   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
>> +
>> +   if (WARN_ON_ONCE(!flow))
>> +   return NULL;
>
> This does not make sense either. list_first_entry_or_null()
> returns NULL only when the list is empty, but we already check
> list_empty() right before this code, and it is protected by fq->lock.



Nevermind then.

Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com
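
For readers less familiar with the kernel list helpers, the point made in the quoted review can be seen in a tiny userspace model of a circular list: a "first entry or NULL" lookup returns NULL only when the head points back to itself, so an emptiness check just before it already covers that case. This is a simplified sketch with made-up names (list_node, flow_sketch, first_flow_or_null), not the kernel's list.h.

```c
#include <stddef.h>

/* Minimal circular doubly-linked list, modelled loosely on the kernel's. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_node *h)
{
	return h->next == h;
}

static void list_add_tail_sketch(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

struct flow_sketch {
	int id;
	struct list_node chain;
};

/* Returns NULL only for an empty list; otherwise recovers the containing
 * structure from the embedded node.
 */
static struct flow_sketch *first_flow_or_null(struct list_node *h)
{
	if (list_is_empty(h))
		return NULL;
	return (struct flow_sketch *)((char *)h->next -
				      offsetof(struct flow_sketch, chain));
}
```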


Re: [PATCH] net: phy: Add TJA1100 BroadR-Reach PHY driver.

2018-06-08 Thread Andrew Lunn
On Fri, Jun 08, 2018 at 12:56:39PM +0300, Kirill Kranke wrote:
> From: Kirill Kranke 
> 
> Current generic PHY driver does not work with TJA1100 BroadR-REACH PHY
> properly. TJA1100 does not have any standard ability enabled at MII_BMSR
> register. Instead it has BroadR-REACH ability at MII_ESTATUS enabled, which
> is not handled by generic driver yet. Therefore generic driver is unable to
> guess required link speed, duplex etc. Device is started up with 10Mbps
> halfduplex which is incorrect.
> 
> BroadR-REACH able flag is not specified in IEEE802.3-2015. Which is why I
> did not add BroadR-REACH able flag support at generic driver. Once
> BroadR-REACH able flag gets into IEEE802.3 it should be reasonable to
> support it in the generic PHY driver.
> 
> Signed-off-by: Kirill Kranke 
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 343989f..7014eb7 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -422,6 +422,14 @@ config TERANETICS_PHY
>   ---help---
> Currently supports the Teranetics TN2020
>  
> +config TJA1100_PHY
> + tristate "NXP TJA1100 PHY"
> + help
> +   Support of NXP TJA1100 BroadR-REACH ethernet PHY.
> +   Generic driver is not suitable for TJA1100 PHY while the PHY does not
> +   advertise any standard IEEE capabilities. It uses BroadR-REACH able
> +   flag instead. This driver configures capabilities of the PHY properly.
> +
>  config VITESSE_PHY
>   tristate "Vitesse PHYs"
>   ---help---
> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
> index 5805c0b..4d2a69d 100644
> --- a/drivers/net/phy/Makefile
> +++ b/drivers/net/phy/Makefile
> @@ -83,5 +83,6 @@ obj-$(CONFIG_ROCKCHIP_PHY)  += rockchip.o
>  obj-$(CONFIG_SMSC_PHY)   += smsc.o
>  obj-$(CONFIG_STE10XP)+= ste10Xp.o
>  obj-$(CONFIG_TERANETICS_PHY) += teranetics.o
> +obj-$(CONFIG_TJA1100_PHY)+= tja1100.o
>  obj-$(CONFIG_VITESSE_PHY)+= vitesse.o
>  obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
> diff --git a/drivers/net/phy/tja1100.c b/drivers/net/phy/tja1100.c
> new file mode 100644
> index 0000000..081b580
> --- /dev/null
> +++ b/drivers/net/phy/tja1100.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* tja1100.c: TJA1100 BroadR-REACH PHY driver.
> + *
> + * Copyright (c) 2017 Kirill Kranke 
> + * Author: Kirill Kranke 
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +/* TJA1100 specific registers */
> +#define TJA1100_ECTRL0x11/* Extended control register */
> +#define TJA1100_CFG1 0x12/* Configuration register 1 */
> +#define TJA1100_CFG2 0x13/* Configuration register 2 */
> +#define TJA1100_SERRCNT  0x14/* Symbol error counter register 2 */
> +#define TJA1100_INTST0x15/* Interrupt status register */
> +#define TJA1100_INTEN0x16/* Interrupt enable register */
> +#define TJA1100_COMST0x17/* Communication status register */
> +#define TJA1100_GST  0x18/* General status register */
> +#define TJA1100_EXTST0x19/* External status register */
> +#define TJA1100_LFCNT0x1a/* Link fail counter register */
> +
> +/* Extended control register */
> +#define ECTRL_LC 0x8000  /* link control enable */
> +#define ECTRL_PM 0x7800  /* operating mode select */
> +#define ECTRL_PM_NOCNG   0x0000  /* PM == 0000: no change */
> +#define ECTRL_PM_NORMAL  0x1800  /* PM == 0011: Normal mode */
> +#define ECTRL_PM_STANBY  0x6000  /* PM == 1100: Standby mode */
> +#define ECTRL_PM_SREQ0x5800  /* PM == 1011: Sleep Request mode */
> +#define ECTRL_SJ_TST 0x0400  /* enable/disable Slave jitter test */
> +#define ECTRL_TR_RST 0x0200  /* Autonegotiation process restart */
> +#define ECTRL_TST_MODE   0x01c0  /* test mode selection */
> +#define ECTRL_C_TST  0x0020  /* TDR-based cable test */
> +#define ECTRL_LOOPBACK   0x0018  /* loopback mode select */
> +#define ECTRL_CFGEN  0x0004  /* configuration register access */
> +#define ECTRL_CFGINH 0x0002  /* INH configuration */
> +#define ECTRL_WAKE_REQ   0x0001  /* wake-up request configuration */
> +
> +/* Configuration register 1 */
> +#define CFG1_MS  0x8000  /* PHY Master/Slave configuration */
> +#define CFG1_AUTO_OP 0x4000  /* managed/autonomous operation */
> +#define CFG1_LINKLEN 0x2000  /* cable length: 0 < 15 m; 1 > 15 m */
> +#define CFG1_TXAMP   0x0c00  /* nominal transmit amplitude */
> +#define CFG1_TXAMP_050   0x0000  /* TXAMP == 00: 500 mV */
> +#define CFG1_TXAMP_075   0x0200  /* TXAMP == 01: 750 mV */
> +#define CFG1_TXAMP_100   0x0400  /* TXAMP == 10: 1000 mV */
> +#define CFG1_TXAMP_125   0x0c00  /* TXAMP == 11: 1250 mV */
> +#define CFG1_MODE0x0300  /* MII/RMII mode */
> +#define CFG1_DRIVER  0x0080  /* MII output driver strength */
> +#define CFG1_SC  0x0040  /* sleep confirmation setting */
> +#define CFG1_LED_MODE0x0030  /* LED mode */

[PATCH 1/2] iproute2: Add support for a few routing protocols

2018-06-08 Thread Donald Sharp
Add support for:

BGP
ISIS
OSPF
RIP
EIGRP

Routing protocols to iproute2.

Signed-off-by: Donald Sharp 
---
 etc/iproute2/rt_protos| 5 +
 include/linux/rtnetlink.h | 5 +
 lib/rt_names.c| 5 +
 3 files changed, 15 insertions(+)

diff --git a/etc/iproute2/rt_protos b/etc/iproute2/rt_protos
index 82cf9c46..3ffe8a6c 100644
--- a/etc/iproute2/rt_protos
+++ b/etc/iproute2/rt_protos
@@ -16,6 +16,11 @@
 15 ntk
 16  dhcp
 42 babel
+186 bgp
+187 isis
+188 ospf
+189 rip
+192 eigrp
 
 #
 #  Used by me for gated
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 742ba078..2e83a267 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -248,6 +248,11 @@ enum {
 #define RTPROT_DHCP16  /* DHCP client */
 #define RTPROT_MROUTED 17  /* Multicast daemon */
 #define RTPROT_BABEL   42  /* Babel daemon */
+#define RTPROT_BGP 186 /* BGP Routes */
+#define RTPROT_ISIS187 /* ISIS Routes */
+#define RTPROT_OSPF188 /* OSPF Routes */
+#define RTPROT_RIP 189 /* RIP Routes */
+#define RTPROT_EIGRP   192 /* EIGRP Routes */
 
 /* rtm_scope
 
diff --git a/lib/rt_names.c b/lib/rt_names.c
index 253389a6..d3562d2d 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -137,6 +137,11 @@ static char * rtnl_rtprot_tab[256] = {
[RTPROT_XORP] = "xorp",
[RTPROT_NTK] = "ntk",
[RTPROT_DHCP] = "dhcp",
+   [RTPROT_BGP] = "bgp",
+   [RTPROT_ISIS] = "isis",
+   [RTPROT_OSPF] = "ospf",
+   [RTPROT_RIP] = "rip",
+   [RTPROT_EIGRP] = "eigrp",
 };
 
 
-- 
2.14.4
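
The lib/rt_names.c change relies on a sparse 256-entry table indexed by protocol number. A self-contained sketch of that lookup, using the numeric values from this patch, might look like the following; the SK_-prefixed macros and the function names are local to the example, not iproute2 symbols.

```c
#include <stddef.h>
#include <string.h>

/* Values taken from the patch above; the SK_ prefix marks them as local. */
#define SK_RTPROT_BABEL	42
#define SK_RTPROT_BGP	186
#define SK_RTPROT_ISIS	187
#define SK_RTPROT_OSPF	188
#define SK_RTPROT_RIP	189
#define SK_RTPROT_EIGRP	192

/* Sparse table: unlisted slots are NULL thanks to designated initializers. */
static const char *const rtprot_tab_sketch[256] = {
	[SK_RTPROT_BABEL] = "babel",
	[SK_RTPROT_BGP]   = "bgp",
	[SK_RTPROT_ISIS]  = "isis",
	[SK_RTPROT_OSPF]  = "ospf",
	[SK_RTPROT_RIP]   = "rip",
	[SK_RTPROT_EIGRP] = "eigrp",
};

/* Number to name; NULL when the protocol has no symbolic name. */
static const char *rtprot_name(unsigned int id)
{
	if (id >= 256)
		return NULL;
	return rtprot_tab_sketch[id];
}

/* Name to number; -1 when the name is unknown. */
static int rtprot_id(const char *name)
{
	int i;

	for (i = 0; i < 256; i++) {
		if (rtprot_tab_sketch[i] &&
		    strcmp(rtprot_tab_sketch[i], name) == 0)
			return i;
	}
	return -1;
}
```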



[PATCH 0/2] Addition of new routing protocols for iproute2

2018-06-08 Thread Donald Sharp
The linux kernel recently accepted some new RTPROT values for some
fairly standard routing protocols.  This commit brings in support
for iproute2 to handle these new values.

Additionally clean up some long standing cruft in etc/iproute2/rt_protos

Donald Sharp (2):
  iproute2: Add support for a few routing protocols
  iproute2: Remove leftover gated RT_PROT defines

 etc/iproute2/rt_protos| 18 +-
 include/linux/rtnetlink.h |  5 +
 lib/rt_names.c|  5 +
 3 files changed, 15 insertions(+), 13 deletions(-)

-- 
2.14.4



[PATCH 2/2] iproute2: Remove leftover gated RT_PROT defines

2018-06-08 Thread Donald Sharp
These values are not being used nor maintained, so remove.

Signed-off-by: Donald Sharp 
---
 etc/iproute2/rt_protos | 13 -
 1 file changed, 13 deletions(-)

diff --git a/etc/iproute2/rt_protos b/etc/iproute2/rt_protos
index 3ffe8a6c..a965ad16 100644
--- a/etc/iproute2/rt_protos
+++ b/etc/iproute2/rt_protos
@@ -21,16 +21,3 @@
 188 ospf
 189 rip
 192 eigrp
-
-#
-#  Used by me for gated
-#
-254gated/aggr
-253gated/bgp
-252gated/ospf
-251gated/ospfase
-250gated/rip
-249gated/static
-248gated/conn
-247gated/inet
-246gated/default
-- 
2.14.4



[PATCH] net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620

2018-06-08 Thread Alvaro Gamez Machado
DP83620 register set is compatible with the DP83848, but it also supports
100base-FX. When the hardware is configured such that fiber mode is
enabled, autonegotiation is not possible.

The chip, however, doesn't expose this information via BMSR_ANEGCAPABLE.
Instead, this bit is always set high, even if the particular hardware
configuration makes it so that auto negotiation is not possible [1]. Under
these circumstances, the phy subsystem keeps trying for autonegotiation to
happen, without success.

Hereby, we inspect BMCR_ANENABLE bit after genphy_config_init, which on
reset is set to 0 when auto negotiation is disabled, and so we use this
value instead of BMSR_ANEGCAPABLE.

[1] https://e2e.ti.com/support/interface/ethernet/f/903/p/697165/2571170

Signed-off-by: Alvaro Gamez Machado 
---
 drivers/net/phy/dp83848.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
index cd09c3af2117..6e8e42361fd5 100644
--- a/drivers/net/phy/dp83848.c
+++ b/drivers/net/phy/dp83848.c
@@ -74,6 +74,25 @@ static int dp83848_config_intr(struct phy_device *phydev)
return phy_write(phydev, DP83848_MICR, control);
 }
 
+static int dp83848_config_init(struct phy_device *phydev)
+{
+   int err;
+   int val;
+
+   err = genphy_config_init(phydev);
+   if (err < 0)
+   return err;
+
+   /* DP83620 always reports Auto Negotiation Ability on BMSR. Instead,
+* we check initial value of BMCR Auto negotiation enable bit
+*/
+   val = phy_read(phydev, MII_BMCR);
+   if (!(val & BMCR_ANENABLE))
+   phydev->autoneg = AUTONEG_DISABLE;
+
+   return 0;
+}
+
 static struct mdio_device_id __maybe_unused dp83848_tbl[] = {
{ TI_DP83848C_PHY_ID, 0xfffffff0 },
{ NS_DP83848C_PHY_ID, 0xfffffff0 },
@@ -83,7 +102,7 @@ static struct mdio_device_id __maybe_unused dp83848_tbl[] = {
 };
 MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
 
-#define DP83848_PHY_DRIVER(_id, _name) \
+#define DP83848_PHY_DRIVER(_id, _name, _config_init)   \
{   \
.phy_id = _id,  \
.phy_id_mask= 0xfffffff0,   \
@@ -92,7 +111,7 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
.flags  = PHY_HAS_INTERRUPT,\
\
.soft_reset = genphy_soft_reset,\
-   .config_init= genphy_config_init,   \
+   .config_init= _config_init, \
.suspend= genphy_suspend,   \
.resume = genphy_resume,\
\
@@ -102,10 +121,14 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
}
 
 static struct phy_driver dp83848_driver[] = {
-   DP83848_PHY_DRIVER(TI_DP83848C_PHY_ID, "TI DP83848C 10/100 Mbps PHY"),
-   DP83848_PHY_DRIVER(NS_DP83848C_PHY_ID, "NS DP83848C 10/100 Mbps PHY"),
-   DP83848_PHY_DRIVER(TI_DP83620_PHY_ID, "TI DP83620 10/100 Mbps PHY"),
-   DP83848_PHY_DRIVER(TLK10X_PHY_ID, "TI TLK10X 10/100 Mbps PHY"),
+   DP83848_PHY_DRIVER(TI_DP83848C_PHY_ID, "TI DP83848C 10/100 Mbps PHY",
+  genphy_config_init),
+   DP83848_PHY_DRIVER(NS_DP83848C_PHY_ID, "NS DP83848C 10/100 Mbps PHY",
+  genphy_config_init),
+   DP83848_PHY_DRIVER(TI_DP83620_PHY_ID, "TI DP83620 10/100 Mbps PHY",
+  dp83848_config_init),
+   DP83848_PHY_DRIVER(TLK10X_PHY_ID, "TI TLK10X 10/100 Mbps PHY",
+  genphy_config_init),
 };
 module_phy_driver(dp83848_driver);
 
-- 
2.17.1
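
The difference between the two register bits discussed above can be shown with a couple of pure helpers. This is only a sketch: the SK_ macros carry what I believe are the linux/mii.h values for BMCR_ANENABLE and BMSR_ANEGCAPABLE, and the sample register contents in the usage note are invented, not read from a DP83620.

```c
#include <stdint.h>

/* Assumed to match linux/mii.h; SK_ prefix marks them as local copies. */
#define SK_BMCR_ANENABLE	0x1000	/* BMCR: autonegotiation enabled */
#define SK_BMSR_ANEGCAPABLE	0x0008	/* BMSR: able to autonegotiate */

/* What a generic driver would conclude from the status register. */
static int autoneg_from_bmsr(uint16_t bmsr)
{
	return (bmsr & SK_BMSR_ANEGCAPABLE) != 0;
}

/* What this patch looks at instead: the reset value of the control bit. */
static int autoneg_from_bmcr(uint16_t bmcr)
{
	return (bmcr & SK_BMCR_ANENABLE) != 0;
}
```

With a hypothetical fiber-strapped part reporting BMSR 0x7849 and BMCR 0x2100, the first helper says autonegotiation is possible while the second says it is disabled, which is the mismatch the patch works around.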



Re: Qualcomm rmnet driver and qmi_wwan

2018-06-08 Thread Daniele Palmas
Hi Dan and Subash,

2018-06-05 19:38 GMT+02:00 Subash Abhinov Kasiviswanathan:
> On 2018-06-05 08:54, Dan Williams wrote:
>>
>> On Tue, 2018-06-05 at 11:38 +0200, Daniele Palmas wrote:
>>>
>>> Hi,
>>>
>>> 2018-02-21 20:47 GMT+01:00 Subash Abhinov Kasiviswanathan:
>>> > On 2018-02-21 04:38, Daniele Palmas wrote:
>>> > >
>>> > > Hello,
>>> > >
>>> > > in rmnet kernel documentation I read:
>>> > >
>>> > > "This driver can be used to register onto any physical network
>>> > > device in
>>> > > IP mode. Physical transports include USB, HSIC, PCIe and IP
>>> > > accelerator."
>>> > >
>>> > > Does this mean that it can be used in association with the
>>> > > qmi_wwan
>>> > > driver?
>>> > >
>>> > > If yes, can someone give me an hint on the steps to follow?
>>> > >
>>> > > If not, does anyone know if it is possible to modify qmi_wwan in
>>> > > order
>>> > > to take advantage of the features provided by the rmnet driver?
>>> > >
>>> > > In this case hint on the changes for modifying qmi_wwan are
>>> > > welcome.
>>> > >
>>> > > Thanks in advance,
>>> > > Daniele
>>> >
>>> >
>>> > Hi
>>> >
>>> > I haven't used qmi_wwan so the following comment is based on code
>>> > inspection.
>>> > qmimux_register_device() is creating qmimux devices with usb net
>>> > device as
>>> > real_dev. The Multiplexing and aggregation header (qmimux_hdr) is
>>> > stripped
>>> > off
>>> > in qmimux_rx_fixup() and the packet is passed on to stack.
>>> >
>>> > You could instead create rmnet devices with the usb netdevice as
>>> > real dev.
>>> > The packets from the usb net driver can be queued to network stack
>>> > directly
>>> > as rmnet driver will setup a RX handler. rmnet driver will process
>>> > the
>>> > packets
>>> > further and then queue to network stack.
>>> >
>>>
>>> in kernel documentation I read that rmnet user space configuration is
>>> done through librmnetctl available at
>>>
>>> https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
>>>
>>> However it seems to me that this is a bit outdated (e.g. it does not
>>> properly build since it is looking for kernel header
>>> linux/rmnet_data.h that, as far as I understand, is no more present).
>>>
>>> Is a more recent version of the tool available?
>
>
> Hi Daniele
>
> The attached patch should have an updated version of the tool.
> Usage -
>
> rmnetcli -n newlink wwan0 rmnet0 1 1
> where wwan0 is the physical device
> rmnet0 is the virtual device to be created
> 1 is the mux id
> the other 1 is the flag to configure DL de-aggregation by default
>
> To delete a device -
>
> ip link delete rmnet0
>
>>
>> I'd expect that somebody (Subash?) would add support for the
>> rmnet/qmimux options to iproute2 via 'ip link' like exists for most
>> other device types.
>
>
> Hi Dan
>
> Yes, I can do that and update the documentation to point to using iproute2.
>

I followed Dan's advice and prepared a very basic test patch
(attached) for testing it through ip link.

Basically things seem to be working properly with qmicli, but I needed
to modify qmi_wwan a bit, so I'm adding Bjørn, who may be able to help.

Bjørn,

I'm trying to add support for rmnet in qmi_wwan: I had to modify the
code as in the attached test patch, but I'm not sure it is the right
way.

This is done under the assumption that the rmnet device would be the
only one to register an rx handler to qmi_wwan, but it is probably
wrong.

Basically I'm wondering if there is a more correct way to determine
whether an rmnet device is linked to the real qmi_wwan device.

Thanks,
Daniele

>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
From 9c1777d4d93238703172c5e88aaeb9d8b3e372eb Mon Sep 17 00:00:00 2001
From: Daniele Palmas 
Date: Fri, 8 Jun 2018 12:02:49 +0200
Subject: [PATCH 1/1] usb: net: qmi_wwan: add support for rmnet device

This patch allows rmnet to be used on top of qmi_wwan to create
network interfaces.

Signed-off-by: Daniele Palmas 
---
 drivers/net/usb/qmi_wwan.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 0946808..dd5f278 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -479,6 +479,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	if (info->flags & QMI_WWAN_FLAG_MUX)
 		return qmimux_rx_fixup(dev, skb);
 
+	if (rcu_access_pointer(dev->net->rx_handler)) {
+		skb->protocol = htons(ETH_P_MAP);
+		return 1;
+	}
+
 	switch (skb->data[0] & 0xf0) {
 	case 0x40:
 		proto = htons(ETH_P_IP);
-- 
2.7.4
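
The rx path touched by this test patch can be summarised as: if something has registered an rx handler on the netdev, hand the frame over as a MAP frame; otherwise pick the protocol from the IP version nibble. Below is a reduced sketch of that decision only. The function name, the has_rx_handler flag and the SK_ constants are illustrative; the EtherType values are what I believe ETH_P_IP, ETH_P_IPV6 and ETH_P_MAP are, and the real qmi_wwan_rx_fixup() handles more cases than shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed EtherType values, in host byte order for simplicity. */
#define SK_ETH_P_IP	0x0800
#define SK_ETH_P_IPV6	0x86DD
#define SK_ETH_P_MAP	0x00F9

/* Returns the protocol to tag the frame with, or 0 if it is not handled
 * by this sketch.
 */
static uint16_t classify_rx(const uint8_t *data, size_t len,
			    int has_rx_handler)
{
	if (len == 0)
		return 0;

	/* An rx handler (assumed here to be rmnet) wants the raw MAP frame. */
	if (has_rx_handler)
		return SK_ETH_P_MAP;

	/* Raw IP: the top nibble of the first byte is the IP version. */
	switch (data[0] & 0xf0) {
	case 0x40:
		return SK_ETH_P_IP;
	case 0x60:
		return SK_ETH_P_IPV6;
	default:
		return 0;
	}
}
```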

From 88bdb27b6d535600c10ef446391a51fc56691350 Mon Sep 17 00:00:00 2001
From: Daniele Palmas 
Date: Fri, 8 Jun 2018 11:43:49 +0200
Subject: [PATCH 1/1] ip: add rmnet initial support

This patch adds basic support for rmnet devices.

Currently the only possible actions are creating a new link
with a specific mux id and removing a link.

Signed-off-by: Daniele Palmas 
---
 ip/Makefile   |  2 +-
 

Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-08 Thread Maciej Żenczykowski
I think we probably need to make sk->sk_reuse back into a boolean.
(ie. eliminate SK_FORCE_REUSE)

Then add a new tcp/udp sk->ignore_bind_conflicts boolean setting...
(ie. not just for tcp, but sol_socket)  [or perhaps SO_REPAIR,
sk->repair or something]

What I'm not certain of is exactly what sorts of conflicts it should ignore...
all?  probably not, still seems utterly wrong to allow creation of 2 connected
tcp sockets with identical 5-tuples.

Would it only ignore conflicts against other i_b_c sockets?
ie. set it on all sockets as we're repairing, then clear it on them
all once we're done?

and ignore all the fast caching when checking conflicts for an i_b_c socket?

For CRIU is it safe to assume we're restoring an entire namespace into
a new namespace?

Could we perhaps instead allow a new namespace to ignore bind conflicts until
we flip it into enforcing mode?


[PATCH] net: phy: Add TJA1100 BroadR-Reach PHY driver.

2018-06-08 Thread Kirill Kranke
From: Kirill Kranke 

Current generic PHY driver does not work with TJA1100 BroadR-REACH PHY
properly. TJA1100 does not have any standard ability enabled at MII_BMSR
register. Instead it has BroadR-REACH ability at MII_ESTATUS enabled, which
is not handled by generic driver yet. Therefore generic driver is unable to
guess required link speed, duplex etc. Device is started up with 10Mbps
halfduplex which is incorrect.

BroadR-REACH able flag is not specified in IEEE802.3-2015. Which is why I
did not add BroadR-REACH able flag support at generic driver. Once
BroadR-REACH able flag gets into IEEE802.3 it should be reasonable to
support it in the generic PHY driver.

Signed-off-by: Kirill Kranke 

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 343989f..7014eb7 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -422,6 +422,14 @@ config TERANETICS_PHY
---help---
  Currently supports the Teranetics TN2020
 
+config TJA1100_PHY
+   tristate "NXP TJA1100 PHY"
+   help
+ Support of NXP TJA1100 BroadR-REACH ethernet PHY.
+ Generic driver is not suitable for TJA1100 PHY while the PHY does not
+ advertise any standard IEEE capabilities. It uses BroadR-REACH able
+ flag instead. This driver configures capabilities of the PHY properly.
+
 config VITESSE_PHY
tristate "Vitesse PHYs"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 5805c0b..4d2a69d 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -83,5 +83,6 @@ obj-$(CONFIG_ROCKCHIP_PHY)+= rockchip.o
 obj-$(CONFIG_SMSC_PHY) += smsc.o
 obj-$(CONFIG_STE10XP)  += ste10Xp.o
 obj-$(CONFIG_TERANETICS_PHY)   += teranetics.o
+obj-$(CONFIG_TJA1100_PHY)  += tja1100.o
 obj-$(CONFIG_VITESSE_PHY)  += vitesse.o
 obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
diff --git a/drivers/net/phy/tja1100.c b/drivers/net/phy/tja1100.c
new file mode 100644
index 0000000..081b580
--- /dev/null
+++ b/drivers/net/phy/tja1100.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+/* tja1100.c: TJA1100 BroadR-REACH PHY driver.
+ *
+ * Copyright (c) 2017 Kirill Kranke 
+ * Author: Kirill Kranke 
+ */
+
+#include 
+#include 
+#include 
+
+/* TJA1100 specific registers */
+#define TJA1100_ECTRL  0x11/* Extended control register */
+#define TJA1100_CFG1   0x12/* Configuration register 1 */
+#define TJA1100_CFG2   0x13/* Configuration register 2 */
+#define TJA1100_SERRCNT0x14/* Symbol error counter register 2 */
+#define TJA1100_INTST  0x15/* Interrupt status register */
+#define TJA1100_INTEN  0x16/* Interrupt enable register */
+#define TJA1100_COMST  0x17/* Communication status register */
+#define TJA1100_GST0x18/* General status register */
+#define TJA1100_EXTST  0x19/* External status register */
+#define TJA1100_LFCNT  0x1a/* Link fail counter register */
+
+/* Extended control register */
+#define ECTRL_LC   0x8000  /* link control enable */
+#define ECTRL_PM   0x7800  /* operating mode select */
+#define ECTRL_PM_NOCNG 0x0000  /* PM == 0000: no change */
+#define ECTRL_PM_NORMAL0x1800  /* PM == 0011: Normal mode */
+#define ECTRL_PM_STANBY0x6000  /* PM == 1100: Standby mode */
+#define ECTRL_PM_SREQ  0x5800  /* PM == 1011: Sleep Request mode */
+#define ECTRL_SJ_TST   0x0400  /* enable/disable Slave jitter test */
+#define ECTRL_TR_RST   0x0200  /* Autonegotiation process restart */
+#define ECTRL_TST_MODE 0x01c0  /* test mode selection */
+#define ECTRL_C_TST0x0020  /* TDR-based cable test */
+#define ECTRL_LOOPBACK 0x0018  /* loopback mode select */
+#define ECTRL_CFGEN0x0004  /* configuration register access */
+#define ECTRL_CFGINH   0x0002  /* INH configuration */
+#define ECTRL_WAKE_REQ 0x0001  /* wake-up request configuration */
+
+/* Configuration register 1 */
+#define CFG1_MS0x8000  /* PHY Master/Slave configuration */
+#define CFG1_AUTO_OP   0x4000  /* managed/autonomous operation */
+#define CFG1_LINKLEN   0x2000  /* cable length: 0 < 15 m; 1 > 15 m */
+#define CFG1_TXAMP 0x0c00  /* nominal transmit amplitude */
+#define CFG1_TXAMP_050 0x0000  /* TXAMP == 00: 500 mV */
+#define CFG1_TXAMP_075 0x0200  /* TXAMP == 01: 750 mV */
+#define CFG1_TXAMP_100 0x0400  /* TXAMP == 10: 1000 mV */
+#define CFG1_TXAMP_125 0x0c00  /* TXAMP == 11: 1250 mV */
+#define CFG1_MODE  0x0300  /* MII/RMII mode */
+#define CFG1_DRIVER0x0080  /* MII output driver strength */
+#define CFG1_SC0x0040  /* sleep confirmation setting */
+#define CFG1_LED_MODE  0x0030  /* LED mode */
+#define CFG1_LED_EN0x0008  /* LED enable */
+#define CFG1_CFG_WAKE  0x0004  /* local wake configuration */
+#define CFG1_APWD  0x0002  /* autonomous power down */
+#define CFG1_LPS   0x0001  /* LPS code group reception */
+
+/* Configuration register 2 */
+#define CFG2_PHYAD_4_0 0xf800  /* PHY 

Re: [PATCH bpf-next v6 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-08 Thread Toke Høiland-Jørgensen
Daniel Borkmann  writes:

> Hi Toke,
>
> On 06/06/2018 07:58 PM, Toke Høiland-Jørgensen wrote:
>> Add two new helper functions to trace_helpers that supports polling
>> multiple perf file descriptors for events. These are used by the XDP
>> perf_event_output example, which needs to work with one perf fd per CPU.
>> 
>> Reviewed-by: Jakub Kicinski 
>> Signed-off-by: Toke Høiland-Jørgensen 
>
> Didn't make it in time unfortunately as bpf-next closed, please resend
> these two once merge window is over and bpf-next open again.

Will do!

-Toke


Re: [PATCH ethtool 5/6] ethtool: correctly free hkey when get_stringset() fails

2018-06-08 Thread Gal Pressman
On 08-Jun-18 12:20, Ivan Vecera wrote:
> Memory allocated for 'hkey' is not freed when
> get_stringset(..., ETH_SS_RSS_HASH_FUNCS...) fails.
> 
> Fixes: b888f35 ("ethtool: Support for configurable RSS hash function")

Thanks for fixing this!
Please use the first 12 characters of the sha1 in the Fixes line.

> Cc: Gal Pressman 
> Signed-off-by: Ivan Vecera 
> ---
>  ethtool.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/ethtool.c b/ethtool.c
> index 2b90984..fb93ae8 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -3910,7 +3910,7 @@ static int do_srxfhindir(struct cmd_context *ctx, int 
> rxfhindir_default,
>  static int do_srxfh(struct cmd_context *ctx)
>  {
>   struct ethtool_rxfh rss_head = {0};
> - struct ethtool_rxfh *rss;
> + struct ethtool_rxfh *rss = NULL;
>   struct ethtool_rxnfc ring_count;
>   int rxfhindir_equal = 0, rxfhindir_default = 0;
>   struct ethtool_gstrings *hfuncs = NULL;
> @@ -4064,7 +4064,8 @@ static int do_srxfh(struct cmd_context *ctx)
>   hfuncs = get_stringset(ctx, ETH_SS_RSS_HASH_FUNCS, 0, 1);
>   if (!hfuncs) {
>   perror("Cannot get hash functions names");
> - return 1;
> + err = 1;
> + goto free;
>   }
>  
>   for (i = 0; i < hfuncs->len && !req_hfunc ; i++) {
> @@ -4078,8 +4079,8 @@ static int do_srxfh(struct cmd_context *ctx)
>   if (!req_hfunc) {
>   fprintf(stderr,
>   "Unknown hash function: %s\n", req_hfunc_name);
> - free(hfuncs);
> - return 1;
> + err = 1;
> + goto free;
>   }
>   }
>  
> @@ -4120,9 +4121,7 @@ static int do_srxfh(struct cmd_context *ctx)
>   }
>  
>  free:
> - if (hkey)
> - free(hkey);
> -
> + free(hkey);
>   free(rss);
>   free(hfuncs);
>   return err;
> 



[PATCH net] udp: fix rx queue len reported by diag and proc interface

2018-06-08 Thread Paolo Abeni
After commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
the sk_rmem_alloc field no longer measures the receive queue length
exactly, because we batch the rmem release. The issue is really
apparent only after commit 0d4a6608f68c ("udp: do rmem bulk free even
if the rx sk queue is empty"): user space can easily observe an empty
socket with a non-zero queue length reported by the 'ss' tool or the
procfs interface.

We need to use a custom UDP helper to report the correct queue length,
taking into account the forward allocation deficit.

Reported-by: trevor.fran...@46labs.com
Fixes: 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
Signed-off-by: Paolo Abeni 
---
 include/net/transp_v6.h | 11 +--
 include/net/udp.h   |  5 +
 net/ipv4/udp.c  |  2 +-
 net/ipv4/udp_diag.c |  2 +-
 net/ipv6/datagram.c |  6 +++---
 net/ipv6/udp.c  |  3 ++-
 6 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index c4f5caaf3778..f6a3543e5247 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -45,8 +45,15 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk, 
struct msghdr *msg,
  struct flowi6 *fl6, struct ipcm6_cookie *ipc6,
  struct sockcm_cookie *sockc);
 
-void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
-__u16 srcp, __u16 destp, int bucket);
+void __ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
+  __u16 srcp, __u16 destp, int rqueue, int bucket);
+static inline void
+ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp, __u16 srcp,
+   __u16 destp, int bucket)
+{
+   __ip6_dgram_sock_seq_show(seq, sp, srcp, destp, sk_rmem_alloc_get(sp),
+ bucket);
+}
 
 #define LOOPBACK4_IPV6 cpu_to_be32(0x7f000006)
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 7ba0ed252c52..b1ea8b0f5e6a 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -247,6 +247,11 @@ static inline __be16 udp_flow_src_port(struct net *net, 
struct sk_buff *skb,
return htons((((u64) hash * (max - min)) >> 32) + min);
 }
 
+static inline int udp_rqueue_get(struct sock *sk)
+{
+   return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
+}
+
 /* net/ipv4/udp.c */
 void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3365362cac88..9bb27df4dac5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2772,7 +2772,7 @@ static void udp4_format_sock(struct sock *sp, struct 
seq_file *f,
" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d",
bucket, src, srcp, dest, destp, sp->sk_state,
sk_wmem_alloc_get(sp),
-   sk_rmem_alloc_get(sp),
+   udp_rqueue_get(sp),
0, 0L, 0,
from_kuid_munged(seq_user_ns(f), sock_i_uid(sp)),
0, sock_i_ino(sp),
diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
index d0390d844ac8..d9ad986c7b2c 100644
--- a/net/ipv4/udp_diag.c
+++ b/net/ipv4/udp_diag.c
@@ -163,7 +163,7 @@ static int udp_diag_dump_one(struct sk_buff *in_skb, const 
struct nlmsghdr *nlh,
 static void udp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
void *info)
 {
-   r->idiag_rqueue = sk_rmem_alloc_get(sk);
+   r->idiag_rqueue = udp_rqueue_get(sk);
r->idiag_wqueue = sk_wmem_alloc_get(sk);
 }
 
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a02ad100f0d7..2ee08b6a86a4 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -1019,8 +1019,8 @@ int ip6_datagram_send_ctl(struct net *net, struct sock 
*sk,
 }
 EXPORT_SYMBOL_GPL(ip6_datagram_send_ctl);
 
-void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
-__u16 srcp, __u16 destp, int bucket)
+void __ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
+  __u16 srcp, __u16 destp, int rqueue, int bucket)
 {
const struct in6_addr *dest, *src;
 
@@ -1036,7 +1036,7 @@ void ip6_dgram_sock_seq_show(struct seq_file *seq, struct 
sock *sp,
   dest->s6_addr32[2], dest->s6_addr32[3], destp,
   sp->sk_state,
   sk_wmem_alloc_get(sp),
-  sk_rmem_alloc_get(sp),
+  rqueue,
   0, 0L, 0,
   from_kuid_munged(seq_user_ns(seq), sock_i_uid(sp)),
   0,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 164afd31aebf..e6645cae403e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1523,7 +1523,8 @@ int udp6_seq_show(struct seq_file *seq, void *v)
struct inet_sock *inet = inet_sk(v);
__u16 srcp = ntohs(inet->inet_sport);

[PATCH ethtool 6/6] ethtool: remove unreachable code

2018-06-08 Thread Ivan Vecera
The default switch case is unreachable as the MAX_CHANNEL_NUM == 4.

Fixes: a5e73bb ("ethtool:QSFP Plus/QSFP28 Diagnostics Information Support")
Cc: Vidya Sagar Ravipati 
Signed-off-by: Ivan Vecera 
---
 qsfp.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/qsfp.c b/qsfp.c
index aecd5bb..32e195d 100644
--- a/qsfp.c
+++ b/qsfp.c
@@ -661,9 +661,6 @@ static void sff8636_dom_parse(const __u8 *id, struct 
sff_diags *sd)
tx_power_offset = SFF8636_TX_PWR_4_OFFSET;
tx_bias_offset = SFF8636_TX_BIAS_4_OFFSET;
break;
-   default:
-   printf(" Invalid channel: %d\n", i);
-   break;
}
sd->scd[i].bias_cur = OFFSET_TO_U16(tx_bias_offset);
sd->scd[i].rx_power = OFFSET_TO_U16(rx_power_offset);
-- 
2.16.4



[PATCH ethtool 1/6] ethtool: fix uninitialized return value

2018-06-08 Thread Ivan Vecera
Fixes: b0fe96d ("Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE 
and PHY downshift")
Signed-off-by: Ivan Vecera 
---
 ethtool.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 2e87384..e7495fe 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4723,8 +4723,8 @@ static int do_get_phy_tunable(struct cmd_context *ctx)
 {
int argc = ctx->argc;
char **argp = ctx->argp;
-   int err, i;
u8 downshift_changed = 0;
+   int i;
 
if (argc < 1)
exit_bad_args();
@@ -4750,8 +4750,7 @@ static int do_get_phy_tunable(struct cmd_context *ctx)
cont.ds.id = ETHTOOL_PHY_DOWNSHIFT;
cont.ds.type_id = ETHTOOL_TUNABLE_U8;
cont.ds.len = 1;
-   err = send_ioctl(ctx, );
-   if (err < 0) {
+   if (send_ioctl(ctx, ) < 0) {
perror("Cannot Get PHY downshift count");
return 87;
}
@@ -4762,7 +4761,7 @@ static int do_get_phy_tunable(struct cmd_context *ctx)
fprintf(stdout, "Downshift disabled\n");
}
 
-   return err;
+   return 0;
 }
 
 static __u32 parse_reset(char *val, __u32 bitset, char *arg, __u32 *data)
-- 
2.16.4



[PATCH ethtool 3/6] ethtool: remove unused global variable

2018-06-08 Thread Ivan Vecera
Fixes: 2c2ee7a ("ethtool: Add support for sfc register dumps")
Signed-off-by: Ivan Vecera 
---
 sfc.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/sfc.c b/sfc.c
index 9478b38..b4c590f 100644
--- a/sfc.c
+++ b/sfc.c
@@ -3083,9 +3083,6 @@ static const struct efx_nic_reg_field 
efx_nic_reg_fields_TX_PACE[] = {
REGISTER_FIELD_BZ(TX_PACE_SB_AF),
REGISTER_FIELD_BZ(TX_PACE_SB_NOT_AF),
 };
-static const struct efx_nic_reg_field efx_nic_reg_fields_TX_PACE_DROP_QID[] = {
-   REGISTER_FIELD_BZ(TX_PACE_QID_DRP_CNT),
-};
 static const struct efx_nic_reg_field efx_nic_reg_fields_TX_VLAN[] = {
REGISTER_FIELD_BB(TX_VLAN0),
REGISTER_FIELD_BB(TX_VLAN0_PORT0_EN),
-- 
2.16.4



[PATCH ethtool 4/6] ethtool: several fixes in do_gregs()

2018-06-08 Thread Ivan Vecera
- correctly close gregs_dump_file in case of fstat() failure
- check for error from realloc

Fixes: be4c2d0 ("ethtool.c: fix dump_regs heap corruption")
Cc: David Decotigny 
Signed-off-by: Ivan Vecera 
---
 ethtool.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/ethtool.c b/ethtool.c
index e7495fe..2b90984 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3179,17 +3179,26 @@ static int do_gregs(struct cmd_context *ctx)
if (!gregs_dump_raw && gregs_dump_file != NULL) {
/* overwrite reg values from file dump */
FILE *f = fopen(gregs_dump_file, "r");
+   struct ethtool_regs *nregs;
struct stat st;
size_t nread;
 
if (!f || fstat(fileno(f), &st) < 0) {
fprintf(stderr, "Can't open '%s': %s\n",
gregs_dump_file, strerror(errno));
+   if (f)
+   fclose(f);
free(regs);
return 75;
}
 
-   regs = realloc(regs, sizeof(*regs) + st.st_size);
+   nregs = realloc(regs, sizeof(*regs) + st.st_size);
+   if (!nregs) {
+   perror("Cannot allocate memory for register dump");
+   free(regs); /* was not freed by realloc */
+   return 73;
+   }
+   regs = nregs;
regs->len = st.st_size;
nread = fread(regs->data, regs->len, 1, f);
fclose(f);
-- 
2.16.4
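
The realloc part of the fix above follows a common pattern: keep the old
pointer until realloc() is known to have succeeded, because on failure the
original block is still allocated and must be freed by the caller. A minimal
standalone sketch, assuming a hypothetical helper name grow_or_free() that is
not part of ethtool:

```c
#include <stdlib.h>

/*
 * Hypothetical helper (not from ethtool): grow *buf to new_size bytes.
 * On failure the old block is freed, *buf is set to NULL and -1 is
 * returned, so the caller never leaks or reuses a stale pointer.
 */
int grow_or_free(void **buf, size_t new_size)
{
	void *tmp = realloc(*buf, new_size);

	if (!tmp) {
		free(*buf);	/* realloc() left the old block untouched */
		*buf = NULL;
		return -1;
	}
	*buf = tmp;
	return 0;
}
```

Assigning the result of realloc() straight back to the only copy of the
pointer, as the old code did, loses the original block when NULL is returned.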



[PATCH ethtool 5/6] ethtool: correctly free hkey when get_stringset() fails

2018-06-08 Thread Ivan Vecera
Memory allocated for 'hkey' is not freed when
get_stringset(..., ETH_SS_RSS_HASH_FUNCS...) fails.

Fixes: b888f35 ("ethtool: Support for configurable RSS hash function")
Cc: Gal Pressman 
Signed-off-by: Ivan Vecera 
---
 ethtool.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 2b90984..fb93ae8 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3910,7 +3910,7 @@ static int do_srxfhindir(struct cmd_context *ctx, int 
rxfhindir_default,
 static int do_srxfh(struct cmd_context *ctx)
 {
struct ethtool_rxfh rss_head = {0};
-   struct ethtool_rxfh *rss;
+   struct ethtool_rxfh *rss = NULL;
struct ethtool_rxnfc ring_count;
int rxfhindir_equal = 0, rxfhindir_default = 0;
struct ethtool_gstrings *hfuncs = NULL;
@@ -4064,7 +4064,8 @@ static int do_srxfh(struct cmd_context *ctx)
hfuncs = get_stringset(ctx, ETH_SS_RSS_HASH_FUNCS, 0, 1);
if (!hfuncs) {
perror("Cannot get hash functions names");
-   return 1;
+   err = 1;
+   goto free;
}
 
for (i = 0; i < hfuncs->len && !req_hfunc ; i++) {
@@ -4078,8 +4079,8 @@ static int do_srxfh(struct cmd_context *ctx)
if (!req_hfunc) {
fprintf(stderr,
"Unknown hash function: %s\n", req_hfunc_name);
-   free(hfuncs);
-   return 1;
+   err = 1;
+   goto free;
}
}
 
@@ -4120,9 +4121,7 @@ static int do_srxfh(struct cmd_context *ctx)
}
 
 free:
-   if (hkey)
-   free(hkey);
-
+   free(hkey);
free(rss);
free(hfuncs);
return err;
-- 
2.16.4
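
The change above routes every error path through a single cleanup label. A
small standalone sketch of that idiom, with made-up names (do_request(),
fail_step) and sizes that only serve the illustration; it relies on all
pointers starting out as NULL and on free(NULL) being a no-op:

```c
#include <stdlib.h>

/*
 * Hypothetical example (not ethtool code): fail_step simulates an error
 * after the first (1) or second (2) allocation; 0 means no forced error.
 */
int do_request(int fail_step)
{
	char *hkey = NULL, *hfuncs = NULL, *rss = NULL;
	int err = 0;

	hkey = malloc(40);
	if (!hkey || fail_step == 1) {
		err = 1;
		goto out_free;
	}

	hfuncs = malloc(64);
	if (!hfuncs || fail_step == 2) {
		err = 1;
		goto out_free;	/* hkey is still released below */
	}

	rss = malloc(128);
	if (!rss)
		err = 1;

out_free:
	/* free(NULL) is defined to do nothing, so no checks are needed */
	free(hkey);
	free(rss);
	free(hfuncs);
	return err;
}
```

This is also why the patch initializes rss to NULL: the label may now be
reached before rss has been assigned.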



[PATCH ethtool 2/6] ethtool: fix RING_VF assignment

2018-06-08 Thread Ivan Vecera
Fixes: 36ee712 ("ethtool: support queue and VF fields for rxclass filters")
Cc: Jacob Keller 
Signed-off-by: Ivan Vecera 
---
 rxclass.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rxclass.c b/rxclass.c
index ce4b382..42d122d 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -1066,7 +1066,7 @@ static int rxclass_get_val(char *str, unsigned char *p, 
u32 *flags,
val++;
 
*(u64 *)&p[opt->offset] &= ~ETHTOOL_RX_FLOW_SPEC_RING_VF;
-   *(u64 *)&p[opt->offset] = (u64)val << 
ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
+   *(u64 *)&p[opt->offset] |= (u64)val << 
ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
break;
}
case OPT_RING_QUEUE: {
-- 
2.16.4
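
To illustrate why the one-character change matters: the ring cookie packs the
queue index and the VF number into one 64-bit value, so a plain assignment
after clearing the VF bits also wipes the queue bits. A rough standalone
sketch; the mask names and values below are assumptions chosen for the
example (low 32 bits for the queue, the next 8 bits for the VF) rather than
copied from the headers:

```c
#include <stdint.h>

/* Assumed layout for illustration only */
#define SKETCH_RING_VF_MASK	0x000000FF00000000ULL
#define SKETCH_RING_VF_OFF	32

/* Old behaviour: '=' discards whatever queue index was already stored */
uint64_t set_vf_buggy(uint64_t cookie, uint8_t vf)
{
	cookie &= ~SKETCH_RING_VF_MASK;
	cookie = (uint64_t)vf << SKETCH_RING_VF_OFF;
	return cookie;
}

/* Fixed behaviour: clear the VF field, then OR the new value in */
uint64_t set_vf_fixed(uint64_t cookie, uint8_t vf)
{
	cookie &= ~SKETCH_RING_VF_MASK;
	cookie |= (uint64_t)vf << SKETCH_RING_VF_OFF;
	return cookie;
}
```

With a cookie that already holds queue 5, set_vf_fixed(5, 2) keeps the queue
and yields 0x200000005, while set_vf_buggy(5, 2) yields 0x200000000.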



[PATCH v6 net] stmmac: strip all VLAN tag types when kernel 802.1Q support is selected

2018-06-08 Thread Elad Nachman
stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before 
calling napi_gro_receive().

The function assumes VLAN tagged frames are always tagged with 
802.1Q protocol, and assigns ETH_P_8021Q to the skb by hard-coding
the parameter on call to __vlan_hwaccel_put_tag() .

This causes packets not to be passed to the VLAN slave if it was created 
with 802.1AD protocol
(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into 
__vlan_hwaccel_put_tag() instead of using the hard-coded value of
ETH_P_8021Q.

NETIF_F_HW_VLAN_CTAG_RX check was removed and instead the strip action 
is dependent upon a preprocessor define which is defined when 802.1Q 
support is selected in the kernel config. 

NETIF_F_HW_VLAN_STAG_RX feature was added to be in line with the driver's 
actual abilities.

Signed-off-by: Elad Nachman 


---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b65e2d1..707917d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3293,18 +3293,20 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
 static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
 {
-   struct ethhdr *ehdr;
+#ifdef STMMAC_VLAN_TAG_USED
+   struct vlan_ethhdr *veth;
+   __be16 vlan_proto;
u16 vlanid;
 
-   if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
-   NETIF_F_HW_VLAN_CTAG_RX &&
-   !__vlan_get_tag(skb, &vlanid)) {
+   if (!__vlan_get_tag(skb, &vlanid)) {
/* pop the vlan tag */
-   ehdr = (struct ethhdr *)skb->data;
-   memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
+   veth = (struct vlan_ethhdr *)skb->data;
+   vlan_proto = veth->h_vlan_proto;
+   memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
skb_pull(skb, VLAN_HLEN);
-   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
+   __vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
}
+#endif
 }
 
 
@@ -4344,7 +4346,7 @@ int stmmac_dvr_probe(struct device *device,
ndev->watchdog_timeo = msecs_to_jiffies(watchdog);
 #ifdef STMMAC_VLAN_TAG_USED
/* Both mac100 and gmac support receive VLAN tag detection */
-   ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
+   ndev->features |= NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_STAG_RX;
 #endif
priv->msg_enable = netif_msg_init(debug, default_msg_level);
 
-- 
2.7.4
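
The tag-popping step in the patch above can be modelled in user space on a
plain byte buffer. This is only a sketch under assumptions: pop_vlan() and
the SKETCH_* constants are hypothetical names, there is no skb involved, and
the buffer is assumed to hold a complete VLAN-tagged Ethernet header. The
point it shows is that the tag protocol (0x8100 for 802.1Q, 0x88a8 for
802.1ad) is read from the frame before the MAC addresses are moved over the
tag, instead of being hard-coded:

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAC_LEN	6	/* one MAC address */
#define SKETCH_TAG_LEN	4	/* TPID (2 bytes) + TCI (2 bytes) */

/*
 * Hypothetical helper: strip the VLAN tag from 'frame' and return the new
 * start of the frame. The tag protocol and TCI are returned in host order.
 */
uint8_t *pop_vlan(uint8_t *frame, uint16_t *tpid, uint16_t *tci)
{
	/* The tag sits right after the destination and source MAC */
	*tpid = (uint16_t)((frame[12] << 8) | frame[13]);
	*tci = (uint16_t)((frame[14] << 8) | frame[15]);

	/* Slide both MAC addresses forward over the tag (regions overlap) */
	memmove(frame + SKETCH_TAG_LEN, frame, SKETCH_MAC_LEN * 2);
	return frame + SKETCH_TAG_LEN;
}
```

A caller would then hand the returned tpid on together with the VLAN id,
which is what passing vlan_proto to __vlan_hwaccel_put_tag() does in the
patch.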


Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-06-08 Thread Kristian Evensen
Hi,

On Wed, Jun 6, 2018 at 6:03 PM, Tobias Hommel  wrote:
> Sorry no progress until now, I currently do not get time to have a deeper look
> into that. We're back to 4.1.6 right now.

Thanks for letting me know. In the project I am currently involved in,
we unfortunately don't have the option of reverting the kernel, so we
are finding ways to live with the error. We have been looking into the
error a bit more, and have made the following observations:

* First of all, as discussed earlier in the thread, the error is
triggered by dst_orig being NULL. Our current work-around is just to
return from xfrm_lookup if dst_orig is NULL and this seems to work
fine, the error doesn't happen that often (in our use-cases at least).
* The machine we use for testing (and where we first saw the error) is
used as initiator.
* When we compare the logs from Strongswan with the ones from the
kernel, it seems that the error is typically triggered when a tunnel
is torn down/about to come up. We need quite a lot of tunnels for
the error to trigger, usually around 30+. I guess this might point to
some race or some condition not being met when packets are
sent/received.
* We see the error much more frequently when hardware encryption is enabled.
* Yesterday, we upgraded the kernel from 4.14.34 to 4.14.48, and the
error happens much less frequently. I see that 4.14.48 includes
several IPsec fixes (for example the previously mentioned ("xfrm: Fix
a race in the xdst pcpu cache.")).

BR,
Kristian


Re: [PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA

2018-06-08 Thread Simon Horman
On Fri, Jun 08, 2018 at 05:02:31AM +0200, Davide Caratti wrote:
> use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
> netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
> end with '\0' character.
> 
> v2: fix errors in the commit message, thanks Hangbin Liu
> 
> Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.")
> Signed-off-by: Davide Caratti 

Reviewed-by: Simon Horman 
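
For readers following along, the problem class here is copying an attribute
payload that is shorter than the destination and not NUL-terminated. A
user-space sketch of a bounded copy in that spirit; attr_strlcpy() is a
hypothetical stand-in that takes a raw payload pointer and length, not the
kernel helper itself, and its exact semantics are an assumption modelled on
strlcpy-style functions:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical analog of a netlink string copy: never reads past
 * payload_len, never writes past dstsize, and always terminates dst
 * when dstsize > 0. Returns the length of the source string.
 */
size_t attr_strlcpy(char *dst, size_t dstsize,
		    const char *payload, size_t payload_len)
{
	size_t srclen = payload_len;

	/* Ignore a trailing NUL if the sender included one */
	if (srclen > 0 && payload[srclen - 1] == '\0')
		srclen--;

	if (dstsize > 0) {
		size_t n = srclen >= dstsize ? dstsize - 1 : srclen;

		memcpy(dst, payload, n);
		dst[n] = '\0';
	}
	return srclen;
}
```

A return value greater than or equal to dstsize tells the caller that the
string was truncated.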


Re: [PATCH ipsec] vti6: fix PMTU caching and reporting on xmit

2018-06-08 Thread Steffen Klassert
On Thu, Jun 07, 2018 at 10:11:02AM +0300, Eyal Birger wrote:
> When setting the skb->dst before doing the MTU check, the route PMTU
> caching and reporting is done on the new dst which is about to be
> released.
> 
> Instead, PMTU handling should be done using the original dst.
> 
> This is aligned with IPv4 VTI.
> 
> Signed-off-by: Eyal Birger 
> Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.")

Patch applied, thanks for catching this Eyal!


Re: [PATCH net-next v2 0/2] net: phy: improve PHY suspend/resume

2018-06-08 Thread Heiner Kallweit
On 05.06.2018 21:39, Heiner Kallweit wrote:
> On 02.06.2018 22:27, Heiner Kallweit wrote:
>> On 01.06.2018 02:10, Andrew Lunn wrote:
>>>> Configuring the different WoL options isn't handled by writing to
>>>> the PHY registers but by writing to chip / MAC registers.
>>>> Therefore phy_suspend() isn't able to figure out whether WoL is
>>>> enabled or not. Only the parent has the full picture.
>>>
>>> Hi Heiner
>>>
>>> I think you need to look at your different runtime PM domains.  If i
>>> understand the code right, you runtime suspend if there is no
>>> link. But for this to work correctly, your PHY needs to keep working.
>>> You also cannot assume all accesses to the PHY go via the MAC. Some
>>> calls will go direct to the PHY, and they can trigger MDIO bus
>>> accesses.  So i think you need two runtime PM domains. MAC and MDIO
>>> bus.  Maybe just the pll? An MDIO bus is a device, so it can have its
>>> own PM callbacks. It is not clear what you need to resume in order to
>>> make MDIO work.
>>>
>> Thanks for your comments!
>> The actual problem is quite small: I get an error at MDIO suspend,
>> the PHY however is suspended later by the driver's suspend callback
>> anyway. Because the problem is small I'm somewhat reluctant to
>> consider bigger changes like introducing different PM domains.
>>
>> Primary reason for the error is that the network chip is in PCI D3hot
>> at that moment. In addition to that for some of the chips supported by
>> the driver also MDIO-relevant PLL's might be disabled.
>>
>> By the way:
>> When checking PM handling for PHY/MDIO I stumbled across something
>> that can be improved IMO, I'll send a patch for your review.
>>
> I experimented a little and with adding Runtime PM to MDIO bus and
> PHY device I can make it work:
> PHY runtime-resumes before entering suspend and resumes its parent
> (MDIO bus) which then resumes its parent (PCI device).
> However this needs quite some code and is hard to read / understand
> w/o reading through this mail thread.
> 
> And in general I still have doubts this is the right way. Let's
> consider the following scenario:
> 
> A network driver configures WoL in its suspend callback
> (incl. setting the device to wakeup-enabled).
> The suspend callback of the PHY is called before this and therefore
> has no clue that WoL will be configured a little later, with the
> consequence that it will do an unsolicited power-down.
> The network driver then has to detect this and power-up the PHY again.
> This doesn't seem to make much sense and still my best idea is to
> establish a mechanism that a device can state: I orchestrate PM
> of my children.
> 
There's one more way to deal with the issue, an empty dev_pm_domain.
We could do:

struct dev_pm_domain empty = { .ops = NULL };
phydev->mdio.dev.pm_domain = &empty;

This overrides the device_type pm ops, however I wouldn't necessarily
consider it an elegant solution. Do you have an opinion on that?

I also checked device links, however this doesn't seem to be the right
concept in our case.

> Heiner
> 
>>> It might also help if you do the phy_connect in .ndo_open and
>>> disconnect in .ndo_stop. This is a common pattern in drivers. But some
>>> also do it in probe and remove.
>>>
>> Thanks for the hint. I will move phy_connect_direct accordingly.
>>
>>>  Andrew
>>>
>> Heiner
>>
>