Re: [PATCH net] qed: Fix reading stale configuration information

2018-07-04 Thread David Miller
From: Denis Bolotin 
Date: Wed, 4 Jul 2018 17:06:46 +0300

> Configuration information read at driver load can become stale after it is
> updated. Mark information as not valid and re-populate when this happens.
> 
> Signed-off-by: Denis Bolotin 
> Signed-off-by: Ariel Elior 

Applied.


Re: [PATCH net-next] r8169: fix runtime suspend

2018-07-04 Thread David Miller
From: Heiner Kallweit 
Date: Wed, 4 Jul 2018 21:11:29 +0200

> When runtime-suspending we configure WoL without touching saved_wolopts.
> If saved_wolopts == 0 we would wrongly power down the PHY in this case.
> Therefore we have to check the actual chip WoL settings here.
> 
> Fixes: 433f9d0ddcc6 ("r8169: improve saved_wolopts handling")
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-04 Thread David Ahern
On 7/4/18 8:29 PM, David Miller wrote:
> From: Ido Schimmel 
> Date: Thu, 5 Jul 2018 00:10:41 +0300
> 
>> We can have the IPv4/IPv6 code only generate a REPLACE / DELETE
>> notification for routes that are actually used for forwarding and
>> relieve listeners from the need to implement this logic themselves. I
>> think this should work.
> 
> Whilst this could reduce the duplication, I worry that in the long
> term this might end up being error prone.
> 

Duplication of data and duplication of logic are not ideal, especially in
this case where the duplication of both is only to handle one case -
duplicate routes where only the first is programmed. I suspect it will
have to be dealt with at some point (e.g., scaling to a million routes),
but right now there are more important factors to deal with - like the
rtnl_lock. Something to keep in mind for the future.


Re: [PATCH net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-04 Thread David Miller
From: Ido Schimmel 
Date: Thu, 5 Jul 2018 00:10:41 +0300

> We can have the IPv4/IPv6 code only generate a REPLACE / DELETE
> notification for routes that are actually used for forwarding and
> relieve listeners from the need to implement this logic themselves. I
> think this should work.

Whilst this could reduce the duplication, I worry that in the long
term this might end up being error prone.

Just my $0.02


Re: [PATCH net-next] net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()

2018-07-04 Thread David Miller
From: Edward Cree 
Date: Wed, 4 Jul 2018 19:23:50 +0100

> Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
>  the skb, we can't use the list_cut_before() method; we can't even do a
>  list_del(&skb->list) in the drop case, because skb might have already been
>  freed and reused.
> So instead, take each skb off the source list before processing, and add it
>  to the sublist afterwards if it wasn't freed or stolen.
> 
> Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> Fixes: 17266ee93984 ("net: ipv4: listified version of ip_rcv")
> Signed-off-by: Edward Cree 

Applied, thanks Edward.


Re: [PATCH][net-next] net: limit each hash list length to MAX_GRO_SKBS

2018-07-04 Thread David Miller
From: Li RongQing 
Date: Wed,  4 Jul 2018 16:42:48 +0800

> @@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb)
>   return netif_receive_skb_internal(skb);
>  }
>  
> -static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *head,
> +static void __napi_gro_flush_chain(struct napi_struct *napi, int index,
>  bool flush_old)

Hash is usually u32, so please let's make the type consistent.  Make index here
a 'u32' and also make 'i' in napi_gro_flush() a 'u32' too.

> @@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
>   enum gro_result ret;
>   int same_flow;
>   int grow;
> + u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);

Please preserve reverse christmas tree local variable ordering.
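
For illustration only, a sketch of what both review comments amount to
(a hypothetical fragment, not the applied patch):

	/* 'index'/'i' typed as u32 for consistency with the hash value,
	 * and dev_gro_receive()'s locals kept in reverse christmas tree
	 * (longest declaration line first) order.
	 */
	static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
					   bool flush_old);

	/* in dev_gro_receive(), the new local then slots in first: */
	u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
	enum gro_result ret;
	int same_flow;
	int grow;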

Otherwise this looks great.

Thank you.


Re: [PATCH net] net: mv __dev_notify_flags from void to int

2018-07-04 Thread David Miller
From: Hangbin Liu 
Date: Wed,  4 Jul 2018 14:29:46 +0800

> @@ -7062,8 +7073,7 @@ int dev_change_flags(struct net_device *dev, unsigned int flags)
>   return ret;
>  
>   changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags);
> - __dev_notify_flags(dev, old_flags, changes);
> - return ret;
> + return __dev_notify_flags(dev, old_flags, changes);

By doing this, you will lose any error code returned from dev_open().
That is why we return 'ret' instead of plain '0' here.
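
A minimal sketch of one way to keep both values (hypothetical, assuming
__dev_notify_flags() is made to return int as in the patch):

	int err;

	changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags);
	err = __dev_notify_flags(dev, old_flags, changes);
	/* prefer a meaningful 'ret' (e.g. from dev_open()) over 'err' */
	return ret ? ret : err;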


Re: [PATCH net] net: phy: fix flag masking in __set_phy_supported

2018-07-04 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 3 Jul 2018 22:34:54 +0200

> Currently the pause flags are also removed from phydev->supported because
> they're not included in PHY_DEFAULT_FEATURES. I don't think this is
> intended, especially when considering that this function can be called
> via phy_set_max_speed() anywhere in a driver. Change the masking to mask
> out only the values we're going to change. In addition remove the
> misleading comment; the job of this small function is just to adjust the
> supported and advertised speeds.
> 
> Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper")
> Signed-off-by: Heiner Kallweit 

Applied and queued up for -stable.


[PATCH net-next 2/4] vxlan: add new fdb alloc and create helpers

2018-07-04 Thread Roopa Prabhu
From: Roopa Prabhu 

- Add new vxlan_fdb_alloc helper
- rename existing vxlan_fdb_create into vxlan_fdb_update:
because it really creates or updates an existing
fdb entry
- move new fdb creation into a separate vxlan_fdb_create

Main motivation for this change is to introduce the ability
to decouple vxlan fdb creation and notify, used in a later patch.

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 91 -
 1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 601ae17..aa88beb 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -637,9 +637,62 @@ static int vxlan_gro_complete(struct sock *sk, struct sk_buff *skb, int nhoff)
return eth_gro_complete(skb, nhoff + sizeof(struct vxlanhdr));
 }
 
-/* Add new entry to forwarding table -- assumes lock held */
+static struct vxlan_fdb *vxlan_fdb_alloc(struct vxlan_dev *vxlan,
+const u8 *mac, __u16 state,
+__be32 src_vni, __u8 ndm_flags)
+{
+   struct vxlan_fdb *f;
+
+   f = kmalloc(sizeof(*f), GFP_ATOMIC);
+   if (!f)
+   return NULL;
+   f->state = state;
+   f->flags = ndm_flags;
+   f->updated = f->used = jiffies;
+   f->vni = src_vni;
+   INIT_LIST_HEAD(&f->remotes);
+   memcpy(f->eth_addr, mac, ETH_ALEN);
+
+   return f;
+}
+
 static int vxlan_fdb_create(struct vxlan_dev *vxlan,
const u8 *mac, union vxlan_addr *ip,
+   __u16 state, __be16 port, __be32 src_vni,
+   __be32 vni, __u32 ifindex, __u8 ndm_flags,
+   struct vxlan_fdb **fdb)
+{
+   struct vxlan_rdst *rd = NULL;
+   struct vxlan_fdb *f;
+   int rc;
+
+   if (vxlan->cfg.addrmax &&
+   vxlan->addrcnt >= vxlan->cfg.addrmax)
+   return -ENOSPC;
+
+   netdev_dbg(vxlan->dev, "add %pM -> %pIS\n", mac, ip);
+   f = vxlan_fdb_alloc(vxlan, mac, state, src_vni, ndm_flags);
+   if (!f)
+   return -ENOMEM;
+
+   rc = vxlan_fdb_append(f, ip, port, vni, ifindex, &rd);
+   if (rc < 0) {
+   kfree(f);
+   return rc;
+   }
+
+   ++vxlan->addrcnt;
+   hlist_add_head_rcu(&f->hlist,
+  vxlan_fdb_head(vxlan, mac, src_vni));
+
+   *fdb = f;
+
+   return 0;
+}
+
+/* Add new entry to forwarding table -- assumes lock held */
+static int vxlan_fdb_update(struct vxlan_dev *vxlan,
+   const u8 *mac, union vxlan_addr *ip,
__u16 state, __u16 flags,
__be16 port, __be32 src_vni, __be32 vni,
__u32 ifindex, __u8 ndm_flags)
@@ -688,37 +741,17 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
if (!(flags & NLM_F_CREATE))
return -ENOENT;
 
-   if (vxlan->cfg.addrmax &&
-   vxlan->addrcnt >= vxlan->cfg.addrmax)
-   return -ENOSPC;
-
/* Disallow replace to add a multicast entry */
if ((flags & NLM_F_REPLACE) &&
(is_multicast_ether_addr(mac) || is_zero_ether_addr(mac)))
return -EOPNOTSUPP;
 
netdev_dbg(vxlan->dev, "add %pM -> %pIS\n", mac, ip);
-   f = kmalloc(sizeof(*f), GFP_ATOMIC);
-   if (!f)
-   return -ENOMEM;
-
-   notify = 1;
-   f->state = state;
-   f->flags = ndm_flags;
-   f->updated = f->used = jiffies;
-   f->vni = src_vni;
-   INIT_LIST_HEAD(&f->remotes);
-   memcpy(f->eth_addr, mac, ETH_ALEN);
-
-   rc = vxlan_fdb_append(f, ip, port, vni, ifindex, &rd);
-   if (rc < 0) {
-   kfree(f);
+   rc = vxlan_fdb_create(vxlan, mac, ip, state, port, src_vni,
+ vni, ifindex, ndm_flags, &f);
+   if (rc < 0)
return rc;
-   }
-
-   ++vxlan->addrcnt;
-   hlist_add_head_rcu(&f->hlist,
-  vxlan_fdb_head(vxlan, mac, src_vni));
+   notify = 1;
}
 
if (notify) {
@@ -864,7 +897,7 @@ static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
return -EAFNOSUPPORT;
 
spin_lock_bh(&vxlan->hash_lock);
-   err = vxlan_fdb_create(vxlan, addr, &ip, ndm->ndm_state, flags,
+   err = vxlan_fdb_update(vxlan, addr, &ip, ndm->ndm_state, flags,
   port, src_vni, vni, ifindex, ndm->ndm_flags);
spin_unlock_bh(&vxlan->hash_lock);
 
@@ -1007,7 +1040,7 @@ static bool vxlan_snoop(struct net_device *dev,
 
/* close off race between vxlan_flush and incoming packets */
  

[PATCH net-next 4/4] vxlan: fix default fdb entry netlink notify ordering during netdev create

2018-07-04 Thread Roopa Prabhu
From: Roopa Prabhu 

Problem:
In vxlan_newlink, a default fdb entry is added before register_netdev.
The default fdb creation function also notifies user-space of the
fdb entry on the vxlan device which user-space does not know about yet.
(RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).

This patch fixes the user-space netlink notification ordering issue
with the following changes:
- decouple fdb notify from fdb create.
- Move fdb notify after register_netdev.
- Call rtnl_configure_link in vxlan newlink handler to notify
userspace about the newlink before fdb notify and
hence avoiding the user-space race.

Fixes: afbd8bae9c79 ("vxlan: add implicit fdb entry for default destination")
Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 794a9a7..ababba3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3197,6 +3197,7 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 {
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_dev *vxlan = netdev_priv(dev);
+   struct vxlan_fdb *f = NULL;
int err;
 
err = vxlan_dev_configure(net, dev, conf, false, extack);
@@ -3207,27 +3208,38 @@ static int __vxlan_dev_create(struct net *net, struct net_device *dev,
 
/* create an fdb entry for a valid default destination */
if (!vxlan_addr_any(&vxlan->default_dst.remote_ip)) {
-   err = vxlan_fdb_update(vxlan, all_zeros_mac,
+   err = vxlan_fdb_create(vxlan, all_zeros_mac,
   &vxlan->default_dst.remote_ip,
   NUD_REACHABLE | NUD_PERMANENT,
-  NLM_F_EXCL | NLM_F_CREATE,
   vxlan->cfg.dst_port,
   vxlan->default_dst.remote_vni,
   vxlan->default_dst.remote_vni,
   vxlan->default_dst.remote_ifindex,
-  NTF_SELF);
+  NTF_SELF, &f);
if (err)
return err;
}
 
err = register_netdevice(dev);
+   if (err)
+   goto errout;
+
+   err = rtnl_configure_link(dev, NULL);
if (err) {
-   vxlan_fdb_delete_default(vxlan, vxlan->default_dst.remote_vni);
-   return err;
+   unregister_netdevice(dev);
+   goto errout;
}
 
+   /* notify default fdb entry */
+   if (f)
+   vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f), RTM_NEWNEIGH);
+
list_add(&vxlan->next, &vn->vxlan_list);
return 0;
+errout:
+   if (f)
+   vxlan_fdb_destroy(vxlan, f, false);
+   return err;
 }
 
 static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
@@ -3462,6 +3474,7 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
struct vxlan_rdst *dst = &vxlan->default_dst;
struct vxlan_rdst old_dst;
struct vxlan_config conf;
+   struct vxlan_fdb *f = NULL;
int err;
 
err = vxlan_nl2conf(tb, data,
@@ -3487,19 +3500,19 @@ static int vxlan_changelink(struct net_device *dev, struct nlattr *tb[],
   old_dst.remote_ifindex, 0);
 
if (!vxlan_addr_any(&dst->remote_ip)) {
-   err = vxlan_fdb_update(vxlan, all_zeros_mac,
+   err = vxlan_fdb_create(vxlan, all_zeros_mac,
   &dst->remote_ip,
   NUD_REACHABLE | NUD_PERMANENT,
-  NLM_F_CREATE | NLM_F_APPEND,
   vxlan->cfg.dst_port,
   dst->remote_vni,
   dst->remote_vni,
   dst->remote_ifindex,
-  NTF_SELF);
+  NTF_SELF, &f);
if (err) {
spin_unlock_bh(&vxlan->hash_lock);
return err;
}
+   vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f), RTM_NEWNEIGH);
}
spin_unlock_bh(&vxlan->hash_lock);
}
-- 
2.1.4



[PATCH net-next 3/4] vxlan: make netlink notify in vxlan_fdb_destroy optional

2018-07-04 Thread Roopa Prabhu
From: Roopa Prabhu 

Add a new option do_notify to vxlan_fdb_destroy to make
sending netlink notify optional. Used by a later patch.

Signed-off-by: Roopa Prabhu 
---
 drivers/net/vxlan.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index aa88beb..794a9a7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -775,13 +775,15 @@ static void vxlan_fdb_free(struct rcu_head *head)
kfree(f);
 }
 
-static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f)
+static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f,
+ bool do_notify)
 {
netdev_dbg(vxlan->dev,
"delete %pM\n", f->eth_addr);
 
--vxlan->addrcnt;
-   vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f), RTM_DELNEIGH);
+   if (do_notify)
+   vxlan_fdb_notify(vxlan, f, first_remote_rtnl(f), RTM_DELNEIGH);
 
hlist_del_rcu(&f->hlist);
call_rcu(&f->rcu, vxlan_fdb_free);
@@ -931,7 +933,7 @@ static int __vxlan_fdb_delete(struct vxlan_dev *vxlan,
goto out;
}
 
-   vxlan_fdb_destroy(vxlan, f);
+   vxlan_fdb_destroy(vxlan, f, true);
 
 out:
return 0;
@@ -2399,7 +2401,7 @@ static void vxlan_cleanup(struct timer_list *t)
   "garbage collect %pM\n",
   f->eth_addr);
f->state = NUD_STALE;
-   vxlan_fdb_destroy(vxlan, f);
+   vxlan_fdb_destroy(vxlan, f, true);
} else if (time_before(timeout, next_timer))
next_timer = timeout;
}
@@ -2450,7 +2452,7 @@ static void vxlan_fdb_delete_default(struct vxlan_dev *vxlan, __be32 vni)
spin_lock_bh(&vxlan->hash_lock);
f = __vxlan_find_mac(vxlan, all_zeros_mac, vni);
if (f)
-   vxlan_fdb_destroy(vxlan, f);
+   vxlan_fdb_destroy(vxlan, f, true);
spin_unlock_bh(&vxlan->hash_lock);
 }
 
@@ -2504,7 +2506,7 @@ static void vxlan_flush(struct vxlan_dev *vxlan, bool do_all)
continue;
/* the all_zeros_mac entry is deleted at vxlan_uninit */
if (!is_zero_ether_addr(f->eth_addr))
-   vxlan_fdb_destroy(vxlan, f);
+   vxlan_fdb_destroy(vxlan, f, true);
}
}
spin_unlock_bh(&vxlan->hash_lock);
-- 
2.1.4



[PATCH net-next 1/4] rtnetlink: add rtnl_link_state check in rtnl_configure_link

2018-07-04 Thread Roopa Prabhu
From: Roopa Prabhu 

rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.

current call sequence for rtnl_configure_link:
    rtnetlink_newlink
        rtnl_link_ops->newlink
        rtnl_configure_link (unconditionally notifies userspace of
                             default and new dev flags)

If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.

This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.

Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.

makes the following call sequence work:
    rtnetlink_newlink
        rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link
            (initializes link and notifies user-space of default dev flags)
        rtnl_configure_link (updates dev flags if requested by user ifm
                             and notifies user-space of new dev flags)

Signed-off-by: Roopa Prabhu 
---
 net/core/rtnetlink.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5ef6122..e3f743c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2759,9 +2759,12 @@ int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm)
return err;
}
 
-   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
-
-   __dev_notify_flags(dev, old_flags, ~0U);
+   if (dev->rtnl_link_state == RTNL_LINK_INITIALIZED) {
+   __dev_notify_flags(dev, old_flags, 0U);
+   } else {
+   dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
+   __dev_notify_flags(dev, old_flags, ~0U);
+   }
return 0;
 }
 EXPORT_SYMBOL(rtnl_configure_link);
-- 
2.1.4



[PATCH net-next 0/4] vxlan: fix default fdb entry user-space notify ordering/race

2018-07-04 Thread Roopa Prabhu
From: Roopa Prabhu 

Problem:
In vxlan_newlink, a default fdb entry is added before register_netdev.
The default fdb creation function notifies user-space of the
fdb entry on the vxlan device which user-space does not know about yet.
(RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).

This series fixes the user-space netlink notification ordering issue
with the following changes:
- decouple fdb notify from fdb create.
- Move fdb notify after register_netdev.
- modify rtnl_configure_link to allow configuring a link early.
- Call rtnl_configure_link in vxlan newlink handler to notify
userspace about the newlink before fdb notify and
hence avoiding the user-space race.

Fixes: afbd8bae9c79 ("vxlan: add implicit fdb entry for default destination")
Signed-off-by: Roopa Prabhu 

Roopa Prabhu (4):
  vxlan: add new fdb alloc and create helpers
  vxlan: make netlink notify in vxlan_fdb_destroy optional
  rtnetlink: add rtnl_link_state check in rtnl_configure_link
  vxlan: fix default fdb netlink notify ordering during netdev create

 drivers/net/vxlan.c  | 130 ++-
 net/core/rtnetlink.c |   9 ++--
 2 files changed, 94 insertions(+), 45 deletions(-)

-- 
2.1.4



Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1

2018-07-04 Thread Russell King - ARM Linux
Subject says offlist, but this isn't...

On Wed, Jul 04, 2018 at 08:33:20AM +0100, Peter Robinson wrote:
> Sorry for the delay on this from my end. I noticed there was some bpf
> bits land in the last net fixes pull request landed Monday so I built
> a kernel with the JIT reenabled. It seems it's improved in that the
> completely dead no output boot has gone but the original problem that
> arrived in the merge window still persists:
> 
> [   17.564142] note: systemd-udevd[194] exited with preempt_count 1
> [   17.592739] Unable to handle kernel NULL pointer dereference at
> virtual address 000c
> [   17.601002] pgd = (ptrval)
> [   17.603819] [000c] *pgd=
> [   17.607487] Internal error: Oops: 805 [#10] SMP ARM
> [   17.612396] Modules linked in:
> [   17.615484] CPU: 0 PID: 195 Comm: systemd-udevd Tainted: G  D
> 4.18.0-0.rc3.git1.1.bpf1.fc29.armv7hl #1
> [   17.626056] Hardware name: Generic AM33XX (Flattened Device Tree)
> [   17.632198] PC is at sk_filter_trim_cap+0x218/0x2fc
> [   17.637102] LR is at   (null)
> [   17.640086] pc : []lr : [<>]psr: 6013
> [   17.646384] sp : cfe1dd48  ip :   fp : 
> [   17.651635] r10: d837e000  r9 : d833be00  r8 : 
> [   17.656887] r7 : 0001  r6 : e003d000  r5 :   r4 : 
> [   17.663447] r3 : 0007  r2 :   r1 :   r0 : 
> [   17.670009] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
> none
> [   17.677180] Control: 10c5387d  Table: 8fe20019  DAC: 0051
> [   17.682956] Process systemd-udevd (pid: 195, stack limit = 0x(ptrval))
> [   17.689518] Stack: (0xcfe1dd48 to 0xcfe1e000)

Can you provide a full disassembly of sk_filter_trim_cap from vmlinux
(iow, annotated with its linked address) for the above dump please -
alternatively a new dump with matching disassembly.  Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up


Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1

2018-07-04 Thread Daniel Borkmann
On 07/04/2018 09:33 AM, Peter Robinson wrote:
> On Tue, Jun 26, 2018 at 1:52 PM, Daniel Borkmann  wrote:
>> On 06/26/2018 02:23 PM, Peter Robinson wrote:
>> On 06/24/2018 11:24 AM, Peter Robinson wrote:
> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
> others, both LPAE/normal kernels.
>>
>> So this is arm32 right?
>
> Correct.
>
> I'm a bit out of my depth in this part of the kernel but I'm wondering
> if it's known, I couldn't find anything that looked obvious on a few
> mailing lists.
>
> Peter

 Hi Peter

 Could you provide symbolic information ?
>>>
>>> I passed in through scripts/decode_stacktrace.sh is that what you were 
>>> after:
>>>
>>> [8.673880] Internal error: Oops: a06 [#10] SMP ARM
>>> [8.673949] ---[ end trace 049df4786ea3140a ]---
>>> [8.678754] Modules linked in:
>>> [8.678766] CPU: 1 PID: 206 Comm: systemd-udevd Tainted: G  D
>>> 4.18.0-0.rc1.git0.1.fc29.armv7hl+lpae #1
>>> [8.678769] Hardware name: Allwinner sun8i Family
>>> [8.678781] PC is at sk_filter_trim_cap ()
>>> [8.678790] LR is at   (null)
>>> [8.709463] pc : lr : psr: 6013 ()
>>> [8.715722] sp : c996bd60  ip :   fp : 
>>> [8.720939] r10: ee79dc00  r9 : c12c9f80  r8 : 
>>> [8.726157] r7 :   r6 : 0001  r5 : f1648000  r4 : 
>>> 
>>> [8.732674] r3 : 0007  r2 :   r1 :   r0 : 
>>> 
>>> [8.739193] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
>>> Segment user
>>> [8.746318] Control: 30c5387d  Table: 6e7bc880  DAC: ffe75ece
>>> [8.752055] Process systemd-udevd (pid: 206, stack limit = 
>>> 0x(ptrval))
>>> [8.758574] Stack: (0xc996bd60 to 0xc996c000)
>>
>> Do you have BPF JIT enabled or disabled? Does it happen with disabled?
>
> Enabled, I can test with it disabled, BPF configs bits are:
> CONFIG_BPF_EVENTS=y
> # CONFIG_BPFILTER is not set
> CONFIG_BPF_JIT_ALWAYS_ON=y
> CONFIG_BPF_JIT=y
> CONFIG_BPF_STREAM_PARSER=y
> CONFIG_BPF_SYSCALL=y
> CONFIG_BPF=y
> CONFIG_CGROUP_BPF=y
> CONFIG_HAVE_EBPF_JIT=y
> CONFIG_IPV6_SEG6_BPF=y
> CONFIG_LWTUNNEL_BPF=y
> # CONFIG_NBPFAXI_DMA is not set
> CONFIG_NET_ACT_BPF=m
> CONFIG_NET_CLS_BPF=m
> CONFIG_NETFILTER_XT_MATCH_BPF=m
> # CONFIG_TEST_BPF is not set
>
>> I can see one bug, but your stack trace seems unrelated.
>>
>> Anyway, could you try with this?
>
> Build in process.
>
>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>> index 6e8b716..f6a62ae 100644
>> --- a/arch/arm/net/bpf_jit_32.c
>> +++ b/arch/arm/net/bpf_jit_32.c
>> @@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct 
>> bpf_prog *prog)
>> /* there are 2 passes here */
>> bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>>
>> -   set_memory_ro((unsigned long)header, header->pages);
>> +   bpf_jit_binary_lock_ro(header);
>> prog->bpf_func = (void *)ctx.target;
>> prog->jited = 1;
>> prog->jited_len = image_size;

 So with that and the other fix there was no improvement, with those
 and the BPF JIT disabled it works, I'm not sure if the two patches
 have any effect with the JIT disabled though.

 Will look at the other patches shortly, there's been some other issue
 introduced between rc1 and rc2 which I have to work out before I can
 test those though.
>>>
>>> Quick update, with linus's head as of yesterday, basically rc2 plus
>>> davem's network fixes it works if the JIT is disabled IE:
>>> # CONFIG_BPF_JIT_ALWAYS_ON is not set
>>> # CONFIG_BPF_JIT is not set
>>>
>>> If I enable it the boot breaks even worse than the errors above in
>>> that I get no console output at all, even with earlycon, so we've gone
>>> backwards since rc1 somehow.
>>>
>>> I'll try the above two reverted unless you have any other suggestions.
>>
>> Ok, thanks, lets do that!
>>
>> I'm still working on fixes meanwhile, should have something by end of day.
> 
> Sorry for the delay on this from my end. I noticed there was some bpf
> bits land in the last net fixes pull request landed Monday so I built
> a kernel with the JIT reenabled. It seems it's improved in that the
> completely dead no output boot has gone but the original problem that
> arrived in the merge window still persists:

Okay, thanks a lot! And on top of that tree could you try with the below
applied to check whether it fixes the issue?

diff --git 

Re: [PATCH net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-04 Thread Ido Schimmel
On Tue, Jul 03, 2018 at 02:02:06PM -0600, David Ahern wrote:
> It is unfortunate that mlxsw has to replicate the node lookup code.

The kernel can store multiple routes with the same prefix/length, but
only one is used for forwarding. Thus, when a route is deleted, it may
need to be overwritten by a different route in the device's tables.
This is why mlxsw stores these nodes.

We can have the IPv4/IPv6 code only generate a REPLACE / DELETE
notification for routes that are actually used for forwarding and
relieve listeners from the need to implement this logic themselves. I
think this should work.



Re: [PATCH net-next 2/5 v2] net: gemini: Improve connection prints

2018-07-04 Thread Andrew Lunn
On Wed, Jul 04, 2018 at 08:33:21PM +0200, Linus Walleij wrote:
> Switch over to using a module parameter and debug prints
> that can be controlled by this or ethtool like everyone
> else. Depromote all other prints to debug messages.
> 
> Signed-off-by: Linus Walleij 
> ---
> ChangeLog v1->v2:
> - Use a module parameter and the message levels like all
>   other drivers and stop trying to be special.
> ---
>  drivers/net/ethernet/cortina/gemini.c | 44 +++
>  1 file changed, 25 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cortina/gemini.c 
> b/drivers/net/ethernet/cortina/gemini.c
> index 8fc31723f700..f219208d2351 100644
> --- a/drivers/net/ethernet/cortina/gemini.c
> +++ b/drivers/net/ethernet/cortina/gemini.c
> @@ -46,6 +46,11 @@
>  #define DRV_NAME "gmac-gemini"
>  #define DRV_VERSION  "1.0"
>  
> +#define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
> +static int debug = -1;
> +module_param(debug, int, 0);
> +MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
> +
>  #define HSIZE_8  0x00
>  #define HSIZE_16 0x01
>  #define HSIZE_32 0x02
> @@ -300,23 +305,26 @@ static void gmac_speed_set(struct net_device *netdev)
>   status.bits.speed = GMAC_SPEED_1000;
>   if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
>   status.bits.mii_rmii = GMAC_PHY_RGMII_1000;
> - netdev_info(netdev, "connect to RGMII @ 1Gbit\n");
> + netdev_dbg(netdev, "connect %s to RGMII @ 1Gbit\n",
> +phydev_name(phydev));

Hi Linus

Since these are netdev_dbg, they will generally never be seen. So
could you add a call to phy_print_status() at the end of this
function? That is what most MAC drivers do.
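
As a rough sketch of that pattern (hypothetical, not part of the
patch), gmac_speed_set() would end with phylib doing the logging:

	static void gmac_speed_set(struct net_device *netdev)
	{
		struct phy_device *phydev = netdev->phydev;

		/* ... program MAC speed/duplex from phydev as above ... */

		phy_print_status(phydev);	/* logs "Link is Up/Down ..." */
	}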

> - netdev_info(netdev, "connected to PHY \"%s\"\n",
> - phydev_name(phy));
> - phy_attached_print(phy, "phy_id=0x%.8lx, phy_mode=%s\n",
> -(unsigned long)phy->phy_id,
> -phy_modes(phy->interface));
> + netdev_dbg(netdev, "connected to PHY \"%s\"\n",
> +phydev_name(phy));
>  

It would be nice to call phy_attached_info(), as most other MAC
drivers do.

Andrew


Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-04 Thread Ido Schimmel
On Thu, Jul 05, 2018 at 01:56:23AM +0800, Xin Long wrote:
> On Wed, Jul 4, 2018 at 3:23 AM, David Ahern  wrote:
> > your commands are not a proper test. The test should succeed and fail
> > based on the routing lookup, not iptables rules.
> A proper test can be done easily with netns, as vrf can't isolate much.
> I don't want to bother forwarding/ directory with netns, so I will probably
> just drop this selftest, and let the feature patch go first.
> 
> What do you think?

You can add a tc rule on the ingress of h2 and make sure that in the
first case ping succeeds and the tc rule wasn't hit. In the second case
ping should also succeed, but the tc rule should be hit. This is similar
to your original netns test.

You can look at tc_flower.sh for reference and in particular at
tc_check_packets().


Re: [PATCH net-next 5/5 v2] net: gemini: Indicate that we can handle jumboframes

2018-07-04 Thread Andrew Lunn
On Wed, Jul 04, 2018 at 08:33:24PM +0200, Linus Walleij wrote:
> The hardware supposedly handles frames up to 10236 bytes and
> implements .ndo_change_mtu() so accept 10236 minus the ethernet
> header for a VLAN tagged frame on the netdevices. Use
> ETH_MIN_MTU as minimum MTU.
> 
> Signed-off-by: Linus Walleij 

Hi Linus

Did you try with an MTU of 68? Maybe the vendor picked 256 because of
a hardware limit?

Otherwise the change looks good.

  Andrew


Re: [PATCH net] net: phy: fix flag masking in __set_phy_supported

2018-07-04 Thread Florian Fainelli
On July 3, 2018 10:34:54 PM GMT+02:00, Heiner Kallweit  wrote:
>Currently the pause flags are also removed from phydev->supported
>because they're not included in PHY_DEFAULT_FEATURES. I don't think
>this is intended, especially when considering that this function can be
>called via phy_set_max_speed() anywhere in a driver. Change the masking
>to mask out only the values we're going to change. In addition remove
>the misleading comment; the job of this small function is just to
>adjust the supported and advertised speeds.
>
>Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper")
>Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 

-- 
Florian


Re: [PATCH rdma-next 0/3] Dump and fill MKEY

2018-07-04 Thread Jason Gunthorpe
On Wed, Jul 04, 2018 at 09:54:59PM +0300, Leon Romanovsky wrote:
> On Wed, Jul 04, 2018 at 11:47:39AM -0600, Jason Gunthorpe wrote:
> > On Tue, Jun 19, 2018 at 08:47:21AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky 
> > >
> > > MLX5 IB HCA offers the memory key, dump_fill_mkey to increase
> > > performance, when used in a send or receive operations.
> > >
> > > It is used to force local HCA operations to skip the PCI bus access,
> > > while keeping track of the processed length in the ibv_sge handling.
> > >
> > > In this three patch series, we expose various bits in our HW
> > > spec file (mlx5_ifc.h), move an FW command that is unneeded by
> > > mlx5_core, and export such memory key to user space through our
> > > mlx5-abi header file.
> > >
> > > Thanks
> >
> > This looks fine, can you send a pull request off the mlx5 branch
> > please?
> 
> Updated mlx5-next with first two commits,
> b183ee27f5fb net/mlx5: Add hardware definitions for dump_fill_mkey
> 4d4fb5dc988a net/mlx5: Limit scope of dump_fill_mkey function

Okay, applied to for-next, with the missing 'if (err)' fixed.

Thanks,
Jason


Re: [PATCH rdma-next 3/3] IB/mlx5: Expose dump and fill memory key

2018-07-04 Thread Leon Romanovsky
On Wed, Jul 04, 2018 at 01:09:37PM -0600, Jason Gunthorpe wrote:
> On Tue, Jun 19, 2018 at 08:47:24AM +0300, Leon Romanovsky wrote:
> > From: Yonatan Cohen 
> >
> > MLX5 IB HCA offers the memory key, dump_fill_mkey to boost
> > performance, when used in a send or receive operations.
> >
> > It is used to force local HCA operations to skip the PCI bus access,
> > while keeping track of the processed length in the ibv_sge handling.
> >
> > Meaning, instead of a PCI write access the HCA leaves the target
> > memory untouched, and skips filling that packet section. Similar
> > behavior is done upon send, the HCA skips data in memory relevant
> > to this key and saves PCI bus access.
> >
> > This functionality saves PCI read/write operations.
> >
> > Signed-off-by: Yonatan Cohen 
> > Reviewed-by: Yishai Hadas 
> > Reviewed-by: Guy Levi 
> > Signed-off-by: Leon Romanovsky 
> >  drivers/infiniband/hw/mlx5/main.c | 16 +++-
> >  include/uapi/rdma/mlx5-abi.h  |  3 ++-
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> > index c29c7c838980..97113957398d 100644
> > +++ b/drivers/infiniband/hw/mlx5/main.c
> > @@ -1634,6 +1634,7 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
> > int err;
> > size_t min_req_v2 = offsetof(struct mlx5_ib_alloc_ucontext_req_v2,
> >  max_cqe_version);
> > +   u32 dump_fill_mkey;
> > bool lib_uar_4k;
> >
> > if (!dev->ib_active)
> > @@ -1743,8 +1744,12 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
> > }
> >
> > err = mlx5_ib_devx_create(dev, context);
> > +   }
> > +
> > +   if (MLX5_CAP_GEN(dev->mdev, dump_fill_mkey)) {
> > +   err = mlx5_cmd_dump_fill_mkey(dev->mdev, &dump_fill_mkey);
> > if (err)
> > -   goto out_td;
> > +   goto out_mdev;
> > }
>
> Dropping the if (err) after mlx5_ib_devx_create is a rebasing error,
> right?

Sorry, you are right, the fixup is pretty straightforward.

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 2bbafee6976c..71f3e9677622 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1739,6 +1739,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
 	}
 
 	err = mlx5_ib_devx_create(dev, context);
+	if (err)
+		goto out_td;
 	}
 
 	if (MLX5_CAP_GEN(dev->mdev, dump_fill_mkey)) {


>
> Jason


signature.asc
Description: PGP signature


[PATCH net-next] r8169: fix runtime suspend

2018-07-04 Thread Heiner Kallweit
When runtime-suspending we configure WoL without touching saved_wolopts.
If saved_wolopts == 0 we would wrongly power down the PHY in this case.
Therefore we have to check the actual chip WoL settings here.

Fixes: 433f9d0ddcc6 ("r8169: improve saved_wolopts handling")
Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index f80ac894..d598fdf0 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1534,12 +1534,6 @@ static void rtl8169_check_link_status(struct net_device *dev,
 
 #define WAKE_ANY (WAKE_PHY | WAKE_MAGIC | WAKE_UCAST | WAKE_BCAST | WAKE_MCAST)
 
-/* Currently we only enable WoL if explicitly told by userspace to circumvent
- * issues on certain platforms, see commit bde135a672bf ("r8169: only enable
- * PCI wakeups when WOL is active"). Let's keep __rtl8169_get_wol() for the
- * case that we want to respect BIOS settings again.
- */
-#if 0
 static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
 {
u8 options;
@@ -1574,7 +1568,6 @@ static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
 
return wolopts;
 }
-#endif
 
 static void rtl8169_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
@@ -4470,7 +4463,7 @@ static void rtl_wol_suspend_quirk(struct rtl8169_private *tp)
 
 static bool rtl_wol_pll_power_down(struct rtl8169_private *tp)
 {
-   if (!netif_running(tp->dev) || !tp->saved_wolopts)
+   if (!netif_running(tp->dev) || !__rtl8169_get_wol(tp))
return false;
 
rtl_speed_down(tp);
-- 
2.18.0



Re: [PATCH rdma-next 3/3] IB/mlx5: Expose dump and fill memory key

2018-07-04 Thread Jason Gunthorpe
On Tue, Jun 19, 2018 at 08:47:24AM +0300, Leon Romanovsky wrote:
> From: Yonatan Cohen 
> 
> MLX5 IB HCA offers the memory key, dump_fill_mkey to boost
> performance, when used in a send or receive operations.
> 
> It is used to force local HCA operations to skip the PCI bus access,
> while keeping track of the processed length in the ibv_sge handling.
> 
> Meaning, instead of a PCI write access the HCA leaves the target
> memory untouched, and skips filling that packet section. Similar
> behavior is done upon send, the HCA skips data in memory relevant
> to this key and saves PCI bus access.
> 
> This functionality saves PCI read/write operations.
> 
> Signed-off-by: Yonatan Cohen 
> Reviewed-by: Yishai Hadas 
> Reviewed-by: Guy Levi 
> Signed-off-by: Leon Romanovsky 
>  drivers/infiniband/hw/mlx5/main.c | 16 +++-
>  include/uapi/rdma/mlx5-abi.h  |  3 ++-
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index c29c7c838980..97113957398d 100644
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -1634,6 +1634,7 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
>   int err;
>   size_t min_req_v2 = offsetof(struct mlx5_ib_alloc_ucontext_req_v2,
>max_cqe_version);
> + u32 dump_fill_mkey;
>   bool lib_uar_4k;
>  
>   if (!dev->ib_active)
> @@ -1743,8 +1744,12 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
>   }
>  
>   err = mlx5_ib_devx_create(dev, context);
> + }
> +
> + if (MLX5_CAP_GEN(dev->mdev, dump_fill_mkey)) {
> + err = mlx5_cmd_dump_fill_mkey(dev->mdev, &dump_fill_mkey);
>   if (err)
> - goto out_td;
> + goto out_mdev;
>   }

Dropping the if (err) after mlx5_ib_devx_create is a rebasing error,
right?

Jason


Re: [PATCH rdma-next 0/3] Dump and fill MKEY

2018-07-04 Thread Leon Romanovsky
On Wed, Jul 04, 2018 at 11:47:39AM -0600, Jason Gunthorpe wrote:
> On Tue, Jun 19, 2018 at 08:47:21AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > MLX5 IB HCA offers the memory key, dump_fill_mkey to increase
> > performance, when used in a send or receive operations.
> >
> > It is used to force local HCA operations to skip the PCI bus access,
> > while keeping track of the processed length in the ibv_sge handling.
> >
> > In this three patch series, we expose various bits in our HW
> > spec file (mlx5_ifc.h), move an FW command that is unneeded by
> > mlx5_core, and export such memory key to user space through our
> > mlx5-abi header file.
> >
> > Thanks
>
> This looks fine, can you send a pull request off the mlx5 branch
> please?

Updated mlx5-next with first two commits,
b183ee27f5fb net/mlx5: Add hardware definitions for dump_fill_mkey
4d4fb5dc988a net/mlx5: Limit scope of dump_fill_mkey function

Thanks

>
> Thanks,
> Jason


signature.asc
Description: PGP signature


[PATCH net-next 5/5 v2] net: gemini: Indicate that we can handle jumboframes

2018-07-04 Thread Linus Walleij
The hardware supposedly handles frames up to 10236 bytes and
implements .ndo_change_mtu() so accept 10236 minus the ethernet
header for a VLAN tagged frame on the netdevices. Use
ETH_MIN_MTU as minimum MTU.

Signed-off-by: Linus Walleij 
---
ChangeLog v1->v2:
- Change the min MTU from 256 (vendor code) to ETH_MIN_MTU
  which makes more sense.
---
 drivers/net/ethernet/cortina/gemini.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 4e341570047f..af38f9869734 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -2476,6 +2476,11 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
 
netdev->hw_features = GMAC_OFFLOAD_FEATURES;
netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO;
+   /* We can handle jumbo frames up to 10236 bytes so, let's accept
+* payloads of 10236 bytes minus VLAN and ethernet header
+*/
+   netdev->min_mtu = ETH_MIN_MTU;
+   netdev->max_mtu = 10236 - VLAN_ETH_HLEN;
 
port->freeq_refill = 0;
netif_napi_add(netdev, &port->napi, gmac_napi_poll,
-- 
2.17.1



[PATCH net-next 3/5 v2] net: gemini: Allow multiple ports to instantiate

2018-07-04 Thread Linus Walleij
The code was not tested with two ports actually in use at
the same time. (I blame this on lack of actual hardware using
that feature.) Now after locating a system using both ports,
add necessary fix to make both ports come up.

Signed-off-by: Linus Walleij 
---
ChangeLog v1->v2:
- No changes, just resending with the rest.
---
 drivers/net/ethernet/cortina/gemini.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index f219208d2351..6b5aa5704c4f 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -1789,7 +1789,10 @@ static int gmac_open(struct net_device *netdev)
phy_start(netdev->phydev);
 
err = geth_resize_freeq(port);
-   if (err) {
+   /* It's fine if it's just busy, the other port has set up
+* the freeq in that case.
+*/
+   if (err && (err != -EBUSY)) {
netdev_err(netdev, "could not resize freeq\n");
goto err_stop_phy;
}
-- 
2.17.1



[PATCH net-next 4/5 v2] net: gemini: Move main init to port

2018-07-04 Thread Linus Walleij
The initialization sequence for the ethernet, setting up
interrupt routing and such things, need to be done after
both the ports are clocked and reset. Before this the
config will not "take". Move the initialization to the
port probe function and keep track of init status in
the state.

Signed-off-by: Linus Walleij 
---
ChangeLog v1->v2:
- No changes, just resending with the rest.
---
 drivers/net/ethernet/cortina/gemini.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 6b5aa5704c4f..4e341570047f 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -151,6 +151,7 @@ struct gemini_ethernet {
void __iomem *base;
struct gemini_ethernet_port *port0;
struct gemini_ethernet_port *port1;
+   bool initialized;
 
spinlock_t  irq_lock; /* Locks IRQ-related registers */
unsigned intfreeq_order;
@@ -2303,6 +2304,14 @@ static void gemini_port_remove(struct gemini_ethernet_port *port)
 
 static void gemini_ethernet_init(struct gemini_ethernet *geth)
 {
+   /* Only do this once both ports are online */
+   if (geth->initialized)
+   return;
+   if (geth->port0 && geth->port1)
+   geth->initialized = true;
+   else
+   return;
+
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_0_REG);
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_1_REG);
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_2_REG);
@@ -2450,6 +2459,10 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
geth->port0 = port;
else
geth->port1 = port;
+
+   /* This will just be done once both ports are up and reset */
+   gemini_ethernet_init(geth);
+
platform_set_drvdata(pdev, port);
 
/* Set up and register the netdev */
@@ -2567,7 +2580,6 @@ static int gemini_ethernet_probe(struct platform_device *pdev)
 
spin_lock_init(&geth->irq_lock);
spin_lock_init(&geth->freeq_lock);
-   gemini_ethernet_init(geth);
 
/* The children will use this */
platform_set_drvdata(pdev, geth);
@@ -2580,8 +2592,8 @@ static int gemini_ethernet_remove(struct platform_device *pdev)
 {
struct gemini_ethernet *geth = platform_get_drvdata(pdev);
 
-   gemini_ethernet_init(geth);
geth_cleanup_freeq(geth);
+   geth->initialized = false;
 
return 0;
 }
-- 
2.17.1



[PATCH net-next 2/5 v2] net: gemini: Improve connection prints

2018-07-04 Thread Linus Walleij
Switch over to using a module parameter and debug prints
that can be controlled by this or ethtool like everyone
else. Depromote all other prints to debug messages.

Signed-off-by: Linus Walleij 
---
ChangeLog v1->v2:
- Use a module parameter and the message levels like all
  other drivers and stop trying to be special.
---
 drivers/net/ethernet/cortina/gemini.c | 44 +++
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 8fc31723f700..f219208d2351 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -46,6 +46,11 @@
 #define DRV_NAME   "gmac-gemini"
 #define DRV_VERSION"1.0"
 
+#define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
+static int debug = -1;
+module_param(debug, int, 0);
+MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
+
 #define HSIZE_80x00
 #define HSIZE_16   0x01
 #define HSIZE_32   0x02
@@ -300,23 +305,26 @@ static void gmac_speed_set(struct net_device *netdev)
status.bits.speed = GMAC_SPEED_1000;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_1000;
-   netdev_info(netdev, "connect to RGMII @ 1Gbit\n");
+   netdev_dbg(netdev, "connect %s to RGMII @ 1Gbit\n",
+  phydev_name(phydev));
break;
case 100:
status.bits.speed = GMAC_SPEED_100;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-   netdev_info(netdev, "connect to RGMII @ 100 Mbit\n");
+   netdev_dbg(netdev, "connect %s to RGMII @ 100 Mbit\n",
+  phydev_name(phydev));
break;
case 10:
status.bits.speed = GMAC_SPEED_10;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-   netdev_info(netdev, "connect to RGMII @ 10 Mbit\n");
+   netdev_dbg(netdev, "connect %s to RGMII @ 10 Mbit\n",
+  phydev_name(phydev));
break;
default:
-   netdev_warn(netdev, "Not supported PHY speed (%d)\n",
-   phydev->speed);
+   netdev_warn(netdev, "Unsupported PHY speed (%d) on %s\n",
+   phydev->speed, phydev_name(phydev));
}
 
if (phydev->duplex == DUPLEX_FULL) {
@@ -363,11 +371,8 @@ static int gmac_setup_phy(struct net_device *netdev)
return -ENODEV;
netdev->phydev = phy;
 
-   netdev_info(netdev, "connected to PHY \"%s\"\n",
-   phydev_name(phy));
-   phy_attached_print(phy, "phy_id=0x%.8lx, phy_mode=%s\n",
-  (unsigned long)phy->phy_id,
-  phy_modes(phy->interface));
+   netdev_dbg(netdev, "connected to PHY \"%s\"\n",
+  phydev_name(phy));
 
phy->supported &= PHY_GBIT_FEATURES;
phy->supported |= SUPPORTED_Asym_Pause | SUPPORTED_Pause;
@@ -376,19 +381,19 @@ static int gmac_setup_phy(struct net_device *netdev)
/* set PHY interface type */
switch (phy->interface) {
case PHY_INTERFACE_MODE_MII:
-   netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n");
+   netdev_dbg(netdev,
+  "MII: set GMAC0 to GMII mode, GMAC1 disabled\n");
status.bits.mii_rmii = GMAC_PHY_MII;
-   netdev_info(netdev, "connect to MII\n");
break;
case PHY_INTERFACE_MODE_GMII:
-   netdev_info(netdev, "set GMAC0 to GMII mode, GMAC1 disabled\n");
+   netdev_dbg(netdev,
+  "GMII: set GMAC0 to GMII mode, GMAC1 disabled\n");
status.bits.mii_rmii = GMAC_PHY_GMII;
-   netdev_info(netdev, "connect to GMII\n");
break;
case PHY_INTERFACE_MODE_RGMII:
-   dev_info(dev, "set GMAC0 and GMAC1 to MII/RGMII mode\n");
+   netdev_dbg(netdev,
+  "RGMII: set GMAC0 and GMAC1 to MII/RGMII mode\n");
status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-   netdev_info(netdev, "connect to RGMII\n");
break;
default:
netdev_err(netdev, "Unsupported MII interface\n");
@@ -1307,8 +1312,8 @@ static void gmac_enable_irq(struct net_device *netdev, int enable)
unsigned long flags;
u32 val, mask;
 
-   netdev_info(netdev, "%s device %d %s\n", __func__,
-   netdev->dev_id, enable ? "enable" : "disable");
+   netdev_dbg(netdev, "%s device %d %s\n", __func__,
+  netdev->dev_id, enable ? "enable" : "disable");

Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-04 Thread David Ahern
On 7/4/18 11:56 AM, Xin Long wrote:

>> your commands are not a proper test. The test should succeed and fail
>> based on the routing lookup, not iptables rules.
> A proper test can be done easily with netns, as vrf can't isolate much.
> I don't want to bother forwarding/ directory with netns, so I will probably
> just drop this selftest, and let the feature patch go first.
> 

BTW, VRF isolates at the routing layer and this is a routing change. We
need to understand why it does not work with VRF. Perhaps another tweak
is needed for VRF.


[PATCH net-next 1/5 v2] net: gemini: Look up L3 maxlen from table

2018-07-04 Thread Linus Walleij
The code to calculate the hardware register enumerator
for the maximum L3 length isn't entirely simple to read.
Use the existing defines and rewrite the function into a
table look-up.

Signed-off-by: Linus Walleij 
---
ChangeLog v1->v2:
- No changes, just resending with the rest.
---
 drivers/net/ethernet/cortina/gemini.c | 61 ---
 1 file changed, 46 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 6d7404f66f84..8fc31723f700 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -401,26 +401,57 @@ static int gmac_setup_phy(struct net_device *netdev)
return 0;
 }
 
-static int gmac_pick_rx_max_len(int max_l3_len)
-{
-   /* index = CONFIG_MAXLEN_XXX values */
-   static const int max_len[8] = {
-   1536, 1518, 1522, 1542,
-   9212, 10236, 1518, 1518
-   };
-   int i, n = 5;
+/* The maximum frame length is not logically enumerated in the
+ * hardware, so we do a table lookup to find the applicable max
+ * frame length.
+ */
+struct gmac_max_framelen {
+   unsigned int max_l3_len;
+   u8 val;
+};
 
-   max_l3_len += ETH_HLEN + VLAN_HLEN;
+static const struct gmac_max_framelen gmac_maxlens[] = {
+   {
+   .max_l3_len = 1518,
+   .val = CONFIG0_MAXLEN_1518,
+   },
+   {
+   .max_l3_len = 1522,
+   .val = CONFIG0_MAXLEN_1522,
+   },
+   {
+   .max_l3_len = 1536,
+   .val = CONFIG0_MAXLEN_1536,
+   },
+   {
+   .max_l3_len = 1542,
+   .val = CONFIG0_MAXLEN_1542,
+   },
+   {
+   .max_l3_len = 9212,
+   .val = CONFIG0_MAXLEN_9k,
+   },
+   {
+   .max_l3_len = 10236,
+   .val = CONFIG0_MAXLEN_10k,
+   },
+};
+
+static int gmac_pick_rx_max_len(unsigned int max_l3_len)
+{
+   const struct gmac_max_framelen *maxlen;
+   int maxtot;
+   int i;
 
-   if (max_l3_len > max_len[n])
-   return -1;
+   maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN;
 
-   for (i = 0; i < 5; i++) {
-   if (max_len[i] >= max_l3_len && max_len[i] < max_len[n])
-   n = i;
+   for (i = 0; i < ARRAY_SIZE(gmac_maxlens); i++) {
+   maxlen = &gmac_maxlens[i];
+   if (maxtot <= maxlen->max_l3_len)
+   return maxlen->val;
}
 
-   return n;
+   return -1;
 }
 
 static int gmac_init(struct net_device *netdev)
-- 
2.17.1



Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-04 Thread David Ahern
On 7/4/18 11:56 AM, Xin Long wrote:
> A proper test can be done easily with netns, as vrf can't isolate much.
> I don't want to bother forwarding/ directory with netns, so I will probably
> just drop this selftest, and let the feature patch go first.
> 
> What do you think?
> 

I think I would like to see a proper test case for this.

If it does not fit the model that Ido and others are using under the
forwarding directory, then how about one in tools/testing/selftests/net
then.


[PATCH net-next] net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()

2018-07-04 Thread Edward Cree
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
 the skb, we can't use the list_cut_before() method; we can't even do a
 list_del(&skb->list) in the drop case, because skb might have already been
 freed and reused.
So instead, take each skb off the source list before processing, and add it
 to the sublist afterwards if it wasn't freed or stolen.

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Fixes: 17266ee93984 ("net: ipv4: listified version of ip_rcv")
Signed-off-by: Edward Cree 
---
 net/ipv4/ip_input.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 24b9b0210aeb..14ba628b2761 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -540,24 +540,27 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
struct sk_buff *skb, *next;
struct list_head sublist;
 
+   INIT_LIST_HEAD(&sublist);
list_for_each_entry_safe(skb, next, head, list) {
struct dst_entry *dst;
 
+   list_del(&skb->list);
if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
continue;
 
dst = skb_dst(skb);
if (curr_dst != dst) {
/* dispatch old sublist */
-   list_cut_before(&sublist, head, &skb->list);
if (!list_empty(&sublist))
ip_sublist_rcv_finish(&sublist);
/* start new sublist */
+   INIT_LIST_HEAD(&sublist);
curr_dst = dst;
}
+   list_add_tail(&skb->list, &sublist);
}
/* dispatch final sublist */
-   ip_sublist_rcv_finish(head);
+   ip_sublist_rcv_finish(&sublist);
 }
 
 static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
@@ -577,24 +580,27 @@ void ip_list_rcv(struct list_head *head, struct packet_type *pt,
struct sk_buff *skb, *next;
struct list_head sublist;
 
+   INIT_LIST_HEAD(&sublist);
list_for_each_entry_safe(skb, next, head, list) {
struct net_device *dev = skb->dev;
struct net *net = dev_net(dev);
 
+   list_del(&skb->list);
skb = ip_rcv_core(skb, net);
if (skb == NULL)
continue;
 
if (curr_dev != dev || curr_net != net) {
/* dispatch old sublist */
-   list_cut_before(&sublist, head, &skb->list);
if (!list_empty(&sublist))
-   ip_sublist_rcv(&sublist, dev, net);
+   ip_sublist_rcv(&sublist, curr_dev, curr_net);
/* start new sublist */
+   INIT_LIST_HEAD(&sublist);
curr_dev = dev;
curr_net = net;
}
+   list_add_tail(&skb->list, &sublist);
}
/* dispatch final sublist */
-   ip_sublist_rcv(head, curr_dev, curr_net);
+   ip_sublist_rcv(&sublist, curr_dev, curr_net);
 }


Re: [PATCH net-next 08/10] r8169: remove rtl8169_set_speed_xmii

2018-07-04 Thread Heiner Kallweit
On 04.07.2018 16:46, Andrew Lunn wrote:
> On Mon, Jul 02, 2018 at 11:54:54PM +0200, Heiner Kallweit wrote:
>> On 02.07.2018 23:21, Andrew Lunn wrote:
>>>> -  auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
>>>
>>> This bit you probably want to keep. The PHY never says it support
>>> Pause. The MAC needs to enable pause if the MAC supports pause.
>>>
>> Actually I assumed that phylib would do this for me. But:
>> In phy_probe() first phydev->supported is copied to
>> phydev->advertising, and only after this both pause flags are added
>> to phydev->supported. Therefore I think they are not advertised.
>> Is this intentional? It sounds a little weird to me to add the
>> pause flags to the supported features per default, but not
>> advertise them.
> 
> phylib has no idea if the MAC supports Pause. So it should not enable
> it by default. The MAC needs to enable it. And a lot of MAC drivers
> get this wrong...
> 
>> Except e.g. we call by chance phy_set_max_speed(), which copies
>> phydev->supported to phydev->advertising after having adjusted
>> the supported speeds.
> 
> As you correctly pointed out, phy_set_max_speed() is masking out too
> much.
> 
>> If this is not a bug, then where would be the right place to add
>> the pause flags to phydev->advertising?
> 
> Before you call phy_start().
> 
Thanks for the clarification. I think beginning of next week I can
provide a v2 of the patch series.

Heiner


>Andrew
> 



Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-04 Thread Xin Long
On Wed, Jul 4, 2018 at 3:23 AM, David Ahern  wrote:
> On 7/3/18 5:36 AM, Xin Long wrote:
>> On Mon, Jul 2, 2018 at 11:12 PM, David Ahern  wrote:
>>> On 7/2/18 12:30 AM, Xin Long wrote:
 +ping_ipv4()
 +{
 + sysctl_set net.ipv4.icmp_echo_ignore_broadcasts 0
 + bc_forwarding_disable
 + ping_test $h1 198.51.100.255
 +
 + iptables -A INPUT -i vrf-r1 -p icmp -j DROP
 + bc_forwarding_restore
 + bc_forwarding_enable
 + ping_test $h1 198.51.100.255
 +
 + bc_forwarding_restore
 + iptables -D INPUT -i vrf-r1 -p icmp -j DROP
 + sysctl_restore net.ipv4.icmp_echo_ignore_broadcasts
 +}
>>>
>>> Both tests fail for me:
>>> TEST: ping  [FAIL]
>>> TEST: ping  [FAIL]
>> I think 'ip vrf exec ...' is not working in your env, while
>> the testing is using "ip vrf exec vrf-h1 ping ..."
>>
>> You can test it by:
>> # ip link add dev vrf-test type vrf table 
>> # ip vrf exec vrf-test ls
>
> well, that's embarrassing. yes, I updated ip and forgot to apply the bpf
> workaround to define the syscall number (not defined in jessie).
>
>>
>>>
>>> Why the need for the iptables rule?
>> This iptables rule is to block the echo_request packet going to
>> route's local_in.
>> When bc_forwarding is NOT doing forwarding well but the packet
>> goes to the route's local_in, it will fail.
>>
>> Without this rule, the 2nd ping will always succeed, we can't tell the
>> echo_reply is from route or h2.
>>
>> Or you have a better way to test this?
>
> your commands are not a proper test. The test should succeed and fail
> based on the routing lookup, not iptables rules.
A proper test can be done easily with netns, as vrf can't isolate much.
I don't want to bother forwarding/ directory with netns, so I will probably
just drop this selftest, and let the feature patch go first.

What do you think?

>
>>
>>>
>>> And, PAUSE_ON_FAIL is not working to take a look at why tests are
>>> failing. e.g.,
>>>
>>> PAUSE_ON_FAIL=yes ./router_broadcast.sh
>>>
>>> just continues on. Might be something with the infrastructure scripts.
>> Yes, in ./router_broadcast.sh, it loads lib.sh where it loads 
>> forwarding.config
>> where it has "PAUSE_ON_FAIL=no", which would override your
>> "PAUSE_ON_FAIL=yes".
>>
>
> ack. bit by that as well.


Re: [PATCH rdma-next 0/3] Dump and fill MKEY

2018-07-04 Thread Jason Gunthorpe
On Tue, Jun 19, 2018 at 08:47:21AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> MLX5 IB HCA offers the memory key, dump_fill_mkey to increase
> performance, when used in a send or receive operations.
> 
> It is used to force local HCA operations to skip the PCI bus access,
> while keeping track of the processed length in the ibv_sge handling.
> 
> In this three patch series, we expose various bits in our HW
> spec file (mlx5_ifc.h), move unneeded for mlx5_core FW command and
> export such memory key to user space thought our mlx5-abi header file.
> 
> Thanks

This looks fine, can you send a pull request off the mlx5 branch
please?

Thanks,
Jason


Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv

2018-07-04 Thread Edward Cree
On 02/07/18 16:14, Edward Cree wrote:
> +/* Receive a list of IP packets */
> +void ip_list_rcv(struct list_head *head, struct packet_type *pt,
> +  struct net_device *orig_dev)
> +{
> + struct net_device *curr_dev = NULL;
> + struct net *curr_net = NULL;
> + struct sk_buff *skb, *next;
> + struct list_head sublist;
> +
> + list_for_each_entry_safe(skb, next, head, list) {
> + struct net_device *dev = skb->dev;
> + struct net *net = dev_net(dev);
> +
> + skb = ip_rcv_core(skb, net);
> + if (skb == NULL)
> + continue;
I've spotted a bug here, in that if ip_rcv_core() eats the skb (e.g. by
 freeing it) it won't list_del() it, so when we process the sublist we'll
 end up trying to process this skb anyway.
Thus, places where an skb could get freed (possibly remotely, as in nf
 hooks that can steal packets) need to use the dequeue/enqueue model
 rather than the list_cut_before() approach.
Followup patches soon.

-Ed
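
To make the model concrete, here is a minimal sketch of the
dequeue/enqueue pattern described above (process_one() and the sublist
handling are illustrative stand-ins, not the actual fix):

	list_for_each_entry_safe(skb, next, head, list) {
		/* detach first; the callee may free or steal the skb,
		 * after which skb->list must not be touched
		 */
		list_del(&skb->list);
		skb = process_one(skb);
		if (skb)	/* survived: queue it for the next stage */
			list_add_tail(&skb->list, &sublist);
	}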


Re: [PATCH net-next 08/10] r8169: remove rtl8169_set_speed_xmii

2018-07-04 Thread Andrew Lunn
On Mon, Jul 02, 2018 at 11:54:54PM +0200, Heiner Kallweit wrote:
> On 02.07.2018 23:21, Andrew Lunn wrote:
> >> -  auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
> > 
> > This bit you probably want to keep. The PHY never says it support
> > Pause. The MAC needs to enable pause if the MAC supports pause.
> > 
> Actually I assumed that phylib would do this for me. But:
> In phy_probe() first phydev->supported is copied to
> phydev->advertising, and only after this both pause flags are added
> to phydev->supported. Therefore I think they are not advertised.
> Is this intentional? It sounds a little weird to me to add the
> pause flags to the supported features per default, but not
> advertise them.

phylib has no idea if the MAC supports Pause. So it should not enable
it by default. The MAC needs to enable it. And a lot of MAC drivers
get this wrong...

> Except e.g. we call by chance phy_set_max_speed(), which copies
> phydev->supported to phydev->advertising after having adjusted
> the supported speeds.

As you correctly pointed out, phy_set_max_speed() is masking out too
much.

> If this is not a bug, then where would be the right place to add
> the pause flags to phydev->advertising?

Before you call phy_start().

   Andrew
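
For illustration, with the u32 bitmask API current at the time, a MAC
driver that does handle pause frames could do roughly the following
before starting the PHY (flag names are the standard ones from
linux/ethtool.h; a sketch, not a quote from any driver):

	/* the MAC handles pause frames, so let the PHY advertise them */
	phydev->supported   |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
	phydev->advertising |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
	phy_start(phydev);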


[PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-07-04 Thread Matteo Croce
From: Stefano Brivio 

Open vSwitch sends to userspace all received packets that have
no associated flow (thus doing an "upcall"). Then the userspace
program creates a new flow and determines the actions to apply
based on its configuration.

When a single port generates a high rate of upcalls, it can
prevent other ports from dispatching their own upcalls. vswitchd
overcomes this problem by creating many netlink sockets for each
port, but it quickly exceeds any reasonable maximum number of
open files when dealing with huge amounts of ports.

This patch queues all the upcalls into a list, ordering them in
a per-port round-robin fashion, and schedules a deferred work to
queue them to userspace.

The algorithm to queue upcalls in a round-robin fashion,
provided by Stefano, is based on these two rules:
 - upcalls for a given port must be inserted after all the other
   occurrences of upcalls for the same port already in the queue,
   in order to avoid out-of-order upcalls for a given port
 - insertion happens once the highest upcall count for any given
   port (excluding the one currently at hand) is greater than the
   count for the port we're queuing to -- if this condition is
   never true, the upcall is queued at the tail. This results in a
   per-port round-robin order. For example, queuing an upcall for
   port C while the queue holds A A B yields A C A B, whereas a new
   upcall for port A must follow the existing ones, yielding A A B A.

In order to implement a fair round-robin behaviour, a variable
queueing delay is introduced. The delay is zero if the upcall
rate is below a given threshold, and grows linearly with the
queue utilisation (i.e. the upcall rate) otherwise.

This ensures fairness among ports under load and with few
netlink sockets.

Signed-off-by: Matteo Croce 
Co-authored-by: Stefano Brivio 
---
 net/openvswitch/datapath.c | 143 ++---
 net/openvswitch/datapath.h |  27 ++-
 2 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0f5ce77460d4..2cfd504562d8 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -59,6 +59,10 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+#define UPCALL_QUEUE_TIMEOUT   msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_DELAY msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_LEN   200
+
 unsigned int ovs_net_id __read_mostly;
 
 static struct genl_family dp_packet_genl_family;
@@ -225,6 +229,116 @@ void ovs_dp_detach_port(struct vport *p)
ovs_vport_del(p);
 }
 
+static void ovs_dp_upcall_dequeue(struct work_struct *work)
+{
+   struct datapath *dp = container_of(work, struct datapath,
+  upcalls.work.work);
+   struct dp_upcall_info *u, *n;
+
+   spin_lock_bh(&dp->upcalls.lock);
+   list_for_each_entry_safe(u, n, &dp->upcalls.list, list) {
+   if (unlikely(ovs_dp_upcall(dp, u->skb, &u->key, u, 0)))
+   kfree_skb(u->skb);
+   else
+   consume_skb(u->skb);
+   kfree(u);
+   }
+   dp->upcalls.len = 0;
+   INIT_LIST_HEAD(&dp->upcalls.list);
+   spin_unlock_bh(&dp->upcalls.lock);
+}
+
+/* Calculate the delay of the deferred work which sends the upcalls. If it ran
+ * more than UPCALL_QUEUE_TIMEOUT ago, schedule the work immediately. Otherwise
+ * return a time between 0 and UPCALL_QUEUE_MAX_DELAY, depending linearly on
+ * the queue utilisation.
+ */
+static unsigned long ovs_dp_upcall_delay(int queue_len, unsigned long last_run)
+{
+   if (jiffies - last_run >= UPCALL_QUEUE_TIMEOUT)
+   return 0;
+
+   return UPCALL_QUEUE_MAX_DELAY -
+  UPCALL_QUEUE_MAX_DELAY * queue_len / UPCALL_QUEUE_MAX_LEN;
+}
+
+static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp,
+ struct dp_upcall_info *upcall)
+{
+   struct list_head *head = &dp->upcalls.list;
+   struct dp_upcall_info *here = NULL, *pos;
+   bool find_next = true;
+   unsigned long delay;
+   int err = 0;
+   u8 count;
+
+   spin_lock_bh(&dp->upcalls.lock);
+   if (dp->upcalls.len > UPCALL_QUEUE_MAX_LEN) {
+   err = -ENOSPC;
+   goto out;
+   }
+
+   /* Insert upcalls in the list in a per-port round-robin fashion, look
+* for insertion point:
+* - to avoid out-of-order per-port upcalls, we can insert only after
+*   the last occurrence of upcalls for the same port
+* - insert upcall only after we reach a count of occurrences for a
+*   given port greater than the one we're inserting this upcall for
+*/
+   list_for_each_entry(pos, head, list) {
+   /* Count per-port upcalls. */
+   if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) {
+   err = -ENOSPC;
+   goto out_clear;
+   }
+   dp->upcalls.count[pos->port_no]++;
+
+   if (pos->port_no == upcall->port_no) {
+   /* Another upcall for the same port: move insertion
+ 

Re: [PATCH net] net: phy: fix flag masking in __set_phy_supported

2018-07-04 Thread Andrew Lunn
On Tue, Jul 03, 2018 at 10:34:54PM +0200, Heiner Kallweit wrote:
> Currently also the pause flags are removed from phydev->supported because
> they're not included in PHY_DEFAULT_FEATURES. I don't think this is
> intended, especially when considering that this function can be called
> via phy_set_max_speed() anywhere in a driver. Change the masking to mask
> out only the values we're going to change. In addition remove the
> misleading comment, job of this small function is just to adjust the
> supported and advertised speeds.
> 
> Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper")
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Andrew Lunn 

Andrew


[PATCH net] qed: Fix reading stale configuration information

2018-07-04 Thread Denis Bolotin
Configuration information read at driver load can become stale after it is
updated. Mark information as not valid and re-populate when this happens.

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h |  1 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 39 +--
 2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index 00db340..1dfaccd 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -502,6 +502,7 @@ enum BAR_ID {
 struct qed_nvm_image_info {
u32 num_images;
struct bist_nvm_image_att *image_att;
+   bool valid;
 };
 
 #define DRV_MODULE_VERSION   \
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 4e0b443..9d9e533 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -592,6 +592,9 @@ int qed_mcp_nvm_wr_cmd(struct qed_hwfn *p_hwfn,
*o_mcp_resp = mb_params.mcp_resp;
*o_mcp_param = mb_params.mcp_param;
 
+   /* nvm_info needs to be updated */
+   p_hwfn->nvm_info.valid = false;
+
return 0;
 }
 
@@ -2555,11 +2558,14 @@ int qed_mcp_bist_nvm_get_image_att(struct qed_hwfn *p_hwfn,
 
 int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
 {
-   struct qed_nvm_image_info *nvm_info = &p_hwfn->nvm_info;
+   struct qed_nvm_image_info nvm_info;
struct qed_ptt *p_ptt;
int rc;
u32 i;
 
+   if (p_hwfn->nvm_info.valid)
+   return 0;
+
p_ptt = qed_ptt_acquire(p_hwfn);
if (!p_ptt) {
DP_ERR(p_hwfn, "failed to acquire ptt\n");
@@ -2567,29 +2573,29 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
}
 
/* Acquire from MFW the amount of available images */
-   nvm_info->num_images = 0;
+   nvm_info.num_images = 0;
rc = qed_mcp_bist_nvm_get_num_images(p_hwfn,
-p_ptt, &nvm_info->num_images);
+p_ptt, &nvm_info.num_images);
if (rc == -EOPNOTSUPP) {
DP_INFO(p_hwfn, "DRV_MSG_CODE_BIST_TEST is not supported\n");
goto out;
-   } else if (rc || !nvm_info->num_images) {
+   } else if (rc || !nvm_info.num_images) {
DP_ERR(p_hwfn, "Failed getting number of images\n");
goto err0;
}
 
-   nvm_info->image_att = kmalloc_array(nvm_info->num_images,
-   sizeof(struct bist_nvm_image_att),
-   GFP_KERNEL);
-   if (!nvm_info->image_att) {
+   nvm_info.image_att = kmalloc_array(nvm_info.num_images,
+  sizeof(struct bist_nvm_image_att),
+  GFP_KERNEL);
+   if (!nvm_info.image_att) {
rc = -ENOMEM;
goto err0;
}
 
/* Iterate over images and get their attributes */
-   for (i = 0; i < nvm_info->num_images; i++) {
+   for (i = 0; i < nvm_info.num_images; i++) {
rc = qed_mcp_bist_nvm_get_image_att(p_hwfn, p_ptt,
-   &nvm_info->image_att[i], i);
+   &nvm_info.image_att[i], i);
if (rc) {
DP_ERR(p_hwfn,
   "Failed getting image index %d attributes\n", i);
@@ -2597,14 +2603,22 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
}
 
DP_VERBOSE(p_hwfn, QED_MSG_SP, "image index %d, size %x\n", i,
-  nvm_info->image_att[i].len);
+  nvm_info.image_att[i].len);
}
 out:
+   /* Update hwfn's nvm_info */
+   if (nvm_info.num_images) {
+   p_hwfn->nvm_info.num_images = nvm_info.num_images;
+   kfree(p_hwfn->nvm_info.image_att);
+   p_hwfn->nvm_info.image_att = nvm_info.image_att;
+   p_hwfn->nvm_info.valid = true;
+   }
+
qed_ptt_release(p_hwfn, p_ptt);
return 0;
 
 err1:
-   kfree(nvm_info->image_att);
+   kfree(nvm_info.image_att);
 err0:
qed_ptt_release(p_hwfn, p_ptt);
return rc;
@@ -2641,6 +2655,7 @@ int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
return -EINVAL;
}
 
+   qed_mcp_nvm_info_populate(p_hwfn);
for (i = 0; i < p_hwfn->nvm_info.num_images; i++)
if (type == p_hwfn->nvm_info.image_att[i].image_type)
break;
-- 
1.8.3.1



Re: [PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF

2018-07-04 Thread David Miller
From: Jesus Sanchez-Palencia 
Date: Tue,  3 Jul 2018 15:42:46 -0700

> Overview
> 
> 
> This work consists of a set of kernel interfaces that can be used by
> applications that require (time-based) Scheduled Tx of packets.
> It is comprised by 3 new components to the kernel:
> 
>   - SO_TXTIME: socket option + cmsg programming interfaces.
> 
>   - etf: the "earliest txtime first" qdisc, that provides per-queue
>TxTime-based scheduling. This has been renamed from 'tbs' to
>'etf' to better describe its functionality.
> 
>   - taprio: the "time-aware priority scheduler" qdisc, that provides
>   per-port Time-Aware scheduling;
> 
> This patchset is providing the first 2 components, which have been
> developed for longer. The taprio qdisc will be shared as an RFC separately
> (shortly).
 ...

I don't have any problems with this, series applied, thanks.


[PATCH v2 2/4] samples/bpf: Check the result of system()

2018-07-04 Thread Taeung Song
To avoid the build warning below, add a new generate_load() helper
that checks the return value of system().

  ignoring return value of ‘system’, declared with attribute warn_unused_result

This also deduplicates the code shared by test_perf_event_all_cpu()
and test_perf_event_task().

Cc: Teng Qin 
Signed-off-by: Taeung Song 
---
 samples/bpf/trace_event_user.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 1fa1becfa641..d08046ab81f0 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -122,6 +122,16 @@ static void print_stacks(void)
}
 }
 
+static inline int generate_load(void)
+{
+   if (system("dd if=/dev/zero of=/dev/null count=5000k status=none") < 0) {
+   printf("failed to generate some load with dd: %s\n", strerror(errno));
+   return -1;
+   }
+
+   return 0;
+}
+
 static void test_perf_event_all_cpu(struct perf_event_attr *attr)
 {
int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
@@ -142,7 +152,11 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE) == 0);
}
-   system("dd if=/dev/zero of=/dev/null count=5000k status=none");
+
+   if (generate_load() < 0) {
+   error = 1;
+   goto all_cpu_err;
+   }
print_stacks();
 all_cpu_err:
for (i--; i >= 0; i--) {
@@ -156,7 +170,7 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
 
 static void test_perf_event_task(struct perf_event_attr *attr)
 {
-   int pmu_fd;
+   int pmu_fd, error = 0;
 
/* per task perf event, enable inherit so the "dd ..." command can be traced properly.
 * Enabling inherit will cause bpf_perf_prog_read_time helper failure.
@@ -171,10 +185,17 @@ static void test_perf_event_task(struct perf_event_attr *attr)
}
assert(ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE) == 0);
-   system("dd if=/dev/zero of=/dev/null count=5000k status=none");
+
+   if (generate_load() < 0) {
+   error = 1;
+   goto err;
+   }
print_stacks();
+err:
ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
close(pmu_fd);
+   if (error)
+   int_exit(0);
 }
 
 static void test_bpf_perf_event(void)
-- 
2.17.1



[PATCH v2 4/4] samples/bpf: add .gitignore file

2018-07-04 Thread Taeung Song
Add a .gitignore file for the untracked executables built under samples/bpf.

  Untracked files:
(use "git add ..." to include in what will be committed)

samples/bpf/cpustat
samples/bpf/fds_example
samples/bpf/lathist
samples/bpf/load_sock_ops
...

Signed-off-by: Taeung Song 
---
 samples/bpf/.gitignore | 49 ++
 1 file changed, 49 insertions(+)
 create mode 100644 samples/bpf/.gitignore

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index ..8ae4940025f8
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1,49 @@
+cpustat
+fds_example
+lathist
+load_sock_ops
+lwt_len_hist
+map_perf_test
+offwaketime
+per_socket_stats_example
+sampleip
+sock_example
+sockex1
+sockex2
+sockex3
+spintest
+syscall_nrs.h
+syscall_tp
+task_fd_query
+tc_l2_redirect
+test_cgrp2_array_pin
+test_cgrp2_attach
+test_cgrp2_attach2
+test_cgrp2_sock
+test_cgrp2_sock2
+test_current_task_under_cgroup
+test_lru_dist
+test_map_in_map
+test_overhead
+test_probe_write_user
+trace_event
+trace_output
+tracex1
+tracex2
+tracex3
+tracex4
+tracex5
+tracex6
+tracex7
+xdp1
+xdp2
+xdp_adjust_tail
+xdp_fwd
+xdp_monitor
+xdp_redirect
+xdp_redirect_cpu
+xdp_redirect_map
+xdp_router_ipv4
+xdp_rxq_info
+xdp_tx_iptunnel
+xdpsock
-- 
2.17.1



[PATCH net-next 16/18] net/mlx5: Accel, add common metadata functions

2018-07-04 Thread Boris Pismenny
This patch adds common functions to handle Mellanox metadata headers.
These functions are used by IPsec and TLS to process FPGA metadata.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 .../net/ethernet/mellanox/mlx5/core/accel/accel.h  | 37 ++
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c   | 19 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 18 +++
 3 files changed, 45 insertions(+), 29 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h
new file mode 100644
index 000..c132604
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/accel.h
@@ -0,0 +1,37 @@
+#ifndef __MLX5E_ACCEL_H__
+#define __MLX5E_ACCEL_H__
+
+#ifdef CONFIG_MLX5_ACCEL
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include "en.h"
+
+static inline bool is_metadata_hdr_valid(struct sk_buff *skb)
+{
+   __be16 *ethtype;
+
+   if (unlikely(skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN))
+   return false;
+   ethtype = (__be16 *)(skb->data + ETH_ALEN * 2);
+   if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE))
+   return false;
+   return true;
+}
+
+static inline void remove_metadata_hdr(struct sk_buff *skb)
+{
+   struct ethhdr *old_eth;
+   struct ethhdr *new_eth;
+
+   /* Remove the metadata from the buffer */
+   old_eth = (struct ethhdr *)skb->data;
+   new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN);
+   memmove(new_eth, old_eth, 2 * ETH_ALEN);
+   /* Ethertype is already in its new place */
+   skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN);
+}
+
+#endif /* CONFIG_MLX5_ACCEL */
+
+#endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
index c245d8e..fda7929 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
@@ -37,6 +37,7 @@
 
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/ipsec.h"
+#include "accel/accel.h"
 #include "en.h"
 
 enum {
@@ -346,19 +347,12 @@ struct sk_buff *mlx5e_ipsec_handle_tx_skb(struct net_device *netdev,
 }
 
 struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
- struct sk_buff *skb)
+ struct sk_buff *skb, u32 *cqe_bcnt)
 {
struct mlx5e_ipsec_metadata *mdata;
-   struct ethhdr *old_eth;
-   struct ethhdr *new_eth;
struct xfrm_state *xs;
-   __be16 *ethtype;
 
-   /* Detect inline metadata */
-   if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)
-   return skb;
-   ethtype = (__be16 *)(skb->data + ETH_ALEN * 2);
-   if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE))
+   if (!is_metadata_hdr_valid(skb))
return skb;
 
/* Use the metadata */
@@ -369,12 +363,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
return NULL;
}
 
-   /* Remove the metadata from the buffer */
-   old_eth = (struct ethhdr *)skb->data;
-   new_eth = (struct ethhdr *)(skb->data + MLX5E_METADATA_ETHER_LEN);
-   memmove(new_eth, old_eth, 2 * ETH_ALEN);
-   /* Ethertype is already in its new place */
-   skb_pull_inline(skb, MLX5E_METADATA_ETHER_LEN);
+   remove_metadata_hdr(skb);
 
return skb;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index ecfc764..92d3745 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -33,6 +33,8 @@
 
 #include "en_accel/tls.h"
 #include "en_accel/tls_rxtx.h"
+#include "accel/accel.h"
+
 #include <net/inet6_hashtables.h>
 #include <linux/ipv6.h>
 
@@ -350,16 +352,9 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
 u32 *cqe_bcnt)
 {
struct mlx5e_tls_metadata *mdata;
-   struct ethhdr *old_eth;
-   struct ethhdr *new_eth;
-   __be16 *ethtype;
struct mlx5e_priv *priv;
 
-   /* Detect inline metadata */
-   if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)
-   return;
-   ethtype = (__be16 *)(skb->data + ETH_ALEN * 2);
-   if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE))
+   if (!is_metadata_hdr_valid(skb))
return;
 
/* Use the metadata */
@@ -383,11 +378,6 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
return;
}
 
-   /* Remove the metadata from the buffer */
-   old_eth = (struct ethhdr *)skb->data;
-   new_eth = (struct ethhdr 

[PATCH net-next 07/18] tls: Split tls_sw_release_resources_rx

2018-07-04 Thread Boris Pismenny
This patch splits tls_sw_release_resources_rx into two functions: one
which releases all inner software tls structures and another that also
frees the containing structure.

In TLS_DEVICE we will need to release the software structures without
freeing the containing structure, which holds other information.

Signed-off-by: Boris Pismenny 
---
 include/net/tls.h |  1 +
 net/tls/tls_sw.c  | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 49b8922..7a485de 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -223,6 +223,7 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
+void tls_sw_release_resources_rx(struct sock *sk);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
   int nonblock, int flags, int *addr_len);
 unsigned int tls_sw_poll(struct file *file, struct socket *sock,
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 8f53b92..7a5c36c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1034,7 +1034,7 @@ void tls_sw_free_resources_tx(struct sock *sk)
kfree(ctx);
 }
 
-void tls_sw_free_resources_rx(struct sock *sk)
+void tls_sw_release_resources_rx(struct sock *sk)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -1053,6 +1053,14 @@ void tls_sw_free_resources_rx(struct sock *sk)
strp_done(&ctx->strp);
lock_sock(sk);
}
+}
+
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+   struct tls_context *tls_ctx = tls_get_ctx(sk);
+   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+
+   tls_sw_release_resources_rx(sk);
 
kfree(ctx);
 }
-- 
1.8.3.1



[PATCH net-next 11/18] net/mlx5: Accel, add TLS rx offload routines

2018-07-04 Thread Boris Pismenny
In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Complete the implementation for Innova TLS (FPGA-based) hardware by
adding support for rx inline crypto offload.

Signed-off-by: Boris Pismenny 
Signed-off-by: Ilya Lesokhin 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  23 +++--
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  26 +++--
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 -
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  18 ++--
 include/linux/mlx5/mlx5_ifc_fpga.h |   1 +
 5 files changed, 135 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
index 77ac19f..da7bd26 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -37,17 +37,26 @@
 #include "mlx5_core.h"
 #include "fpga/tls.h"
 
-int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
-  struct tls_crypto_info *crypto_info,
-  u32 start_offload_tcp_sn, u32 *p_swid)
+int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+   struct tls_crypto_info *crypto_info,
+   u32 start_offload_tcp_sn, u32 *p_swid,
+   bool direction_sx)
 {
-   return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
-start_offload_tcp_sn, p_swid);
+   return mlx5_fpga_tls_add_flow(mdev, flow, crypto_info,
+ start_offload_tcp_sn, p_swid,
+ direction_sx);
 }
 
-void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid,
+bool direction_sx)
 {
-   mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL);
+   mlx5_fpga_tls_del_flow(mdev, swid, GFP_KERNEL, direction_sx);
+}
+
+int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq,
+u64 rcd_sn)
+{
+   return mlx5_fpga_tls_resync_rx(mdev, handle, seq, rcd_sn);
 }
 
 bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
index 6f9c9f4..2228c10 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
@@ -60,10 +60,14 @@ struct mlx5_ifc_tls_flow_bits {
u8 reserved_at_2[0x1e];
 };
 
-int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
-  struct tls_crypto_info *crypto_info,
-  u32 start_offload_tcp_sn, u32 *p_swid);
-void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid);
+int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+   struct tls_crypto_info *crypto_info,
+   u32 start_offload_tcp_sn, u32 *p_swid,
+   bool direction_sx);
+void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid,
+bool direction_sx);
+int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle, u32 seq,
+u64 rcd_sn);
 bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev);
 u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev);
 int mlx5_accel_tls_init(struct mlx5_core_dev *mdev);
@@ -71,11 +75,15 @@ int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
 
 #else
 
-static inline int
-mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
-  struct tls_crypto_info *crypto_info,
-  u32 start_offload_tcp_sn, u32 *p_swid) { return 0; }
-static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { }
+static int
+mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+   struct tls_crypto_info *crypto_info,
+   u32 start_offload_tcp_sn, u32 *p_swid,
+   bool direction_sx) { return -ENOTSUPP; }
+static inline void mlx5_accel_tls_del_flow(struct mlx5_core_dev *mdev, u32 swid,
+  bool direction_sx) { }
+static inline int mlx5_accel_tls_resync_rx(struct mlx5_core_dev *mdev, u32 handle,
+  u32 seq, u64 rcd_sn) { return 0; }
 static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; }
 static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; }
 static inline int 

[PATCH net-next 06/18] tls: Split decrypt_skb to two functions

2018-07-04 Thread Boris Pismenny
Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.

Later, in the tls_device Rx flow, we will use decrypt_skb directly.

Signed-off-by: Boris Pismenny 
---
 include/net/tls.h |  2 ++
 net/tls/tls_sw.c  | 44 ++--
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 5dcd808..49b8922 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -390,6 +390,8 @@ int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
  unsigned char *record_type);
 void tls_register_device(struct tls_device *device);
 void tls_unregister_device(struct tls_device *device);
+int decrypt_skb(struct sock *sk, struct sk_buff *skb,
+   struct scatterlist *sgout);
 
 struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
  struct net_device *dev,
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0d670c8..8f53b92 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -53,7 +53,6 @@ static int tls_do_decryption(struct sock *sk,
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-   struct strp_msg *rxm = strp_msg(skb);
struct aead_request *aead_req;
 
int ret;
@@ -74,18 +73,6 @@ static int tls_do_decryption(struct sock *sk,
 
ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait);
 
-   if (ret < 0)
-   goto out;
-
-   rxm->offset += tls_ctx->rx.prepend_size;
-   rxm->full_len -= tls_ctx->rx.overhead_size;
-   tls_advance_record_sn(sk, &tls_ctx->rx);
-
-   ctx->decrypted = true;
-
-   ctx->saved_data_ready(sk);
-
-out:
kfree(aead_req);
return ret;
 }
@@ -670,8 +657,29 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int flags,
return skb;
 }
 
-static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
-  struct scatterlist *sgout)
+static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
+ struct scatterlist *sgout)
+{
+   struct tls_context *tls_ctx = tls_get_ctx(sk);
+   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+   struct strp_msg *rxm = strp_msg(skb);
+   int err;
+
+   err = decrypt_skb(sk, skb, sgout);
+   if (err < 0)
+   return err;
+
+   rxm->offset += tls_ctx->rx.prepend_size;
+   rxm->full_len -= tls_ctx->rx.overhead_size;
+   tls_advance_record_sn(sk, &tls_ctx->rx);
+   ctx->decrypted = true;
+   ctx->saved_data_ready(sk);
+
+   return err;
+}
+
+int decrypt_skb(struct sock *sk, struct sk_buff *skb,
+   struct scatterlist *sgout)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -816,7 +824,7 @@ int tls_sw_recvmsg(struct sock *sk,
if (err < 0)
goto fallback_to_reg_recv;
 
-   err = decrypt_skb(sk, skb, sgin);
+   err = decrypt_skb_update(sk, skb, sgin);
for (; pages > 0; pages--)
put_page(sg_page(&sgin[pages]));
if (err < 0) {
@@ -825,7 +833,7 @@ int tls_sw_recvmsg(struct sock *sk,
}
} else {
 fallback_to_reg_recv:
-   err = decrypt_skb(sk, skb, NULL);
+   err = decrypt_skb_update(sk, skb, NULL);
if (err < 0) {
tls_err_abort(sk, EBADMSG);
goto recv_end;
@@ -896,7 +904,7 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
}
 
if (!ctx->decrypted) {
-   err = decrypt_skb(sk, skb, NULL);
+   err = decrypt_skb_update(sk, skb, NULL);
 
if (err < 0) {
tls_err_abort(sk, EBADMSG);
-- 
1.8.3.1



[PATCH net-next 08/18] tls: Fill software context without allocation

2018-07-04 Thread Boris Pismenny
This patch allows tls_set_sw_offload to fill the context in case it was
already allocated previously.

We will use it in TLS_DEVICE to fill the RX software context.

Signed-off-by: Boris Pismenny 
---
 net/tls/tls_sw.c | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 7a5c36c..e36a2ec 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1085,28 +1085,38 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
}
 
if (tx) {
-   sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL);
-   if (!sw_ctx_tx) {
-   rc = -ENOMEM;
-   goto out;
+   if (!ctx->priv_ctx_tx) {
+   sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL);
+   if (!sw_ctx_tx) {
+   rc = -ENOMEM;
+   goto out;
+   }
+   ctx->priv_ctx_tx = sw_ctx_tx;
+   } else {
+   sw_ctx_tx =
+   (struct tls_sw_context_tx *)ctx->priv_ctx_tx;
}
-   crypto_init_wait(&sw_ctx_tx->async_wait);
-   ctx->priv_ctx_tx = sw_ctx_tx;
} else {
-   sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL);
-   if (!sw_ctx_rx) {
-   rc = -ENOMEM;
-   goto out;
+   if (!ctx->priv_ctx_rx) {
+   sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL);
+   if (!sw_ctx_rx) {
+   rc = -ENOMEM;
+   goto out;
+   }
+   ctx->priv_ctx_rx = sw_ctx_rx;
+   } else {
+   sw_ctx_rx =
+   (struct tls_sw_context_rx *)ctx->priv_ctx_rx;
}
-   crypto_init_wait(&sw_ctx_rx->async_wait);
-   ctx->priv_ctx_rx = sw_ctx_rx;
}
 
if (tx) {
+   crypto_init_wait(&sw_ctx_tx->async_wait);
crypto_info = &ctx->crypto_send;
cctx = &ctx->tx;
aead = &sw_ctx_tx->aead_send;
} else {
+   crypto_init_wait(&sw_ctx_rx->async_wait);
crypto_info = &ctx->crypto_recv;
cctx = &ctx->rx;
aead = &sw_ctx_rx->aead_recv;
-- 
1.8.3.1



[PATCH net-next 00/18] TLS offload rx, netdev & mlx5

2018-07-04 Thread Boris Pismenny
Hi,

This series completes the generic infrastructure to offload TLS crypto to
network devices. It enables the kernel TLS socket to skip decryption and
authentication operations for SKBs marked as decrypted on the receive
side of the data path, leaving those computationally expensive operations
to the NIC.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC decrypts a packet's payload if the packet contains the expected TCP
sequence number. The TLS record authentication tag remains unmodified
regardless of decryption. If the packet is decrypted successfully and it
contains an authentication tag, then the authentication check has passed.
Otherwise, if the authentication fails, then the packet is provided
unmodified and the KTLS layer is responsible for handling it.
Out-Of-Order TCP packets are provided unmodified. As a result,
in the slow path some of the SKBs are decrypted while others remain as
ciphertext.

The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs.
In the worst case a received TLS record consists of both plaintext
and ciphertext packets. Such partially decrypted records must be
re-encrypted so that they can then be decrypted as a whole in software.
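
To illustrate the no-coalescing rule, the tcp/GRO patches add a guard of
roughly this shape wherever two SKBs may be merged (a sketch, not the
exact hunk; the decrypted bit is introduced in patch 1):

#ifdef CONFIG_TLS_DEVICE
	if (from->decrypted != to->decrypted)
		return false;	/* never merge plaintext with ciphertext */
#endif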

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW whenever it lost track of the TLS record framing in
the TCP stream.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC identifies packets that should be offloaded according to
the 5-tuple and the TCP sequence number. If these match and the
packet is decrypted and authenticated successfully, then a syndrome
is provided to software. Otherwise, the packet is unmodified.
Decrypted and non-decrypted packets aren't coalesced by the network stack,
and the KTLS layer decrypts and authenticates partially decrypted records.
The NIC provides an indication whenever a resync is required. The resync
operation is triggered by the KTLS layer while parsing TLS record headers.

Finally, we measure the performance obtained by running single stream
iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected
back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound)
and KTLS-Offload running both in Tx and Rx. The results show that the
performance of offload is comparable to TCP.

  | Bandwidth (Gbps) | CPU Tx (%) | CPU rx (%)
TCP   | 28.8 | 5  | 12
KTLS-Offload-Tx-Rx| 28.6 | 7  | 14

Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf



Boris Pismenny (17):
  net: Add decrypted field to skb
  net: Add TLS rx resync NDO
  tcp: Don't coalesce decrypted and encrypted SKBs
  tls: Refactor tls_offload variable names
  tls: Split decrypt_skb to two functions
  tls: Split tls_sw_release_resources_rx
  tls: Fill software context without allocation
  tls: Add rx inline crypto offload
  net/mlx5e: TLS, refactor variable names
  net/mlx5: Accel, add TLS rx offload routines
  net/mlx5e: TLS, add innova rx support
  net/mlx5e: TLS, add Innova TLS rx data path
  net/mlx5e: TLS, add software statistics
  net/mlx5e: TLS, build TLS netdev from capabilities
  net/mlx5: Accel, add common metadata functions
  net/mlx5e: IPsec, fix byte count in CQE
  net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec
accel

Ilya Lesokhin (1):
  net: Add TLS RX offload feature

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|   1 +
 .../net/ethernet/mellanox/mlx5/core/accel/accel.h  |  37 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c|  23 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h|  26 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.c   |  20 +-
 .../mellanox/mlx5/core/en_accel/ipsec_rxtx.h   |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |  68 +++--
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  33 ++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 117 +++-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 113 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  18 +-
 include/linux/mlx5/mlx5_ifc_fpga.h |   1 +
 include/linux/netdev_features.h|   2 +
 include/linux/netdevice.h  |   2 +
 include/linux/skbuff.h |   7 +-
 include/net/tls.h  |  82 +-
 net/core/ethtool.c |   1 +
 net/core/skbuff.c  |   6 +
 

[PATCH net-next 05/18] tls: Refactor tls_offload variable names

2018-07-04 Thread Boris Pismenny
For symmetry, we rename tls_offload_context to
tls_offload_context_tx before we add tls_offload_context_rx.

Signed-off-by: Boris Pismenny 
---
 include/net/tls.h | 16 
 net/tls/tls_device.c  | 26 +-
 net/tls/tls_device_fallback.c |  8 
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 70c2737..5dcd808 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -128,7 +128,7 @@ struct tls_record_info {
skb_frag_t frags[MAX_SKB_FRAGS];
 };
 
-struct tls_offload_context {
+struct tls_offload_context_tx {
struct crypto_aead *aead_send;
spinlock_t lock;/* protects records list */
struct list_head records_list;
@@ -147,8 +147,8 @@ struct tls_offload_context {
 #define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
 };
 
-#define TLS_OFFLOAD_CONTEXT_SIZE   \
-   (ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +   \
+#define TLS_OFFLOAD_CONTEXT_SIZE_TX\
+   (ALIGN(sizeof(struct tls_offload_context_tx), sizeof(void *)) +\
 TLS_DRIVER_STATE_SIZE)
 
 enum {
@@ -239,7 +239,7 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
 void tls_device_init(void);
 void tls_device_cleanup(void);
 
-struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
   u32 seq, u64 *p_record_sn);
 
 static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
@@ -380,10 +380,10 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx(
return (struct tls_sw_context_tx *)tls_ctx->priv_ctx_tx;
 }
 
-static inline struct tls_offload_context *tls_offload_ctx(
-   const struct tls_context *tls_ctx)
+static inline struct tls_offload_context_tx *
+tls_offload_ctx_tx(const struct tls_context *tls_ctx)
 {
-   return (struct tls_offload_context *)tls_ctx->priv_ctx_tx;
+   return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx;
 }
 
 int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
@@ -396,7 +396,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
  struct sk_buff *skb);
 
 int tls_sw_fallback_init(struct sock *sk,
-struct tls_offload_context *offload_ctx,
+struct tls_offload_context_tx *offload_ctx,
 struct tls_crypto_info *crypto_info);
 
 #endif /* _TLS_OFFLOAD_H */
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index a7a8f8e..c01de7d 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -52,9 +52,9 @@
 
 static void tls_device_free_ctx(struct tls_context *ctx)
 {
-   struct tls_offload_context *offload_ctx = tls_offload_ctx(ctx);
+   if (ctx->tx_conf == TLS_HW)
+   kfree(tls_offload_ctx_tx(ctx));
 
-   kfree(offload_ctx);
kfree(ctx);
 }
 
@@ -125,7 +125,7 @@ static void destroy_record(struct tls_record_info *record)
kfree(record);
 }
 
-static void delete_all_records(struct tls_offload_context *offload_ctx)
+static void delete_all_records(struct tls_offload_context_tx *offload_ctx)
 {
struct tls_record_info *info, *temp;
 
@@ -141,14 +141,14 @@ static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_record_info *info, *temp;
-   struct tls_offload_context *ctx;
+   struct tls_offload_context_tx *ctx;
u64 deleted_records = 0;
unsigned long flags;
 
if (!tls_ctx)
return;
 
-   ctx = tls_offload_ctx(tls_ctx);
+   ctx = tls_offload_ctx_tx(tls_ctx);
 
spin_lock_irqsave(&ctx->lock, flags);
info = ctx->retransmit_hint;
@@ -179,7 +179,7 @@ static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
 void tls_device_sk_destruct(struct sock *sk)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
-   struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+   struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
 
if (ctx->open_record)
destroy_record(ctx->open_record);
@@ -219,7 +219,7 @@ static void tls_append_frag(struct tls_record_info *record,
 
 static int tls_push_record(struct sock *sk,
   struct tls_context *ctx,
-  struct tls_offload_context *offload_ctx,
+  struct tls_offload_context_tx *offload_ctx,
   struct tls_record_info *record,
   struct page_frag *pfrag,
   int flags,
@@ -264,7 +264,7 @@ static int tls_push_record(struct sock *sk,
return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
 

[PATCH net-next 17/18] net/mlx5e: IPsec, fix byte count in CQE

2018-07-04 Thread Boris Pismenny
This patch fixes the byte count indication in CQE for processed IPsec
packets that contain a metadata header.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
index fda7929..128a82b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
@@ -364,6 +364,7 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
}
 
remove_metadata_hdr(skb);
+   *cqe_bcnt -= MLX5E_METADATA_ETHER_LEN;
 
return skb;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
index 2bfbbef..ca47c05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.h
@@ -41,7 +41,7 @@
 #include "en.h"
 
 struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
- struct sk_buff *skb);
+ struct sk_buff *skb, u32 *cqe_bcnt);
 void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 
 void mlx5e_ipsec_inverse_table_init(void);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0f2bcc2..1d5295e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1547,7 +1547,7 @@ void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
mlx5e_free_rx_wqe(rq, wi);
goto wq_cyc_pop;
}
-   skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb);
+   skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb, &cqe_bcnt);
if (unlikely(!skb)) {
mlx5e_free_rx_wqe(rq, wi);
goto wq_cyc_pop;
-- 
1.8.3.1



[PATCH net-next 03/18] net: Add TLS rx resync NDO

2018-07-04 Thread Boris Pismenny
Add a new netdev tls op for resynchronizing the HW tls context

Signed-off-by: Boris Pismenny 
---
 include/linux/netdevice.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c1ef749..022c55b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -897,6 +897,8 @@ struct tlsdev_ops {
void (*tls_dev_del)(struct net_device *netdev,
struct tls_context *ctx,
enum tls_offload_ctx_dir direction);
+   void (*tls_dev_resync_rx)(struct net_device *netdev,
+ struct sock *sk, u32 seq, u64 rcd_sn);
 };
 #endif
 
-- 
1.8.3.1
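
For a sense of how a driver consumes the new op, a hypothetical
implementation could be wired up as below (names are illustrative;
mlx5e's real version appears later in this series):

static void mydrv_tls_resync_rx(struct net_device *netdev, struct sock *sk,
				u32 seq, u64 rcd_sn)
{
	/* reprogram the HW decrypt context to resume at TCP sequence
	 * 'seq' with TLS record sequence 'rcd_sn'
	 */
}

static const struct tlsdev_ops mydrv_tls_ops = {
	.tls_dev_resync_rx	= mydrv_tls_resync_rx,
};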



[PATCH net-next 02/18] net: Add TLS RX offload feature

2018-07-04 Thread Boris Pismenny
From: Ilya Lesokhin 

This patch adds a netdev feature to configure TLS RX inline crypto offload.

Signed-off-by: Ilya Lesokhin 
Signed-off-by: Boris Pismenny 
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c  | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 623bb8c..2b2a6dc 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -79,6 +79,7 @@ enum {
NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */
NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
NETIF_F_HW_TLS_TX_BIT,  /* Hardware TLS TX offload */
+   NETIF_F_HW_TLS_RX_BIT,  /* Hardware TLS RX offload */
 
NETIF_F_GRO_HW_BIT, /* Hardware Generic receive offload */
NETIF_F_HW_TLS_RECORD_BIT,  /* Offload TLS record */
@@ -151,6 +152,7 @@ enum {
 #define NETIF_F_HW_TLS_RECORD  __NETIF_F(HW_TLS_RECORD)
 #define NETIF_F_GSO_UDP_L4 __NETIF_F(GSO_UDP_L4)
 #define NETIF_F_HW_TLS_TX  __NETIF_F(HW_TLS_TX)
+#define NETIF_F_HW_TLS_RX  __NETIF_F(HW_TLS_RX)
 
 #define for_each_netdev_feature(mask_addr, bit)\
for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index e677a20..c9993c6 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =   "rx-udp_tunnel-port-offload",
[NETIF_F_HW_TLS_RECORD_BIT] =   "tls-hw-record",
[NETIF_F_HW_TLS_TX_BIT] ="tls-hw-tx-offload",
+   [NETIF_F_HW_TLS_RX_BIT] ="tls-hw-rx-offload",
 };
 
 static const char
-- 
1.8.3.1



[PATCH net-next 15/18] net/mlx5e: TLS, build TLS netdev from capabilities

2018-07-04 Thread Boris Pismenny
This patch enables TLS Rx based on available HW capabilities.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 7f26869..66b642a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -182,13 +182,27 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk,
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 {
+   u32 caps = mlx5_accel_tls_device_caps(priv->mdev);
struct net_device *netdev = priv->netdev;
 
if (!mlx5_accel_is_tls_device(priv->mdev))
return;
 
-   netdev->features |= NETIF_F_HW_TLS_TX;
-   netdev->hw_features |= NETIF_F_HW_TLS_TX;
+   if (caps & MLX5_ACCEL_TLS_TX) {
+   netdev->features  |= NETIF_F_HW_TLS_TX;
+   netdev->hw_features   |= NETIF_F_HW_TLS_TX;
+   }
+
+   if (caps & MLX5_ACCEL_TLS_RX) {
+   netdev->features  |= NETIF_F_HW_TLS_RX;
+   netdev->hw_features   |= NETIF_F_HW_TLS_RX;
+   }
+
+   if (!(caps & MLX5_ACCEL_TLS_LRO)) {
+   netdev->features  &= ~NETIF_F_LRO;
+   netdev->hw_features   &= ~NETIF_F_LRO;
+   }
+
netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
 
-- 
1.8.3.1



[PATCH net-next 18/18] net/mlx5e: Kconfig, mutually exclude compilation of TLS and IPsec accel

2018-07-04 Thread Boris Pismenny
We currently have no devices that support both TLS and IPsec using the
accel framework, and the current code does not support enabling both
at once. This patch prevents such a combination.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 2545296..d3e8c70 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -93,6 +93,7 @@ config MLX5_EN_TLS
depends on TLS_DEVICE
depends on TLS=y || MLX5_CORE=m
depends on MLX5_ACCEL
+   depends on !MLX5_EN_IPSEC
default n
---help---
  Build support for TLS cryptography-offload accelaration in the NIC.
-- 
1.8.3.1



[PATCH net-next 01/18] net: Add decrypted field to skb

2018-07-04 Thread Boris Pismenny
The decrypted bit is propagated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.

Signed-off-by: Boris Pismenny 
Signed-off-by: Ilya Lesokhin 
---
 include/linux/skbuff.h | 7 ++-
 net/core/skbuff.c  | 6 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7601838..3ceb8dc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -630,6 +630,7 @@ enum {
  * @hash: the packet hash
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
+ * @decrypted: Decrypted SKB
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -736,7 +737,11 @@ struct sk_buff {
peeked:1,
head_frag:1,
xmit_more:1,
-   __unused:1; /* one bit hole */
+#ifdef CONFIG_TLS_DEVICE
+   decrypted:1;
+#else
+   __unused:1;
+#endif
 
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1357f36..64180d4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -805,6 +805,9 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 * It is not yet because we do not want to have a 16 bit hole
 */
new->queue_mapping = old->queue_mapping;
+#ifdef CONFIG_TLS_DEVICE
+   new->decrypted = old->decrypted;
+#endif
 
memcpy(>headers_start, >headers_start,
   offsetof(struct sk_buff, headers_end) -
@@ -865,6 +868,9 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
C(head_frag);
C(data);
C(truesize);
+#ifdef CONFIG_TLS_DEVICE
+   C(decrypted);
+#endif
refcount_set(&n->users, 1);
 
atomic_inc(&(skb_shinfo(skb)->dataref));
-- 
1.8.3.1



[PATCH net-next 13/18] net/mlx5e: TLS, add Innova TLS rx data path

2018-07-04 Thread Boris Pismenny
Implement the TLS rx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

A special metadata ethertype is used to pass information to
the hardware.

When the hardware loses synchronization, a special resync request
metadata message is used to request a resync.

Signed-off-by: Boris Pismenny 
Signed-off-by: Ilya Lesokhin 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c | 112 -
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|   6 ++
 3 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index c96196f..d460fda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -33,6 +33,12 @@
 
 #include "en_accel/tls.h"
 #include "en_accel/tls_rxtx.h"
+#include <net/inet6_hashtables.h>
+#include <linux/ipv6.h>
+
+#define SYNDROM_DECRYPTED  0x30
+#define SYNDROM_RESYNC_REQUEST 0x31
+#define SYNDROM_AUTH_FAILED 0x32
 
 #define SYNDROME_OFFLOAD_REQUIRED 32
 #define SYNDROME_SYNC 33
@@ -44,10 +50,26 @@ struct sync_info {
skb_frag_t frags[MAX_SKB_FRAGS];
 };
 
-struct mlx5e_tls_metadata {
+struct recv_metadata_content {
+   u8 syndrome;
+   u8 reserved;
+   __be32 sync_seq;
+} __packed;
+
+struct send_metadata_content {
/* One byte of syndrome followed by 3 bytes of swid */
__be32 syndrome_swid;
__be16 first_seq;
+} __packed;
+
+struct mlx5e_tls_metadata {
+   union {
+   /* from fpga to host */
+   struct recv_metadata_content recv;
+   /* from host to fpga */
+   struct send_metadata_content send;
+   unsigned char raw[6];
+   } __packed content;
/* packet type ID field */
__be16 ethertype;
 } __packed;
@@ -68,7 +90,8 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid)
2 * ETH_ALEN);
 
eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE);
-   pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid;
+   pet->content.send.syndrome_swid =
+   htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid;
 
return 0;
 }
@@ -149,7 +172,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
 
pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr));
memcpy(pet, , sizeof(syndrome));
-   pet->first_seq = htons(tcp_seq);
+   pet->content.send.first_seq = htons(tcp_seq);
 
/* MLX5 devices don't care about the checksum partial start, offset
 * and pseudo header
@@ -276,3 +299,86 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 out:
return skb;
 }
+
+static int tls_update_resync_sn(struct net_device *netdev,
+   struct sk_buff *skb,
+   struct mlx5e_tls_metadata *mdata)
+{
+   struct sock *sk = NULL;
+   struct iphdr *iph;
+   struct tcphdr *th;
+   __be32 seq;
+
+   if (mdata->ethertype != htons(ETH_P_IP))
+   return -EINVAL;
+
+   iph = (struct iphdr *)(mdata + 1);
+
+   th = ((void *)iph) + iph->ihl * 4;
+
+   if (iph->version == 4) {
+   sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo,
+iph->saddr, th->source, iph->daddr,
+th->dest, netdev->ifindex);
+#if IS_ENABLED(CONFIG_IPV6)
+   } else {
+   struct ipv6hdr *ipv6h = (struct ipv6hdr *)iph;
+
+   sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
+   &ipv6h->saddr, th->source,
+   &ipv6h->daddr, th->dest,
+   netdev->ifindex, 0);
+#endif
+   }
+   if (!sk || sk->sk_state == TCP_TIME_WAIT)
+   goto out;
+
+   skb->sk = sk;
+   skb->destructor = sock_edemux;
+
+   memcpy(&seq, &mdata->content.recv.sync_seq, sizeof(seq));
+   tls_offload_rx_resync_request(sk, seq);
+out:
+   return 0;
+}
+
+void mlx5e_tls_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,
+u32 *cqe_bcnt)
+{
+   struct mlx5e_tls_metadata *mdata;
+   struct ethhdr *old_eth;
+   struct ethhdr *new_eth;
+   __be16 *ethtype;
+
+   /* Detect inline metadata */
+   if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)
+   return;
+   ethtype = (__be16 *)(skb->data + ETH_ALEN * 2);
+   if (*ethtype != cpu_to_be16(MLX5E_METADATA_ETHER_TYPE))
+   return;
+
+   /* Use the metadata */
+   mdata = (struct mlx5e_tls_metadata *)(skb->data + ETH_HLEN);
+   switch (mdata->content.recv.syndrome) {
+  

[PATCH net-next 14/18] net/mlx5e: TLS, add software statistics

2018-07-04 Thread Boris Pismenny
This patch adds software statistics for TLS to count important
events.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c  |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h  |  4 
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c | 11 ++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 68368c9..7f26869 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -169,7 +169,9 @@ static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk,
 
rx_ctx = mlx5e_get_tls_rx_context(tls_ctx);
 
+   netdev_info(netdev, "resyncing seq %d rcd %lld\n", seq, rcd_sn);
mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn);
+   atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_reply);
 }
 
 static const struct tlsdev_ops mlx5e_tls_ops = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index 2d40ede..3f5d721 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -43,6 +43,10 @@ struct mlx5e_tls_sw_stats {
atomic64_t tx_tls_drop_resync_alloc;
atomic64_t tx_tls_drop_no_sync_data;
atomic64_t tx_tls_drop_bypass_required;
+   atomic64_t rx_tls_drop_resync_request;
+   atomic64_t rx_tls_resync_request;
+   atomic64_t rx_tls_resync_reply;
+   atomic64_t rx_tls_auth_fail;
 };
 
 struct mlx5e_tls {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index d460fda..ecfc764 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -330,8 +330,12 @@ static int tls_update_resync_sn(struct net_device *netdev,
netdev->ifindex, 0);
 #endif
}
-   if (!sk || sk->sk_state == TCP_TIME_WAIT)
+   if (!sk || sk->sk_state == TCP_TIME_WAIT) {
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+
+   atomic64_inc(&priv->tls->sw_stats.rx_tls_drop_resync_request);
goto out;
+   }
 
skb->sk = sk;
skb->destructor = sock_edemux;
@@ -349,6 +353,7 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, 
struct sk_buff *skb,
struct ethhdr *old_eth;
struct ethhdr *new_eth;
__be16 *ethtype;
+   struct mlx5e_priv *priv;
 
/* Detect inline metadata */
if (skb->len < ETH_HLEN + MLX5E_METADATA_ETHER_LEN)
@@ -365,9 +370,13 @@ void mlx5e_tls_handle_rx_skb(struct net_device *netdev, 
struct sk_buff *skb,
break;
case SYNDROM_RESYNC_REQUEST:
tls_update_resync_sn(netdev, skb, mdata);
+   priv = netdev_priv(netdev);
+   atomic64_inc(&priv->tls->sw_stats.rx_tls_resync_request);
break;
case SYNDROM_AUTH_FAILED:
/* Authentication failure will be observed and verified by kTLS 
*/
+   priv = netdev_priv(netdev);
+   atomic64_inc(&priv->tls->sw_stats.rx_tls_auth_fail);
break;
default:
/* Bypass the metadata header to others */
-- 
1.8.3.1



[PATCH net-next 09/18] tls: Add rx inline crypto offload

2018-07-04 Thread Boris Pismenny
This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be re-encrypted,
only to be decrypted again in software.

The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.
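
To make the resync handshake concrete, here is a condensed sketch of the
two halves as this series wires them up (simplified; see the tls_device
and driver patches for the real call sites):

    /* (1) driver RX path: the device reports it lost framing at TCP
     *     sequence number 'seq'; this only records the request
     */
    tls_offload_rx_resync_request(sk, seq);

    /* (2) when tls_read_size() later parses a record header at that
     *     seq, the stack answers through the driver op so the device
     *     can re-arm decryption
     */
    netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk, seq, rcd_sn);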

Signed-off-by: Boris Pismenny 
---
 include/net/tls.h |  63 +-
 net/tls/tls_device.c  | 273 ++
 net/tls/tls_device_fallback.c |   1 +
 net/tls/tls_main.c|  32 +++--
 net/tls/tls_sw.c  |  13 +-
 5 files changed, 344 insertions(+), 38 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 7a485de..d8b3b65 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -83,6 +83,16 @@ struct tls_device {
void (*unhash)(struct tls_device *device, struct sock *sk);
 };
 
+enum {
+   TLS_BASE,
+   TLS_SW,
+#ifdef CONFIG_TLS_DEVICE
+   TLS_HW,
+#endif
+   TLS_HW_RECORD,
+   TLS_NUM_CONFIG,
+};
+
 struct tls_sw_context_tx {
struct crypto_aead *aead_send;
struct crypto_wait async_wait;
@@ -197,6 +207,7 @@ struct tls_context {
int (*push_pending_record)(struct sock *sk, int flags);
 
void (*sk_write_space)(struct sock *sk);
+   void (*sk_destruct)(struct sock *sk);
void (*sk_proto_close)(struct sock *sk, long timeout);
 
int  (*setsockopt)(struct sock *sk, int level,
@@ -209,13 +220,27 @@ struct tls_context {
void (*unhash)(struct sock *sk);
 };
 
+struct tls_offload_context_rx {
+   /* sw must be the first member of tls_offload_context_rx */
+   struct tls_sw_context_rx sw;
+   atomic64_t resync_req;
+   u8 driver_state[];
+   /* The TLS layer reserves room for driver specific state
+* Currently the belief is that there is not enough
+* driver specific state to justify another layer of indirection
+*/
+};
+
+#define TLS_OFFLOAD_CONTEXT_SIZE_RX\
+   (ALIGN(sizeof(struct tls_offload_context_rx), sizeof(void *)) + \
+TLS_DRIVER_STATE_SIZE)
+
 int wait_on_pending_writer(struct sock *sk, long *timeo);
 int tls_sk_query(struct sock *sk, int optname, char __user *optval,
int __user *optlen);
 int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
  unsigned int optlen);
 
-
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
@@ -290,11 +315,19 @@ static inline bool tls_is_pending_open_record(struct 
tls_context *tls_ctx)
return tls_ctx->pending_open_record_frags;
 }
 
+struct sk_buff *
+tls_validate_xmit_skb(struct sock *sk, struct net_device *dev,
+ struct sk_buff *skb);
+
 static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
 {
-   return sk_fullsock(sk) &&
-  /* matches smp_store_release in tls_set_device_offload */
-  smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+   return sk_fullsock(sk) &&
+  (smp_load_acquire(&sk->sk_validate_xmit_skb) ==
+  &tls_validate_xmit_skb);
+#else
+   return false;
+#endif
 }
 
 static inline void tls_err_abort(struct sock *sk, int err)
@@ -387,10 +420,27 @@ static inline struct tls_sw_context_tx *tls_sw_ctx_tx(
return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx;
 }
 
+static inline struct tls_offload_context_rx *
+tls_offload_ctx_rx(const struct tls_context *tls_ctx)
+{
+   return (struct tls_offload_context_rx *)tls_ctx->priv_ctx_rx;
+}
+
+/* The TLS context is valid until sk_destruct is called */
+static inline void tls_offload_rx_resync_request(struct sock *sk, __be32 seq)
+{
+   struct tls_context *tls_ctx = tls_get_ctx(sk);
+   struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx);
+
+   atomic64_set(&rx_ctx->resync_req, ((((uint64_t)seq) << 32) | 1));
+}
+
+
 int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
  unsigned char *record_type);
 void 

Re: [PATCH] rhashtable: add restart routine in rhashtable_free_and_destroy()

2018-07-04 Thread Taehee Yoo
2018-07-04 14:45 GMT+09:00 Herbert Xu :
> On Tue, Jul 03, 2018 at 10:19:09PM +0900, Taehee Yoo wrote:
>>
>> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
>> index 0e04947..8ea27fa 100644
>> --- a/lib/rhashtable.c
>> +++ b/lib/rhashtable.c
>> @@ -1134,6 +1134,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>>   mutex_lock(&ht->mutex);
>>   tbl = rht_dereference(ht->tbl, ht);
>>   if (free_fn) {
>> +restart:
>>   for (i = 0; i < tbl->size; i++) {
>>   struct rhash_head *pos, *next;
>>
>> @@ -1147,9 +1148,11 @@ void rhashtable_free_and_destroy(struct rhashtable 
>> *ht,
>>   rht_dereference(pos->next, ht) : NULL)
>>   rhashtable_free_one(ht, pos, free_fn, arg);
>>   }
>> + tbl = rht_dereference(tbl->future_tbl, ht);
>> + if (tbl)
>> + goto restart;
>>   }
>> -
>> - bucket_table_free(tbl);
>> + bucket_table_free(rht_dereference(ht->tbl, ht));
>

Thank you for reviewing!

> Good catch.  But don't we need to call bucket_table_free on all
> the tables too rather than just the first table?
>

I hadn't thought of that.
It seems that all the tables should be freed.
I will test and send v2 patch.

Thanks!
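
For the record, one possible shape for v2: an untested sketch that
frees every table in the future_tbl chain, not only the first (names
follow the existing function):

    mutex_lock(&ht->mutex);
    tbl = rht_dereference(ht->tbl, ht);
    while (tbl) {
        struct bucket_table *next_tbl;

        next_tbl = rht_dereference(tbl->future_tbl, ht);
        if (free_fn) {
            /* walk all tbl->size buckets and call
             * rhashtable_free_one() on each entry, as before
             */
        }
        bucket_table_free(tbl);
        tbl = next_tbl;
    }
    mutex_unlock(&ht->mutex);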

> Thanks,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH net-next 10/18] net/mlx5e: TLS, refactor variable names

2018-07-04 Thread Boris Pismenny
For symmetry, we rename mlx5e_tls_offload_context to
mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx.

Signed-off-by: Boris Pismenny 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 14 +++---
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c|  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index d167845..7fb9c75 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -123,7 +123,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct 
sock *sk,
goto free_flow;
 
if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
-   struct mlx5e_tls_offload_context *tx_ctx =
+   struct mlx5e_tls_offload_context_tx *tx_ctx =
mlx5e_get_tls_tx_context(tls_ctx);
u32 swid;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index b616217..e26222a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -49,19 +49,19 @@ struct mlx5e_tls {
struct mlx5e_tls_sw_stats sw_stats;
 };
 
-struct mlx5e_tls_offload_context {
-   struct tls_offload_context base;
+struct mlx5e_tls_offload_context_tx {
+   struct tls_offload_context_tx base;
u32 expected_seq;
__be32 swid;
 };
 
-static inline struct mlx5e_tls_offload_context *
+static inline struct mlx5e_tls_offload_context_tx *
 mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
 {
-   BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) >
-TLS_OFFLOAD_CONTEXT_SIZE);
-   return container_of(tls_offload_ctx(tls_ctx),
-   struct mlx5e_tls_offload_context,
+   BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_tx) >
+TLS_OFFLOAD_CONTEXT_SIZE_TX);
+   return container_of(tls_offload_ctx_tx(tls_ctx),
+   struct mlx5e_tls_offload_context_tx,
base);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 15aef71..c96196f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -73,7 +73,7 @@ static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 
swid)
return 0;
 }
 
-static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context,
+static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context_tx 
*context,
   u32 tcp_seq, struct sync_info *info)
 {
int remaining, i = 0, ret = -EINVAL;
@@ -161,7 +161,7 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
 }
 
 static struct sk_buff *
-mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
+mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
 struct mlx5e_txqsq *sq, struct sk_buff *skb,
 struct mlx5e_tx_wqe **wqe,
 u16 *pi,
@@ -239,7 +239,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device 
*netdev,
u16 *pi)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
-   struct mlx5e_tls_offload_context *context;
+   struct mlx5e_tls_offload_context_tx *context;
struct tls_context *tls_ctx;
u32 expected_seq;
int datalen;
-- 
1.8.3.1



[PATCH net-next 04/18] tcp: Don't coalesce decrypted and encrypted SKBs

2018-07-04 Thread Boris Pismenny
Prevent coalescing of decrypted and encrypted SKBs in GRO
and TCP layer.

Signed-off-by: Boris Pismenny 
Signed-off-by: Ilya Lesokhin 
---
 net/ipv4/tcp_input.c   | 12 
 net/ipv4/tcp_offload.c |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 814ea43..f89d86a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4343,6 +4343,11 @@ static bool tcp_try_coalesce(struct sock *sk,
if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
return false;
 
+#ifdef CONFIG_TLS_DEVICE
+   if (from->decrypted != to->decrypted)
+   return false;
+#endif
+
+   if (!skb_try_coalesce(to, from, fragstolen, &delta))
return false;
 
@@ -4872,6 +4877,9 @@ void tcp_rbtree_insert(struct rb_root *root, struct 
sk_buff *skb)
break;
 
memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
+#ifdef CONFIG_TLS_DEVICE
+   nskb->decrypted = skb->decrypted;
+#endif
TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
if (list)
__skb_queue_before(list, skb, nskb);
@@ -4899,6 +4907,10 @@ void tcp_rbtree_insert(struct rb_root *root, struct 
sk_buff *skb)
skb == tail ||
(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | 
TCPHDR_FIN)))
goto end;
+#ifdef CONFIG_TLS_DEVICE
+   if (skb->decrypted != nskb->decrypted)
+   goto end;
+#endif
}
}
}
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index f5aee64..870b0a3 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, 
struct sk_buff *skb)
 
flush |= (len - 1) >= mss;
flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq);
+#ifdef CONFIG_TLS_DEVICE
+   flush |= p->decrypted ^ skb->decrypted;
+#endif
 
if (flush || skb_gro_receive(p, skb)) {
mss = 1;
-- 
1.8.3.1



[PATCH net-next 12/18] net/mlx5e: TLS, add innova rx support

2018-07-04 Thread Boris Pismenny
Add the mlx5 implementation of the TLS Rx routines to add/del TLS
contexts, and also add the tls_dev_resync_rx routine to work with the
TLS inline Rx crypto offload infrastructure.

Signed-off-by: Boris Pismenny 
Signed-off-by: Ilya Lesokhin 
Reviewed-by: Aviad Yehezkel 
Reviewed-by: Tariq Toukan 
---
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 46 +++---
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 15 +++
 2 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 7fb9c75..68368c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -110,9 +110,7 @@ static int mlx5e_tls_add(struct net_device *netdev, struct 
sock *sk,
u32 caps = mlx5_accel_tls_device_caps(mdev);
int ret = -ENOMEM;
void *flow;
-
-   if (direction != TLS_OFFLOAD_CTX_DIR_TX)
-   return -EINVAL;
+   u32 swid;
 
flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL);
if (!flow)
@@ -122,18 +120,23 @@ static int mlx5e_tls_add(struct net_device *netdev, 
struct sock *sk,
if (ret)
goto free_flow;
 
+   ret = mlx5_accel_tls_add_flow(mdev, flow, crypto_info,
+ start_offload_tcp_sn, &swid,
+ direction == TLS_OFFLOAD_CTX_DIR_TX);
+   if (ret < 0)
+   goto free_flow;
+
if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
struct mlx5e_tls_offload_context_tx *tx_ctx =
mlx5e_get_tls_tx_context(tls_ctx);
-   u32 swid;
-
-   ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info,
-start_offload_tcp_sn, &swid);
-   if (ret < 0)
-   goto free_flow;
 
tx_ctx->swid = htonl(swid);
tx_ctx->expected_seq = start_offload_tcp_sn;
+   } else {
+   struct mlx5e_tls_offload_context_rx *rx_ctx =
+   mlx5e_get_tls_rx_context(tls_ctx);
+
+   rx_ctx->handle = htonl(swid);
}
 
return 0;
@@ -147,19 +150,32 @@ static void mlx5e_tls_del(struct net_device *netdev,
  enum tls_offload_ctx_dir direction)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
+   unsigned int handle;
 
-   if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
-   u32 swid = ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid);
+   handle = ntohl((direction == TLS_OFFLOAD_CTX_DIR_TX) ?
+  mlx5e_get_tls_tx_context(tls_ctx)->swid :
+  mlx5e_get_tls_rx_context(tls_ctx)->handle);
 
-   mlx5_accel_tls_del_tx_flow(priv->mdev, swid);
-   } else {
-   netdev_err(netdev, "unsupported direction %d\n", direction);
-   }
+   mlx5_accel_tls_del_flow(priv->mdev, handle,
+   direction == TLS_OFFLOAD_CTX_DIR_TX);
+}
+
+static void mlx5e_tls_resync_rx(struct net_device *netdev, struct sock *sk,
+   u32 seq, u64 rcd_sn)
+{
+   struct tls_context *tls_ctx = tls_get_ctx(sk);
+   struct mlx5e_priv *priv = netdev_priv(netdev);
+   struct mlx5e_tls_offload_context_rx *rx_ctx;
+
+   rx_ctx = mlx5e_get_tls_rx_context(tls_ctx);
+
+   mlx5_accel_tls_resync_rx(priv->mdev, rx_ctx->handle, seq, rcd_sn);
 }
 
 static const struct tlsdev_ops mlx5e_tls_ops = {
.tls_dev_add = mlx5e_tls_add,
.tls_dev_del = mlx5e_tls_del,
+   .tls_dev_resync_rx = mlx5e_tls_resync_rx,
 };
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index e26222a..2d40ede 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -65,6 +65,21 @@ struct mlx5e_tls_offload_context_tx {
base);
 }
 
+struct mlx5e_tls_offload_context_rx {
+   struct tls_offload_context_rx base;
+   __be32 handle;
+};
+
+static inline struct mlx5e_tls_offload_context_rx *
+mlx5e_get_tls_rx_context(struct tls_context *tls_ctx)
+{
+   BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context_rx) >
+TLS_OFFLOAD_CONTEXT_SIZE_RX);
+   return container_of(tls_offload_ctx_rx(tls_ctx),
+   struct mlx5e_tls_offload_context_rx,
+   base);
+}
+
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
 int mlx5e_tls_init(struct mlx5e_priv *priv);
 void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
-- 
1.8.3.1



Re: [PATCH net] smsc75xx: Add workaround for gigabit link up hardware errata.

2018-07-04 Thread David Miller
From: Yuiko Oshino 
Date: Tue, 3 Jul 2018 11:21:46 -0400

> In certain conditions, the device may not be able to link in gigabit mode. 
> This software workaround ensures that the device will not enter the failure 
> state.
> 
> Fixes: d0cad871703b898a442e4049c532ec39168e5b57 ("SMSC75XX USB 2.0 Gigabit 
> Ethernet Devices")
> Signed-off-by: Yuiko Oshino 

Applied.


Re: [PATCH v2] net: usb: asix: allow optionally getting mac address from device tree

2018-07-04 Thread David Miller
From: Marcel Ziswiler 
Date: Tue,  3 Jul 2018 17:06:49 +0200

> From: Marcel Ziswiler 
> 
> For Embedded use where e.g. AX88772B chips may be used without external
> EEPROMs the boot loader may choose to pass the MAC address to be used
> via device tree. Therefore, allow for optionally getting the MAC
> address from device tree data e.g. as follows (excerpt from a T30 based
> board, local-mac-address to be filled in by boot loader):
> 
> /* EHCI instance 1: USB2_DP/N -> AX88772B */
> usb@7d004000 {
>   status = "okay";
>   #address-cells = <1>;
>   #size-cells = <0>;
>   asix@1 {
>   reg = <1>;
>   local-mac-address = [00 00 00 00 00 00];
>   };
> };
> 
> Signed-off-by: Marcel Ziswiler 

Applied to net-next.


Re: [PATCH net-next] net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()

2018-07-04 Thread David Miller
From: Wei Yongjun 
Date: Tue, 3 Jul 2018 13:45:12 +

> 'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init()
> but not all of the error handling paths free it, which may cause a
> memory leak. This patch fixes it.
> 
> Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the 
> conventional network headers")
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH net-next 0/2] bridge: iproute2 isolated port and selftests

2018-07-04 Thread David Miller
From: Nikolay Aleksandrov 
Date: Tue,  3 Jul 2018 15:42:41 +0300

> Add support to iproute2 for port isolation config and selftests for it.

Series applied, thanks Nikolay.


Re: [PATCHv2 net] sctp: fix the issue that pathmtu may be set lower than MINSEGMENT

2018-07-04 Thread David Miller
From: Xin Long 
Date: Tue,  3 Jul 2018 16:30:47 +0800

> After commit b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed
> for too small MTUs"), sctp_transport_update_pmtu would refetch pathmtu
> from the dst and set it to transport's pathmtu without any check.
> 
> The new pathmtu may be lower than MINSEGMENT if the dst is obsolete and
> updated by .get_dst() in sctp_transport_update_pmtu. In this case, it
> could have a smaller MTU as well, and thus we should validate it
> against MINSEGMENT instead.
> 
> Syzbot reported a warning in sctp_mtu_payload caused by this.
> 
> This patch refetches the pathmtu by calling sctp_dst_mtu where it does
> the check against MINSEGMENT.
> 
> v1->v2:
>   - refetch the pathmtu by calling sctp_dst_mtu instead as Marcelo's
> suggestion.
> 
> Fixes: b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too 
> small MTUs")
> Reported-by: syzbot+f0d9d7cba052f9344...@syzkaller.appspotmail.com
> Suggested-by: Marcelo Ricardo Leitner 
> Signed-off-by: Xin Long 

Applied and queued up for -stable.
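
For context, a sketch of the helper the fix relies on, roughly as in
include/net/sctp/sctp.h at the time of this series (it clamps the dst
MTU to at least SCTP_DEFAULT_MINSEGMENT and rounds down to a multiple
of four):

    static inline __u32 sctp_dst_mtu(const struct dst_entry *dst)
    {
        return SCTP_TRUNC4(max_t(__u32, dst_mtu(dst),
                                 SCTP_DEFAULT_MINSEGMENT));
    }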


Re: [PATCH v5 net-next] net:sched: add action inheritdsfield to skbedit

2018-07-04 Thread David Miller
From: Qiaobin Fu 
Date: Sun,  1 Jul 2018 15:16:27 -0400

> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> v5:
> *Update the drop counter for TC_ACT_SHOT
> 
> v4:
> *Not allow setting flags other than the expected ones.
> 
> *Allow dumping the pure flags.
> 
> v3:
> *Use optional flags, so that it won't break old versions of tc.
> 
> *Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD 
> flags.
> 
> v2:
> *Fix the style issue
> 
> *Move the code from skbmod to skbedit
> 
> Original idea by Jamal Hadi Salim 
> 
> Signed-off-by: Qiaobin Fu 
> Reviewed-by: Michel Machado 
> Acked-by: Jamal Hadi Salim 
> Reviewed-by: Marcelo Ricardo Leitner 
> Acked-by: Davide Caratti 

Applied.


[PATCH net-next] cxgb4: Fix the condition to check if the card is T5

2018-07-04 Thread Ganesh Goudar
Use 'chip_ver' rather than 'chip' to check if the card
is T5.

Fixes: e8d452923ae6 ("cxgb4: clean up init_one")
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 1be30bc..93b4b5a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5766,7 +5766,7 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (t4_read_reg(adapter, LE_DB_CONFIG_A) & HASHEN_F) {
u32 hash_base, hash_reg;
 
-   if (chip <= CHELSIO_T5) {
+   if (chip_ver <= CHELSIO_T5) {
hash_reg = LE_DB_TID_HASHBASE_A;
hash_base = t4_read_reg(adapter, hash_reg);
adapter->tids.hash_base = hash_base / 4;
-- 
2.1.0



[PATCH] epic100: remove redundant variable 'irq'

2018-07-04 Thread Colin King
From: Colin Ian King 

Variable 'irq' is being assigned but is never used, hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'irq' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/smsc/epic100.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/smsc/epic100.c 
b/drivers/net/ethernet/smsc/epic100.c
index 949aaef390b6..15c62c160953 100644
--- a/drivers/net/ethernet/smsc/epic100.c
+++ b/drivers/net/ethernet/smsc/epic100.c
@@ -321,7 +321,6 @@ static int epic_init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
static int card_idx = -1;
void __iomem *ioaddr;
int chip_idx = (int) ent->driver_data;
-   int irq;
struct net_device *dev;
struct epic_private *ep;
int i, ret, option = 0, duplex = 0;
@@ -338,7 +337,6 @@ static int epic_init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
ret = pci_enable_device(pdev);
if (ret)
goto out;
-   irq = pdev->irq;
 
if (pci_resource_len(pdev, 0) < EPIC_TOTAL_SIZE) {
dev_err(>dev, "no PCI region space\n");
-- 
2.17.1



Re: [PATCHv2 net] sctp: fix the issue that pathmtu may be set lower than MINSEGMENT

2018-07-04 Thread Neil Horman
On Tue, Jul 03, 2018 at 04:30:47PM +0800, Xin Long wrote:
> After commit b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed
> for too small MTUs"), sctp_transport_update_pmtu would refetch pathmtu
> from the dst and set it to transport's pathmtu without any check.
> 
> The new pathmtu may be lower than MINSEGMENT if the dst is obsolete and
> updated by .get_dst() in sctp_transport_update_pmtu. In this case, it
> could have a smaller MTU as well, and thus we should validate it
> against MINSEGMENT instead.
> 
> Syzbot reported a warning in sctp_mtu_payload caused by this.
> 
> This patch refetches the pathmtu by calling sctp_dst_mtu where it does
> the check against MINSEGMENT.
> 
> v1->v2:
>   - refetch the pathmtu by calling sctp_dst_mtu instead as Marcelo's
> suggestion.
> 
> Fixes: b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too 
> small MTUs")
> Reported-by: syzbot+f0d9d7cba052f9344...@syzkaller.appspotmail.com
> Suggested-by: Marcelo Ricardo Leitner 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/transport.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index 445b7ef..12cac85 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -282,7 +282,7 @@ bool sctp_transport_update_pmtu(struct sctp_transport *t, 
> u32 pmtu)
>  
>   if (dst) {
>   /* Re-fetch, as under layers may have a higher minimum size */
> - pmtu = SCTP_TRUNC4(dst_mtu(dst));
> + pmtu = sctp_dst_mtu(dst);
>   change = t->pathmtu != pmtu;
>   }
>   t->pathmtu = pmtu;
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman 


RE: [PATCH net] qed: off by one in qed_parse_mcp_trace_buf()

2018-07-04 Thread Tayar, Tomer
From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
Sent: Wednesday, July 04, 2018 12:53 PM

> If format_idx == s_mcp_trace_meta.formats_num then we read one element
> beyond the end of the s_mcp_trace_meta.formats[] array.
> 
> Fixes: 50bc60cb155c ("qed*: Utilize FW 8.33.11.0")
> Signed-off-by: Dan Carpenter 

Thanks
Acked-by: Tomer Tayar 


[PATCH net] ixgbe: Off by one in ixgbe_ipsec_tx()

2018-07-04 Thread Dan Carpenter
The ipsec->tx_tbl[] has IXGBE_IPSEC_MAX_SA_COUNT elements, so the > needs
to be changed to >= so that we don't read one element beyond the end of
the array.

Fixes: 592594704761 ("ixgbe: process the Tx ipsec offload")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index c116f459945d..da4322e4daed 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -839,7 +839,7 @@ int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
}
 
itd->sa_idx = xs->xso.offload_handle - IXGBE_IPSEC_BASE_TX_INDEX;
-   if (unlikely(itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT)) {
+   if (unlikely(itd->sa_idx >= IXGBE_IPSEC_MAX_SA_COUNT)) {
netdev_err(tx_ring->netdev, "%s: bad sa_idx=%d handle=%lu\n",
   __func__, itd->sa_idx, xs->xso.offload_handle);
return 0;


[PATCH net] qed: off by one in qed_parse_mcp_trace_buf()

2018-07-04 Thread Dan Carpenter
If format_idx == s_mcp_trace_meta.formats_num then we read one element
beyond the end of the s_mcp_trace_meta.formats[] array.

Fixes: 50bc60cb155c ("qed*: Utilize FW 8.33.11.0")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c 
b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index a14e48489029..4340c4c90bcb 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -6723,7 +6723,7 @@ static enum dbg_status qed_parse_mcp_trace_buf(u8 
*trace_buf,
format_idx = header & MFW_TRACE_EVENTID_MASK;
 
/* Skip message if its index doesn't exist in the meta data */
-   if (format_idx > s_mcp_trace_meta.formats_num) {
+   if (format_idx >= s_mcp_trace_meta.formats_num) {
u8 format_size =
(u8)((header & MFW_TRACE_PRM_SIZE_MASK) >>
 MFW_TRACE_PRM_SIZE_SHIFT);


[PATCH net-next] cxgb4: Add support to read actual provisioned resources

2018-07-04 Thread Ganesh Goudar
From: Casey Leedom 

In highly constrained resource environments (like the 124VF
T5 and 248VF T6 configurations), PF4 may not have very many
resources at all and we need to adapt to whatever we've been
allocated. This patch adds support to read the actual
provisioned resources.

Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  17 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  39 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 134 +++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |  51 
 4 files changed, 206 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 4a8cbd8..3da9299 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -320,6 +320,21 @@ struct vpd_params {
u8 na[MACADDR_LEN + 1];
 };
 
+/* Maximum resources provisioned for a PCI PF.
+ */
+struct pf_resources {
+   unsigned int nvi;   /* N virtual interfaces */
+   unsigned int neq;   /* N egress Qs */
+   unsigned int nethctrl;  /* N egress ETH or CTRL Qs */
+   unsigned int niqflint;  /* N ingress Qs/w free list(s) & intr */
+   unsigned int niq;   /* N ingress Qs */
+   unsigned int tc;/* PCI-E traffic class */
+   unsigned int pmask; /* port access rights mask */
+   unsigned int nexactf;   /* N exact MPS filters */
+   unsigned int r_caps;/* read capabilities */
+   unsigned int wx_caps;   /* write/execute capabilities */
+};
+
 struct pci_params {
unsigned int vpd_cap_addr;
unsigned char speed;
@@ -347,6 +362,7 @@ struct adapter_params {
struct sge_params sge;
struct tp_params  tp;
struct vpd_params vpd;
+   struct pf_resources pfres;
struct pci_params pci;
struct devlog_params devlog;
enum pcie_memwin drv_memwin;
@@ -1568,6 +1584,7 @@ int t4_eeprom_ptov(unsigned int phys_addr, unsigned int 
fn, unsigned int sz);
 int t4_seeprom_wp(struct adapter *adapter, bool enable);
 int t4_get_raw_vpd_params(struct adapter *adapter, struct vpd_params *p);
 int t4_get_vpd_params(struct adapter *adapter, struct vpd_params *p);
+int t4_get_pfres(struct adapter *adapter);
 int t4_read_flash(struct adapter *adapter, unsigned int addr,
  unsigned int nwords, u32 *data, int byte_oriented);
 int t4_load_fw(struct adapter *adapter, const u8 *fw_data, unsigned int size);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index c301aaf..516c883 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2414,6 +2414,44 @@ static const struct file_operations 
rss_vf_config_debugfs_fops = {
.release = seq_release_private
 };
 
+static int resources_show(struct seq_file *seq, void *v)
+{
+   struct adapter *adapter = seq->private;
+   struct pf_resources *pfres = &adapter->params.pfres;
+
+   #define S(desc, fmt, var) \
+   seq_printf(seq, "%-60s " fmt "\n", \
+  desc " (" #var "):", pfres->var)
+
+   S("Virtual Interfaces", "%d", nvi);
+   S("Egress Queues", "%d", neq);
+   S("Ethernet Control", "%d", nethctrl);
+   S("Ingress Queues/w Free Lists/Interrupts", "%d", niqflint);
+   S("Ingress Queues", "%d", niq);
+   S("Traffic Class", "%d", tc);
+   S("Port Access Rights Mask", "%#x", pmask);
+   S("MAC Address Filters", "%d", nexactf);
+   S("Firmware Command Read Capabilities", "%#x", r_caps);
+   S("Firmware Command Write/Execute Capabilities", "%#x", wx_caps);
+
+   #undef S
+
+   return 0;
+}
+
+static int resources_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, resources_show, inode->i_private);
+}
+
+static const struct file_operations resources_debugfs_fops = {
+   .owner   = THIS_MODULE,
+   .open= resources_open,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+   .release = seq_release,
+};
+
 /**
  * ethqset2pinfo - return port_info of an Ethernet Queue Set
  * @adap: the adapter
@@ -2973,6 +3011,7 @@ int t4_setup_debugfs(struct adapter *adap)
{ "rss_key", _key_debugfs_fops, 0400, 0 },
{ "rss_pf_config", _pf_config_debugfs_fops, 0400, 0 },
{ "rss_vf_config", _vf_config_debugfs_fops, 0400, 0 },
+   { "resources", _debugfs_fops, 0400, 0 },
{ "sge_qinfo", _qinfo_debugfs_fops, 0400, 0 },
{ "ibq_tp0",  _ibq_fops, 0400, 0 },
{ "ibq_tp1",  _ibq_fops, 0400, 1 },
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 

[PATCH][net-next] net: limit each hash list length to MAX_GRO_SKBS

2018-07-04 Thread Li RongQing
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash
table.") there are 8 hash buckets, which allows more flows to be held for
merging. But MAX_GRO_SKBS, the total number of skbs held for merging, is
still 8, which limits the benefit of the hash table.

Keep MAX_GRO_SKBS at 8, but apply the limit to the length of each hash
list instead of to the total number of held skbs.
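
The dev_gro_receive() hunk that enforces the new cap is cut off at the
end of this mail; a sketch of the intended per-bucket check (names from
this patch, exact placement approximate):

    if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
        /* bucket full: make room by completing its oldest skb */
        gro_flush_oldest(&napi->gro_hash[hash].list);
    } else {
        napi->gro_hash[hash].count++;
        napi->gro_count++;
    }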

Signed-off-by: Li RongQing 
---
 include/linux/netdevice.h |  7 +-
 net/core/dev.c| 54 +++
 2 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf8d6149f79..3b60ac51ddba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -302,6 +302,11 @@ struct netdev_boot_setup {
 
 int __init netdev_boot_setup(char *str);
 
+struct gro_list {
+   struct list_headlist;
+   int count;
+};
+
 /*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
@@ -323,7 +328,7 @@ struct napi_struct {
int poll_owner;
 #endif
struct net_device   *dev;
-   struct list_headgro_hash[GRO_HASH_BUCKETS];
+   struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff  *skb;
struct hrtimer  timer;
struct list_headdev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..f8cdc27ee276 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,7 +149,6 @@
 
 #include "net-sysfs.h"
 
-/* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8
 
 /* This should be increased if a protocol with a bigger head is added. */
@@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb)
return netif_receive_skb_internal(skb);
 }
 
-static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head 
*head,
+static void __napi_gro_flush_chain(struct napi_struct *napi, int index,
   bool flush_old)
 {
struct sk_buff *skb, *p;
+   struct list_head *head = &napi->gro_hash[index].list;
 
list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
@@ -5000,10 +5000,11 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, struct list_head *h
list_del_init(>list);
napi_gro_complete(skb);
napi->gro_count--;
+   napi->gro_hash[index].count--;
}
 }
 
-/* napi->gro_hash contains packets ordered by age.
+/* napi->gro_hash[].list contains packets ordered by age.
  * youngest packets at the head of it.
  * Complete skbs in reverse order to reduce latencies.
  */
@@ -5011,11 +5012,8 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 {
int i;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-
-   __napi_gro_flush_chain(napi, head, flush_old);
-   }
+   for (i = 0; i < GRO_HASH_BUCKETS; i++)
+   __napi_gro_flush_chain(napi, i, flush_old);
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5027,7 +5025,7 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
struct list_head *head;
struct sk_buff *p;
 
-   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)];
+   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list;
list_for_each_entry(p, head, list) {
unsigned long diffs;
 
@@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, 
int grow)
}
 }
 
-static void gro_flush_oldest(struct napi_struct *napi)
+static void gro_flush_oldest(struct list_head *head)
 {
-   struct sk_buff *oldest = NULL;
-   unsigned long age = jiffies;
-   int i;
+   struct sk_buff *oldest;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-   struct sk_buff *skb;
+   oldest = list_last_entry(head, struct sk_buff, list);
 
-   if (list_empty(head))
-   continue;
-
-   skb = list_last_entry(head, struct sk_buff, list);
-   if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) {
-   oldest = skb;
-   age = NAPI_GRO_CB(skb)->age;
-   }
-   }
-
-   /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is
+   /* We are called with head length >= MAX_GRO_SKBS, so this is
 * impossible.
 */
if (WARN_ON_ONCE(!oldest))
@@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
enum gro_result ret;
int same_flow;
int grow;
+   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
 
if (netif_elide_gro(skb->dev))
goto normal;
@@ -5196,6 +5181,7 @@ static enum 

Re: [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support

2018-07-04 Thread Daniel Borkmann
On 06/27/2018 04:46 AM, Saeed Mahameed wrote:
[...]
> @@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>   return false;
>  
>   xdp.data = va + *rx_headroom;
> - xdp_set_data_meta_invalid(&xdp);
>   xdp.data_end = xdp.data + *len;
>   xdp.data_hard_start = va;
>   xdp.rxq = &rq->xdp_rxq;
>  
> + if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
> + xdp_reset_data_meta(&xdp);
> + mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
> + } else
> + xdp_set_data_meta_invalid(&xdp);
> +

Just a quick note on this one: it would actually be great not to set
xdp_set_data_meta_invalid() in the else path, as this meta buffer
should also be usable independently of hw hints. Meaning, in any case
it would be great if mlx5 + mlx4 could implement the xdp->data_meta
support we have today; this would probably be a good first step
anyway. So far it is supported on i40e, ixgbe, ixgbevf, nfp.
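
For reference, a sketch of the plain data_meta setup being referred to,
in the style of the i40e-class drivers (variable names kept from the
hunk quoted above; programs then grow the area with
bpf_xdp_adjust_meta()):

    xdp.data = va + *rx_headroom;
    xdp.data_meta = xdp.data;
    xdp.data_end = xdp.data + *len;
    xdp.data_hard_start = va;
    xdp.rxq = &rq->xdp_rxq;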

>   act = bpf_prog_run_xdp(prog, &xdp);
>   switch (act) {
>   case XDP_PASS:
> 

Thanks,
Daniel


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-04 Thread Daniel Borkmann
On 07/04/2018 02:57 AM, Saeed Mahameed wrote:
> On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
[...]
>> How about we make driver+firmware provide a BTF definition of
>> metadata that they
>> can provide? There can be multiple definitions of such structs.
>> Then in userpsace we can have BTF->plain C converter.
>> (bpftool practically ready to do that already).
>> Then the programmer can take such generated C definition, add it to
>> .h and include
>> it in their programs. llvm will compile the whole thing and will
>> include BTF
>> of maps, progs and this md struct in the target elf file.
>> During loading the kernel can check that BTF in elf is matching one-
>> to-one
>> to what driver+firmware are saying they support.

I do like the above idea of utilizing BTF for this, seems like a good fit.

> Just thinking out loud: can't we do this at program load? Just run a
> setup function in the xdp program to load the NIC md BTF definition
> into the elf section?
> 
>> No ambiguity and no possibility of mistake, since offsets and field
>> names
>> are verified.
> 
> But what about the dynamic nature of this feature? Sometimes you only
> want the HW/driver to provide a subset of whatever the HW can provide
> and save md buffer space for other stuff.
> 
> Yes, a well-defined format is favorable here, but we need to make sure
> there is no computational overhead in the data path just to extract each
> field! For example, if I want to know the offset of the hash, will I
> need to parse (for every packet) the whole BTF definition of the
> metadata just to find the offset of type=hash?

I don't think it would be the case that you'd need to walk BTF in the fast
path here. In the ideal case, the only thing the driver would need to do
in the fast path would be to set the proper xdp->data_meta offset and
_that_ would be it. For the rest, the program would know how to access the
data since it's already aware of it from the BTF definition the driver
provided. Other drivers which are less flexible in that regard would
internally prep the buffer based on the prog's needs, more or less as in
mlx5e_xdp_fill_data_meta(), but it would really be up to the driver how to
handle this internally. The driver would check the BTF at XDP setup time
to do whatever configuration it needs. The verifier would only check the
BTF and pass it along for XDP setup; prog rewrites in the verifier aren't
even needed since LLVM compiled everything already.
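
A minimal sketch of the program side under that model; md_hints and its
fields are hypothetical stand-ins for what bpftool would generate from
the driver's BTF:

    struct md_hints {           /* hypothetical, BTF-described */
        __u32 rx_hash;
        __u16 vlan_tci;
        __u16 flags;
    };

    SEC("xdp")
    int xdp_use_hints(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *meta = (void *)(long)ctx->data_meta;
        struct md_hints *hints = meta;

        /* verifier-mandated bounds check on the meta area */
        if ((void *)(hints + 1) > data)
            return XDP_PASS;

        /* fields are used directly, no per-packet BTF parsing */
        return hints->rx_hash ? XDP_PASS : XDP_DROP;
    }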

>> Every driver can have their own BTF for md and their own special
>> features.
>> We can try to standardize the names (like vlan and csum), so xdp
>> programs
>> can stay relatively portable across NICs.
> 
> Yes this is a must.

Agree, there needs to be a basic common set that would be provided by
every XDP aware driver.

>> Such api will address exposing asic+firmware metadata to the xdp
>> program.
>> Once we tackle this problem, we'll think how to do the backward
>> config
>> (to do firmware reconfig for specific BTF definition of md supplied
>> by the prog).
>> What people think?
> 
> For legacy HW, we can do it already in the driver and provide whatever
> the prog requested; it's only a matter of translating to the BTF format
> in the driver xdp setup and pushing the values accordingly into the md
> offsets on the data path.
> 
> Question: how can you share the md BTF from the driver/HW with the xdp
> program?

I think this would likely be a new query as in XDP_QUERY_META_BTF
implemented in ndo_bpf callback and then exported e.g. via bpf(2)
or netlink such that bpftool can generate BTF -> C from there for the
program to include later in compilation.

Thanks,
Daniel


Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1

2018-07-04 Thread Peter Robinson
On Tue, Jun 26, 2018 at 1:52 PM, Daniel Borkmann  wrote:
> On 06/26/2018 02:23 PM, Peter Robinson wrote:
> On 06/24/2018 11:24 AM, Peter Robinson wrote:
 I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
 a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
 (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
 others, both LPAE/normal kernels.
>
> So this is arm32 right?

 Correct.

 I'm a bit out of my depth in this part of the kernel but I'm wondering
 if it's known, I couldn't find anything that looked obvious on a few
 mailing lists.

 Peter
>>>
>>> Hi Peter
>>>
>>> Could you provide symbolic information ?
>>
>> I passed it through scripts/decode_stacktrace.sh, is that what you were
>> after:
>>
>> [8.673880] Internal error: Oops: a06 [#10] SMP ARM
>> [8.673949] ---[ end trace 049df4786ea3140a ]---
>> [8.678754] Modules linked in:
>> [8.678766] CPU: 1 PID: 206 Comm: systemd-udevd Tainted: G  D
>> 4.18.0-0.rc1.git0.1.fc29.armv7hl+lpae #1
>> [8.678769] Hardware name: Allwinner sun8i Family
>> [8.678781] PC is at sk_filter_trim_cap ()
>> [8.678790] LR is at   (null)
>> [8.709463] pc : lr : psr: 6013 ()
>> [8.715722] sp : c996bd60  ip :   fp : 
>> [8.720939] r10: ee79dc00  r9 : c12c9f80  r8 : 
>> [8.726157] r7 :   r6 : 0001  r5 : f1648000  r4 : 
>> [8.732674] r3 : 0007  r2 :   r1 :   r0 : 
>> [8.739193] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
>> Segment user
>> [8.746318] Control: 30c5387d  Table: 6e7bc880  DAC: ffe75ece
>> [8.752055] Process systemd-udevd (pid: 206, stack limit = 0x(ptrval))
>> [8.758574] Stack: (0xc996bd60 to 0xc996c000)
>
> Do you have BPF JIT enabled or disabled? Does it happen with disabled?

 Enabled, I can test with it disabled, BPF configs bits are:
 CONFIG_BPF_EVENTS=y
 # CONFIG_BPFILTER is not set
 CONFIG_BPF_JIT_ALWAYS_ON=y
 CONFIG_BPF_JIT=y
 CONFIG_BPF_STREAM_PARSER=y
 CONFIG_BPF_SYSCALL=y
 CONFIG_BPF=y
 CONFIG_CGROUP_BPF=y
 CONFIG_HAVE_EBPF_JIT=y
 CONFIG_IPV6_SEG6_BPF=y
 CONFIG_LWTUNNEL_BPF=y
 # CONFIG_NBPFAXI_DMA is not set
 CONFIG_NET_ACT_BPF=m
 CONFIG_NET_CLS_BPF=m
 CONFIG_NETFILTER_XT_MATCH_BPF=m
 # CONFIG_TEST_BPF is not set

> I can see one bug, but your stack trace seems unrelated.
>
> Anyway, could you try with this?

 Build in process.

> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 6e8b716..f6a62ae 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct 
> bpf_prog *prog)
> /* there are 2 passes here */
> bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>
> -   set_memory_ro((unsigned long)header, header->pages);
> +   bpf_jit_binary_lock_ro(header);
> prog->bpf_func = (void *)ctx.target;
> prog->jited = 1;
> prog->jited_len = image_size;
>>>
>>> So with that and the other fix there was no improvement; with those
>>> and the BPF JIT disabled it works. I'm not sure if the two patches
>>> have any effect with the JIT disabled though.
>>>
>>> Will look at the other patches shortly, there's been some other issue
>>> introduced between rc1 and rc2 which I have to work out before I can
>>> test those though.
>>
>> Quick update, with linus's head as of yesterday, basically rc2 plus
>> davem's network fixes it works if the JIT is disabled IE:
>> # CONFIG_BPF_JIT_ALWAYS_ON is not set
>> # CONFIG_BPF_JIT is not set
>>
>> If I enable it the boot breaks even worse than the errors above in
>> that I get no console output at all, even with earlycon, so we've gone
>> backwards since rc1 somehow.
>>
>> I'll try the above two reverted unless you have any other suggestions.
>
> Ok, thanks, lets do that!
>
> I'm still working on fixes meanwhile, should have something by end of day.

Sorry for the delay on this from my end. I noticed some bpf bits landed
in the last net fixes pull request on Monday, so I built a kernel with
the JIT re-enabled. Things have improved in that the completely dead
no-output boot is gone, but the original problem that arrived in the
merge window still persists:

[   17.564142] note: systemd-udevd[194] exited with preempt_count 1
[   17.592739] Unable to handle kernel NULL pointer dereference at
virtual address 000c
[   17.601002] pgd = (ptrval)
[   17.603819] [000c] *pgd=
[   17.607487] Internal error: Oops: 805 [#10] SMP ARM
[   17.612396] Modules linked in:
[   17.615484] CPU: 0 

Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-07-04 Thread Andrei Vagin
Hello Amritha,

I see the following warning on 4.18.0-rc3-next-20180703.
It looks like the problem is in this series.

[1.084722] 
[1.084797] WARNING: possible recursive locking detected
[1.084872] 4.18.0-rc3-next-20180703+ #1 Not tainted
[1.084949] 
[1.085024] swapper/0/1 is trying to acquire lock:
[1.085100] cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
static_key_slow_inc+0xe/0x20
[1.085189] 
[1.085189] but task is already holding lock:
[1.085271] cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
init_vqs+0x513/0x5a0
[1.085357] 
[1.085357] other info that might help us debug this:
[1.085450]  Possible unsafe locking scenario:
[1.085450] 
[1.085531]CPU0
[1.085605]
[1.085679]   lock(cpu_hotplug_lock.rw_sem);
[1.085753]   lock(cpu_hotplug_lock.rw_sem);
[1.085828] 
[1.085828]  *** DEADLOCK ***
[1.085828] 
[1.085916]  May be due to missing lock nesting notation
[1.085916] 
[1.085998] 3 locks held by swapper/0/1:
[1.086074]  #0: 244bc7da (>mutex){}, at: 
__driver_attach+0x5a/0x110
[1.086164]  #1: cf973d46 (cpu_hotplug_lock.rw_sem){}, at: 
init_vqs+0x513/0x5a0
[1.086248]  #2: 5cd8463f (xps_map_mutex){+.+.}, at: 
__netif_set_xps_queue+0x8d/0xc60
[1.086336] 
[1.086336] stack backtrace:
[1.086419] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.18.0-rc3-next-20180703+ #1
[1.086504] Hardware name: Google Google Compute Engine/Google Compute 
Engine, BIOS Google 01/01/2011
[1.086587] Call Trace:
[1.086667]  dump_stack+0x85/0xcb
[1.086744]  __lock_acquire+0x68a/0x1330
[1.086821]  ? lock_acquire+0x9f/0x200
[1.086900]  ? find_held_lock+0x2d/0x90
[1.086976]  ? lock_acquire+0x9f/0x200
[1.087051]  lock_acquire+0x9f/0x200
[1.087126]  ? static_key_slow_inc+0xe/0x20
[1.087205]  cpus_read_lock+0x3e/0x80
[1.087280]  ? static_key_slow_inc+0xe/0x20
[1.087355]  static_key_slow_inc+0xe/0x20
[1.087435]  __netif_set_xps_queue+0x216/0xc60
[1.087512]  virtnet_set_affinity+0xf0/0x130
[1.087589]  init_vqs+0x51b/0x5a0
[1.087665]  virtnet_probe+0x39f/0x870
[1.087742]  virtio_dev_probe+0x170/0x220
[1.087819]  driver_probe_device+0x30b/0x480
[1.087897]  ? set_debug_rodata+0x11/0x11
[1.087972]  __driver_attach+0xe0/0x110
[1.088064]  ? driver_probe_device+0x480/0x480
[1.088141]  bus_for_each_dev+0x79/0xc0
[1.088221]  bus_add_driver+0x164/0x260
[1.088302]  ? veth_init+0x11/0x11
[1.088379]  driver_register+0x5b/0xe0
[1.088402]  ? veth_init+0x11/0x11
[1.088402]  virtio_net_driver_init+0x6d/0x90
[1.088402]  do_one_initcall+0x5d/0x34c
[1.088402]  ? set_debug_rodata+0x11/0x11
[1.088402]  ? rcu_read_lock_sched_held+0x6b/0x80
[1.088402]  kernel_init_freeable+0x1ea/0x27b
[1.088402]  ? rest_init+0xd0/0xd0
[1.088402]  kernel_init+0xa/0x110
[1.088402]  ret_from_fork+0x3a/0x50
[1.094190] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 
0x60,0x64 irq 1,12


https://travis-ci.org/avagin/linux/jobs/399867744

On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:
> Extend transmit queue sysfs attribute to configure Rx queue(s) map
> per Tx queue. By default no receive queues are configured for the
> Tx queue.
> 
> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
> 
> Signed-off-by: Amritha Nambiar 
> ---
>  net/core/net-sysfs.c |   83 
> ++
>  1 file changed, 83 insertions(+)
> 
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b39987c..f25ac5f 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1283,6 +1283,88 @@ static ssize_t xps_cpus_store(struct netdev_queue 
> *queue,
>  
>  static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
>   = __ATTR_RW(xps_cpus);
> +
> +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
> +{
> + struct net_device *dev = queue->dev;
> + struct xps_dev_maps *dev_maps;
> + unsigned long *mask, index;
> + int j, len, num_tc = 1, tc = 0;
> +
> + index = get_netdev_queue_index(queue);
> +
> + if (dev->num_tc) {
> + num_tc = dev->num_tc;
> + tc = netdev_txq_to_tc(dev, index);
> + if (tc < 0)
> + return -EINVAL;
> + }
> + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
> +GFP_KERNEL);
> + if (!mask)
> + return -ENOMEM;
> +
> + rcu_read_lock();
> + dev_maps = rcu_dereference(dev->xps_rxqs_map);
> + if (!dev_maps)
> + goto out_no_maps;
> +
> + for (j = -1; j = netif_attrmask_next(j, NULL, dev->num_rx_queues),
> +  j < dev->num_rx_queues;) {
> + int i, tci = j * num_tc + tc;
> + struct xps_map *map;
> +
> +  

[PATCH net] net: mv __dev_notify_flags from void to int

2018-07-04 Thread Hangbin Liu
As call_netdevice_notifiers() and call_netdevice_notifiers_info() have
return values, we should propagate those values from __dev_notify_flags()
as well.
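
To illustrate why this matters, a hedged sketch of a notifier vetoing
NETDEV_UP; resources_ok() is a made-up placeholder, and the error is
what notifier_to_errno() will hand back to the caller:

    static int my_netdev_event(struct notifier_block *nb,
                               unsigned long event, void *ptr)
    {
        if (event == NETDEV_UP && !resources_ok())  /* hypothetical */
            return notifier_from_errno(-ENOMEM);
        return NOTIFY_DONE;
    }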

Signed-off-by: Hangbin Liu 
---
 include/linux/netdevice.h |  2 +-
 net/core/dev.c| 30 --
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3d0cc0b..b1c145e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3411,7 +3411,7 @@ int dev_ethtool(struct net *net, struct ifreq *);
 unsigned int dev_get_flags(const struct net_device *);
 int __dev_change_flags(struct net_device *, unsigned int flags);
 int dev_change_flags(struct net_device *, unsigned int);
-void __dev_notify_flags(struct net_device *, unsigned int old_flags,
+int __dev_notify_flags(struct net_device *, unsigned int old_flags,
unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7..0994533 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6798,8 +6798,10 @@ static int __dev_set_promiscuity(struct net_device *dev, 
int inc, bool notify)
 
dev_change_rx_flags(dev, IFF_PROMISC);
}
+
if (notify)
-   __dev_notify_flags(dev, old_flags, IFF_PROMISC);
+   return __dev_notify_flags(dev, old_flags, IFF_PROMISC);
+
return 0;
 }
 
@@ -6854,8 +6856,8 @@ static int __dev_set_allmulti(struct net_device *dev, int 
inc, bool notify)
dev_change_rx_flags(dev, IFF_ALLMULTI);
dev_set_rx_mode(dev);
if (notify)
-   __dev_notify_flags(dev, old_flags,
-  dev->gflags ^ old_gflags);
+   return  __dev_notify_flags(dev, old_flags,
+  dev->gflags ^ old_gflags);
}
return 0;
 }
@@ -7016,21 +7018,26 @@ int __dev_change_flags(struct net_device *dev, unsigned 
int flags)
return ret;
 }
 
-void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
-   unsigned int gchanges)
+int __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
+  unsigned int gchanges)
 {
unsigned int changes = dev->flags ^ old_flags;
+   int err = 0;
 
if (gchanges)
rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC);
 
if (changes & IFF_UP) {
if (dev->flags & IFF_UP)
-   call_netdevice_notifiers(NETDEV_UP, dev);
+   err = call_netdevice_notifiers(NETDEV_UP, dev);
else
-   call_netdevice_notifiers(NETDEV_DOWN, dev);
+   err = call_netdevice_notifiers(NETDEV_DOWN, dev);
}
 
+   err = notifier_to_errno(err);
+   if (err)
+   goto out;
+
if (dev->flags & IFF_UP &&
(changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
struct netdev_notifier_change_info change_info = {
@@ -7040,8 +7047,12 @@ void __dev_notify_flags(struct net_device *dev, unsigned 
int old_flags,
.flags_changed = changes,
};
 
-   call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info);
+   err = call_netdevice_notifiers_info(NETDEV_CHANGE,
+   &change_info.info);
+   err = notifier_to_errno(err);
}
+
+out:
+   return err;
 }
 
 /**
@@ -7062,8 +7073,7 @@ int dev_change_flags(struct net_device *dev, unsigned int 
flags)
return ret;
 
changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags);
-   __dev_notify_flags(dev, old_flags, changes);
-   return ret;
+   return __dev_notify_flags(dev, old_flags, changes);
 }
 EXPORT_SYMBOL(dev_change_flags);
 
-- 
2.5.5



Re: [PATCH v2 net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-04 Thread David Miller
From: dsah...@kernel.org
Date: Tue,  3 Jul 2018 14:36:21 -0700

> From: David Ahern 
> 
> NetworkManager likes to manage linklocal prefix routes and does so with
> the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
> code and by extension enable multipath routes with device only nexthops.
> 
> Revert f34436a43092 and these followup patches:
> 6eba08c3626b ("ipv6: Only emit append events for appended routes").
> ce45bded6435 ("mlxsw: spectrum_router: Align with new route replace logic")
> 53b562df8c20 ("mlxsw: spectrum_router: Allow appending to dev-only routes")
> 
> Update the fib_tests cases to reflect the old behavior.
> 
> Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into 
> multipath route")
> Signed-off-by: David Ahern 

Applied.

> The gre_multipath tests only exist in net-next. Those will need to be
> updated separately.

Ok, please send me the net-next update once this gets merged there.

Thanks.