Re: [PATCH net-next 01/10] udp: implement complete book-keeping for encap_needed

2018-11-13 Thread Eric Dumazet



On 11/07/2018 03:38 AM, Paolo Abeni wrote:
> The *encap_needed static keys are enabled by UDP tunnels
> and several UDP encapsulations type, but they are never
> turned off. This can cause unneeded overall performance
> degradation for systems where such features are used
> transiently.
> 
> This patch introduces complete book-keeping for such keys,
> decreasing the usage at socket destruction time, if needed,
> and avoiding that the same socket could increase the key
> usage multiple times.
> 
> rfc v3 -> v1:
>  - add socket lock around udp_tunnel_encap_enable()
> 
> rfc v2 -> rfc v3:
>  - use udp_tunnel_encap_enable() in setsockopt()
> 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/udp.h  |  7 ++-
>  include/net/udp_tunnel.h |  6 ++
>  net/ipv4/udp.c   | 19 +--
>  net/ipv6/udp.c   | 14 +-
>  4 files changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 320d49d85484..a4dafff407fb 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -49,7 +49,12 @@ struct udp_sock {
>   unsigned int corkflag;  /* Cork is required */
>   __u8 encap_type;/* Is this an Encapsulation socket? */
>   unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
> -  no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
> +  no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
> +  encap_enabled:1; /* This socket enabled encap
> +* processing; UDP tunnels and
> +* different encapsulation layer set
> +* this
> +*/
>   /*
>* Following member retains the information to create a UDP header
>* when the socket is uncorked.
> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
> index fe680ab6b15a..3fbe56430e3b 100644
> --- a/include/net/udp_tunnel.h
> +++ b/include/net/udp_tunnel.h
> @@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct 
> sk_buff *skb, bool udp_csum)
>  
>  static inline void udp_tunnel_encap_enable(struct socket *sock)
>  {
> + struct udp_sock *up = udp_sk(sock->sk);
> +
> + if (up->encap_enabled)
> + return;
> +
> + up->encap_enabled = 1;
>  #if IS_ENABLED(CONFIG_IPV6)
>   if (sock->sk->sk_family == PF_INET6)
>   ipv6_stub->udpv6_encap_enable();
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 1976fddb9e00..0ed715a72249 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -115,6 +115,7 @@
>  #include "udp_impl.h"
>  #include 
>  #include 
> +#include 
>  
>  struct udp_table udp_table __read_mostly;
>  EXPORT_SYMBOL(udp_table);
> @@ -2398,11 +2399,15 @@ void udp_destroy_sock(struct sock *sk)
>   bool slow = lock_sock_fast(sk);
>   udp_flush_pending_frames(sk);
>   unlock_sock_fast(sk, slow);
> - if (static_branch_unlikely(_encap_needed_key) && up->encap_type) {
> - void (*encap_destroy)(struct sock *sk);
> - encap_destroy = READ_ONCE(up->encap_destroy);
> - if (encap_destroy)
> - encap_destroy(sk);
> + if (static_branch_unlikely(_encap_needed_key)) {
> + if (up->encap_type) {
> + void (*encap_destroy)(struct sock *sk);
> + encap_destroy = READ_ONCE(up->encap_destroy);
> + if (encap_destroy)
> + encap_destroy(sk);
> + }
> + if (up->encap_enabled)
> + static_branch_disable(_encap_needed_key);
>   }
>  }
>  
> @@ -2447,7 +2452,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
> optname,
>   /* FALLTHROUGH */
>   case UDP_ENCAP_L2TPINUDP:
>   up->encap_type = val;
> - udp_encap_enable();
> + lock_sock(sk);
> + udp_tunnel_encap_enable(sk->sk_socket);
> + release_sock(sk);
>   break;
>   default:
>   err = -ENOPROTOOPT;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index d2d97d07ef27..fc0ce6c59ebb 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1458,11 +1458,15 @@ void udpv6_destroy_sock(struct sock *sk)
>   udp_v6_flush_pending_frames(sk);
>   release_sock(sk);
>  
> - if (static_branch_unlikely(_encap_needed_key) && up->encap_type) {
> - void (*encap_destroy)(struct sock *sk);
> - encap_destroy = READ_ONCE(up->encap_destroy);
> - if (encap_destroy)
> - encap_destroy(sk);
> + if (static_branch_unlikely(_encap_needed_key)) {
> + if (up->encap_type) {
> + void (*encap_destroy)(struct sock *sk);
> +  

Re: [PATCH net-next 17/17] net: sched: unlock rules update API

2018-11-13 Thread Jiri Pirko
Tue, Nov 13, 2018 at 02:46:54PM CET, vla...@mellanox.com wrote:
>On Mon 12 Nov 2018 at 17:30, David Miller  wrote:
>> From: Vlad Buslov 
>> Date: Mon, 12 Nov 2018 09:55:46 +0200
>>
>>> Register netlink protocol handlers for message types RTM_NEWTFILTER,
>>> RTM_DELTFILTER, RTM_GETTFILTER as unlocked. Set rtnl_held variable that
>>> tracks rtnl mutex state to be false by default.
>>
>> This whole conditional locking mechanism is really not clean and makes
>> this code so much harder to understand and audit.
>>
>> Please improve the code so that this kind of construct is not needed.
>>
>> Thank you.
>
>Hi David,
>
>I considered several approaches to this problem and decided that this
>one is most straightforward to implement. I understand your concern and
>agree that this code is not easiest to understand and can suggest
>several possible solutions that do not require this kind of elaborate
>locking mechanism in cls API, but have their own drawbacks:
>
>1. Convert all qdiscs and classifiers to support unlocked execution,
>like we did for actions. However, according to my experience with
>converting flower classifier, these require much more code than actions.
>I would estimate it to be more work than whole current unlocking effort
>(hundred+ patches). Also, authors of some of them might be unhappy with
>such intrusive changes. I don't think this approach is realistic.
>
>2. Somehow determine if rtnl is needed at the beginning of cls API rule
>update functions. Currently, this is not possible because locking
>requirements are determined by qdisc_class_ops and tcf_proto_ops 'flags'
>field, which requires code to first do whole ops lookup sequence.
>However, instead of class field I can put 'flags' in some kind of hash
>table or array that will map qdisc/classifier type string to flags, so
>it will be possible to determine locking requirements by just parsing
>netlink message and obtaining flags by qdisc/classifier type. I do not
>consider it pretty solution either, but maybe you have different
>opinion.

I think you will have to do 2. or some modification. Can't you just
check for cls ability to run unlocked early on in tc_new_tfilter()?
You would call tcf_proto_locking_check(nla_data(tca[TCA_KIND]), ...),
which would do tcf_proto_lookup_ops() for ops and check the flags?


>
>3. Anything you can suggest? I might be missing something simple that
>you would consider more elegant solution to this problem.
>
>Thanks,
>Vlad
>


Re: VETH & AF_PACKET problem

2018-11-13 Thread Anand H. Krishnan
Ok. I have 4.18.11 ubuntu. I will try out the latest kernel and will
let you know.
Thank you for your opinion and help.

Thanks,
Anand
On Wed, Nov 14, 2018 at 10:11 AM Willem de Bruijn
 wrote:
>
> On Tue, Nov 13, 2018 at 8:29 PM Anand H. Krishnan
>  wrote:
> >
> > skb_scrub_packet calls skb_orphan and from there the destructor is called.
>
> Not since
>
>   commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
>   skbuff: preserve sock reference when scrubbing the skb.
>   v4.19-rc1~140^2~523^2
>
> But the general issue is valid that the tx_ring slot should not be
> released until all users of the pages are freed, not just when the
> skb is orphaned (e.g., on skb_set_owner_r).
>
> I think that this can happen even on a transmit to a physical
> nic, if a clone is queued for reception on a packet socket. That
> clone does not clone the destructor, so if the reader is slow,
> the slot may be released from consume_skb on the original
> path.
>
> I have not verified this yet. But if correct, then the long term
> solution is to use refcounted uarg, similar to msg_zerocopy.


Re: VETH & AF_PACKET problem

2018-11-13 Thread Willem de Bruijn
On Tue, Nov 13, 2018 at 8:29 PM Anand H. Krishnan
 wrote:
>
> skb_scrub_packet calls skb_orphan and from there the destructor is called.

Not since

  commit 9c4c325252c54b34d53b3d0ffd535182b744e03d
  skbuff: preserve sock reference when scrubbing the skb.
  v4.19-rc1~140^2~523^2

But the general issue is valid that the tx_ring slot should not be
released until all users of the pages are freed, not just when the
skb is orphaned (e.g., on skb_set_owner_r).

I think that this can happen even on a transmit to a physical
nic, if a clone is queued for reception on a packet socket. That
clone does not clone the destructor, so if the reader is slow,
the slot may be released from consume_skb on the original
path.

I have not verified this yet. But if correct, then the long term
solution is to use refcounted uarg, similar to msg_zerocopy.


Re: [PATCH net] ipv6/mcast: update mc_qrv when join new group

2018-11-13 Thread Hangbin Liu
On Thu, Nov 08, 2018 at 03:44:10PM +0800, Hangbin Liu wrote:
> On Fri, Oct 26, 2018 at 10:30:54AM +0800, Hangbin Liu wrote:
> > Currently we only set mc_qrv to sysctl_mld_qrv when interface up. If we
> > change sysctl_mld_qrv after interface up, it will has no effect.
> > 
> > Fix it by assigning latest sysctl_mld_qrv to idev mc_qrv when join new 
> > group.
> > 
> > Reported-by: Ying Xu 
> > Signed-off-by: Hangbin Liu 
> 
> Hi David,
> 
> Any comments for this patch?
> > ---
> >  net/ipv6/mcast.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> > index dbab62e..bed4890 100644
> > --- a/net/ipv6/mcast.c
> > +++ b/net/ipv6/mcast.c
> > @@ -680,6 +680,7 @@ static void igmp6_group_added(struct ifmcaddr6 *mc)
> > if (!(dev->flags & IFF_UP) || (mc->mca_flags & MAF_NOREPORT))
> > return;
> >  
> > +   mc->idev->mc_qrv = sysctl_mld_qrv;
> > if (mld_in_v1_mode(mc->idev)) {
> > igmp6_join_group(mc);
> > return;

Hi David,

Sorry for the late reply. Today I find your reply in
https://patchwork.ozlabs.org/patch/989422/, but I could not find your reply
in my local mailbox, even on gmail web side. I can receive other kernel
mails correctly, which is weird...I will check it later.

For your question:

> Why isn't mld_update_qrv() taking care of this?

mld_update_qrv() is used for processing recieved mldv2 query message. This
patch is to init new sysctl_mld_qrv for new joined groups, which haven't
received any mld query message. e.g.

- ipv6_mc_up() - ipv6_mc_reset() set idev->mc_qrv to default sysctl_mld_qrv, 
which is 2
  - Join group A with qrv = 2
  - user set sysctl_mld_qrv to 3
  - Join group B with qrv = 2  <- Here the new group should set qrv to 3 as
  user updated sysctl_mld_qrv. This is the
  issue this patch want to fix
  - Recived MLDv2 query message
  - Group A, B update new qrv via mld_update_qrv()


> At a minimum, you must make your change take MLD_QRV_DEFAULT etc.
> into account like mld_update_qrv() does.

We should not init mc_qrv to min(MLD_QRV_DEFAULT, sysctl_mld_qrv), or
user will never be able to set mc_qrv greater than 2.

Thanks
Hangbin


Re: [PATCH net 4/5] net/smc: atomic SMCD cursor handling

2018-11-13 Thread David Miller
From: Ursula Braun 
Date: Mon, 12 Nov 2018 17:01:32 +0100

> +/* SMC-D cursor format */
> +union smcd_cdc_cursor {
 ...
> +} __aligned(8);

>  struct smcd_cdc_msg {
 ...
> + union smcd_cdc_cursor   prod;
> + union smcd_cdc_cursor   cons;
>   u8 res3[8];
>  } __packed;

You're asking for 8-byte alignment of an object you only embed in a
__packed structure, which therefore has arbitary alignment.

It's not going to work.


Re: [iproute PATCH] ip-address: Fix filtering by negated address flags

2018-11-13 Thread Stephen Hemminger
On Tue, 13 Nov 2018 16:12:01 +0100
Phil Sutter  wrote:

> + if (arg[0] == '-') {
> + inv = true;
> + arg++;
> + }
The inverse logic needs to be moved into the loop handling filter names.

Otherwise, you get weirdness like "-dynamic" being accepted and not
doing what was expected.

Also, please make sure the man page matches the code.


[patch net-next] net: 8021q: move vlan offload registrations into vlan_core

2018-11-13 Thread Jiri Pirko
From: Jiri Pirko 

Currently, the vlan packet offloads are registered only upon 8021q module
load. However, even without this module loaded, the offloads could be
utilized, for example by openvswitch datapath. As reported by Michael,
that causes 2x to 5x performance improvement, depending on a testcase.

So move the vlan offload registrations into vlan_core and make this
available even without 8021q module loaded.

Reported-by: Michael Shteinbok 
Signed-off-by: Jiri Pirko 
Tested-by: Michael Shteinbok 
---
 net/8021q/vlan.c  | 96 -
 net/8021q/vlan_core.c | 99 +++
 2 files changed, 99 insertions(+), 96 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 1b7a375c6616..aef1a977279c 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -648,93 +648,6 @@ static int vlan_ioctl_handler(struct net *net, void __user 
*arg)
return err;
 }
 
-static struct sk_buff *vlan_gro_receive(struct list_head *head,
-   struct sk_buff *skb)
-{
-   const struct packet_offload *ptype;
-   unsigned int hlen, off_vlan;
-   struct sk_buff *pp = NULL;
-   struct vlan_hdr *vhdr;
-   struct sk_buff *p;
-   __be16 type;
-   int flush = 1;
-
-   off_vlan = skb_gro_offset(skb);
-   hlen = off_vlan + sizeof(*vhdr);
-   vhdr = skb_gro_header_fast(skb, off_vlan);
-   if (skb_gro_header_hard(skb, hlen)) {
-   vhdr = skb_gro_header_slow(skb, hlen, off_vlan);
-   if (unlikely(!vhdr))
-   goto out;
-   }
-
-   type = vhdr->h_vlan_encapsulated_proto;
-
-   rcu_read_lock();
-   ptype = gro_find_receive_by_type(type);
-   if (!ptype)
-   goto out_unlock;
-
-   flush = 0;
-
-   list_for_each_entry(p, head, list) {
-   struct vlan_hdr *vhdr2;
-
-   if (!NAPI_GRO_CB(p)->same_flow)
-   continue;
-
-   vhdr2 = (struct vlan_hdr *)(p->data + off_vlan);
-   if (compare_vlan_header(vhdr, vhdr2))
-   NAPI_GRO_CB(p)->same_flow = 0;
-   }
-
-   skb_gro_pull(skb, sizeof(*vhdr));
-   skb_gro_postpull_rcsum(skb, vhdr, sizeof(*vhdr));
-   pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb);
-
-out_unlock:
-   rcu_read_unlock();
-out:
-   skb_gro_flush_final(skb, pp, flush);
-
-   return pp;
-}
-
-static int vlan_gro_complete(struct sk_buff *skb, int nhoff)
-{
-   struct vlan_hdr *vhdr = (struct vlan_hdr *)(skb->data + nhoff);
-   __be16 type = vhdr->h_vlan_encapsulated_proto;
-   struct packet_offload *ptype;
-   int err = -ENOENT;
-
-   rcu_read_lock();
-   ptype = gro_find_complete_by_type(type);
-   if (ptype)
-   err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*vhdr));
-
-   rcu_read_unlock();
-   return err;
-}
-
-static struct packet_offload vlan_packet_offloads[] __read_mostly = {
-   {
-   .type = cpu_to_be16(ETH_P_8021Q),
-   .priority = 10,
-   .callbacks = {
-   .gro_receive = vlan_gro_receive,
-   .gro_complete = vlan_gro_complete,
-   },
-   },
-   {
-   .type = cpu_to_be16(ETH_P_8021AD),
-   .priority = 10,
-   .callbacks = {
-   .gro_receive = vlan_gro_receive,
-   .gro_complete = vlan_gro_complete,
-   },
-   },
-};
-
 static int __net_init vlan_init_net(struct net *net)
 {
struct vlan_net *vn = net_generic(net, vlan_net_id);
@@ -762,7 +675,6 @@ static struct pernet_operations vlan_net_ops = {
 static int __init vlan_proto_init(void)
 {
int err;
-   unsigned int i;
 
pr_info("%s v%s\n", vlan_fullname, vlan_version);
 
@@ -786,9 +698,6 @@ static int __init vlan_proto_init(void)
if (err < 0)
goto err5;
 
-   for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
-   dev_add_offload(_packet_offloads[i]);
-
vlan_ioctl_set(vlan_ioctl_handler);
return 0;
 
@@ -806,13 +715,8 @@ static int __init vlan_proto_init(void)
 
 static void __exit vlan_cleanup_module(void)
 {
-   unsigned int i;
-
vlan_ioctl_set(NULL);
 
-   for (i = 0; i < ARRAY_SIZE(vlan_packet_offloads); i++)
-   dev_remove_offload(_packet_offloads[i]);
-
vlan_netlink_fini();
 
unregister_netdevice_notifier(_notifier_block);
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 57425049faf2..a313165e7a67 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -453,3 +453,102 @@ bool vlan_uses_dev(const struct net_device *dev)
return vlan_info->grp.nr_vlan_devs ? true : false;
 }
 EXPORT_SYMBOL(vlan_uses_dev);
+
+static struct sk_buff *vlan_gro_receive(struct list_head *head,
+   

Re: [PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree

2018-11-13 Thread Steffen Klassert
On Thu, Nov 08, 2018 at 07:00:14PM -0800, David Miller wrote:
> From: Florian Westphal 
> Date: Wed,  7 Nov 2018 23:00:30 +0100
> 
> > This series attempts to improve xfrm policy lookup performance when
> > a lot of (several hundred or even thousands) inexact policies exist
> > on a system.
> > 
> > On insert, a policy is either placed in hash table (all direct (/32 for
> > ipv4, /128 policies, or all policies matching a user-configured threshold).
> > All other policies get inserted into inexact list as per priority.
> > 
> > Lookup then scans inexact list for first matching entry.
> > 
> > This series instead makes it so that inexact policy is added to exactly
> > one of four different search list classes.
> > 
> > 1. "Any:Any" list, containing policies where both saddr and daddr are
> >wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
> > 2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
> >but without destination restrictions.
> >These lists are stored in rbtree nodes; each node contains those
> >policies matching saddr/prefixlen.
> > 3. "Any:daddr" list. Similar to 2), except for policies where only the
> >destinations are specified.
> > 4. "saddr:daddr" lists, containing policies that match the given
> >source/destination network.
> > 
> >The root of the saddr/daddr tree is stored in the nodes of the
> >'daddr' tree.
> ...
> > Comments or questions welcome.
> 
> Acked-by: David S. Miller 

This is now applied to ipsec-next, thanks a lot
for your work Florian!


[PATCH v3] net: Add trace events for all receive exit points

2018-11-13 Thread Geneviève Bastien
Trace events are already present for the receive entry points, to indicate
how the reception entered the stack.

This patch adds the corresponding exit trace events that will bound the
reception such that all events occurring between the entry and the exit
can be considered as part of the reception context. This greatly helps
for dependency and root cause analyses.

Without this, it is not possible with tracepoint instrumentation to
determine whether a sched_wakeup event following a netif_receive_skb
event is the result of the packet reception or a simple coincidence after
further processing by the thread. It is possible using other mechanisms
like kretprobes, but considering the "entry" points are already present,
it would be good to add the matching exit events.

In addition to linking packets with wakeups, the entry/exit event pair
can also be used to perform network stack latency analyses.

Signed-off-by: Geneviève Bastien 
CC: Mathieu Desnoyers 
CC: Steven Rostedt 
CC: Ingo Molnar 
CC: David S. Miller 
Reviewed-by: Steven Rostedt (VMware)  (tracing
side)
---
Changes in v3:
  - Update commit message
  - Rename exit template to match the entry one
  - Move the similar tracepoints together in the code
---
Changes in v2:
  - Add the return value to tracepoints where applicable
  - Verify if tracepoint is enabled before walking list in
netif_receive_skb_list
---
 include/trace/events/net.h | 71 ++
 net/core/dev.c | 38 
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 00aa72ce0e7c..aa64169c42cb 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -244,6 +244,77 @@ DEFINE_EVENT(net_dev_rx_verbose_template, 
netif_rx_ni_entry,
TP_ARGS(skb)
 );
 
+DECLARE_EVENT_CLASS(net_dev_rx_exit_template,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret),
+
+   TP_STRUCT__entry(
+   __field(void *, skbaddr)
+   __field(int,ret)
+   ),
+
+   TP_fast_assign(
+   __entry->skbaddr = skb;
+   __entry->ret = ret;
+   ),
+
+   TP_printk("skbaddr=%p ret=%d", __entry->skbaddr, __entry->ret)
+);
+
+DEFINE_EVENT(net_dev_rx_exit_template, napi_gro_frags_exit,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret)
+);
+
+DEFINE_EVENT(net_dev_rx_exit_template, napi_gro_receive_exit,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret)
+);
+
+DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_exit,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret)
+);
+
+DEFINE_EVENT(net_dev_rx_exit_template, netif_rx_exit,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret)
+);
+
+DEFINE_EVENT(net_dev_rx_exit_template, netif_rx_ni_exit,
+
+   TP_PROTO(struct sk_buff *skb, int ret),
+
+   TP_ARGS(skb, ret)
+);
+
+TRACE_EVENT(netif_receive_skb_list_exit,
+
+   TP_PROTO(struct sk_buff *skb),
+
+   TP_ARGS(skb),
+
+   TP_STRUCT__entry(
+   __field(void *, skbaddr)
+   ),
+
+   TP_fast_assign(
+   __entry->skbaddr = skb;
+   ),
+
+   TP_printk("skbaddr=%p", __entry->skbaddr)
+);
+
 #endif /* _TRACE_NET_H */
 
 /* This part must be outside protection */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ffcbdd55fa9..c4dc5df34abe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4520,9 +4520,14 @@ static int netif_rx_internal(struct sk_buff *skb)
 
 int netif_rx(struct sk_buff *skb)
 {
+   int ret;
+
trace_netif_rx_entry(skb);
 
-   return netif_rx_internal(skb);
+   ret = netif_rx_internal(skb);
+   trace_netif_rx_exit(skb, ret);
+
+   return ret;
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -4537,6 +4542,7 @@ int netif_rx_ni(struct sk_buff *skb)
if (local_softirq_pending())
do_softirq();
preempt_enable();
+   trace_netif_rx_ni_exit(skb, err);
 
return err;
 }
@@ -5222,9 +5228,14 @@ static void netif_receive_skb_list_internal(struct 
list_head *head)
  */
 int netif_receive_skb(struct sk_buff *skb)
 {
+   int ret;
+
trace_netif_receive_skb_entry(skb);
 
-   return netif_receive_skb_internal(skb);
+   ret = netif_receive_skb_internal(skb);
+   trace_netif_receive_skb_exit(skb, ret);
+
+   return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
@@ -5244,9 +5255,15 @@ void netif_receive_skb_list(struct list_head *head)
 
if (list_empty(head))
return;
-   list_for_each_entry(skb, head, list)
-   trace_netif_receive_skb_list_entry(skb);
+   if (trace_netif_receive_skb_list_entry_enabled()) {
+   list_for_each_entry(skb, head, list)
+   trace_netif_receive_skb_list_entry(skb);
+   }
netif_receive_skb_list_internal(head);
+   if 

Re: [Patch net-next v2] net: dump more useful information in netdev_rx_csum_fault()

2018-11-13 Thread Willem de Bruijn
On Tue, Nov 13, 2018 at 10:14 AM Cong Wang  wrote:
>
> On Tue, Nov 13, 2018 at 10:01 AM Willem de Bruijn
>  wrote:
> >
> > On Mon, Nov 12, 2018 at 2:49 PM Cong Wang  wrote:
> > >
> > > Currently netdev_rx_csum_fault() only shows a device name,
> > > we need more information about the skb for debugging csum
> > > failures.
> > >
> > > Sample output:
> > >
> > >  ens3: hw csum failure
> > >  dev features: 0x00014b89
> > >  skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 
> > > ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0
> >
> > Recent issues were protocol dependent, including whether vlan headers
> > were present. Perhaps also print skb vlan fields and even the first N
> > bytes of data to inspect protocol headers? Also skb_iif, esp. if this
> > differs from dev->ifindex.
>
> Pawel's case seems to be vlan related, however, as I mentioned,
> my case is neither vlan nor RXFCS related.
>
> Ideally, we should dump the whole packet in order to verify the
> correctness of the checksum. :) It is not easy to do so given
> how complex an skb is now. This is why I only select a few skb
> fields to dump. I am pretty sure this can't cover all cases, you
> can always add more for your need in the future.

Sounds good. This patch is definitely useful as is.

Also, instead of adding code to print a (partial) packet header to
the kernel log, it may make more sense to export the skb to a
drop monitor over netlink or a perf buffer.


[net-next 02/11] ice: Check for q_vector when stopping rings

2018-11-13 Thread Jeff Kirsher
From: Tony Nguyen 

There is a gap in time between a VF reset, which sets the q_vector to
NULL, and the VF requesting mapping of the q_vectors. If
ice_vsi_stop_tx_rings() is called during this time, a NULL pointer
dereference is encountered. Add a check in ice_vsi_stop_tx_rings()
to ensure the q_vector is set to avoid this situation from occurring.

Signed-off-by: Tony Nguyen 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 1041fa2a7767..11d0ab185dd2 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1908,7 +1908,8 @@ int ice_vsi_stop_tx_rings(struct ice_vsi *vsi, enum 
ice_disq_rst_src rst_src,
ice_for_each_txq(vsi, i) {
u16 v_idx;
 
-   if (!vsi->tx_rings || !vsi->tx_rings[i]) {
+   if (!vsi->tx_rings || !vsi->tx_rings[i] ||
+   !vsi->tx_rings[i]->q_vector) {
err = -EINVAL;
goto err_out;
}
-- 
2.19.1



[net-next 00/11][pull request] 100GbE Intel Wired LAN Driver Updates 2018-11-13

2018-11-13 Thread Jeff Kirsher
This series contains updates to the ice driver only.

Brett cleans up debug print messages by removing useless or duplicate
messages, and make sure we assign the hardware head pointer to head
instead of the software head pointer.  Resolved an issue when disabling
SRIOV we were trying to stop queues multiple times, so make sure we
disable SRIOV before stopping transmit and receive queues for VF.

Tony fixes a potential NULL pointer dereference during a VF reset.

Anirudh resolves an issue where we were releasing the VSI before
removing the VSI scheduler node, which was resulting in an error "Failed
to set LAN Tx queue context, error: -1".  Also fixed the guaranteed
number of VSIs available and used by discovering the device
capabilities to determine the 'guar_num_vsi' per function, rather than
always using the theoretical max number of VSIs every time.

Dave avoids a deadlock by nesting RTNL locking, so added a boolean to
determine if the RTNL lock is already held.

Lev fixes bad mask values which would break compilation.

Piotr increases the receive queue disable timeout since it can take
additional time to finish all pending queue requests.

Usha resolves an issue of VLAN priority tagged traffic not appearing on
all traffic classes, which was causing ETS bandwidth shaping to not work
as expected.

Henry fixes the reset path to cleanup the old scheduler tree before
rebuilding it.

Md Fahad removes a unnecessary check which was causing a driver load
error on platforms with more than 128 cores.

The following are changes since commit 3e536cff34244959c81575733c9ca60633f74e1b:
  net: phy: check if advertising is zero using linkmode_empty
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Anirudh Venkataramanan (2):
  ice: Remove node before releasing VSI
  ice: Calculate guaranteed VSIs per function and use it

Brett Creeley (2):
  ice: Fix debug print in ice_tx_timeout
  ice: Call pci_disable_sriov before stopping queues for VF

Dave Ertman (1):
  ice: Avoid nested RTNL locking in ice_dis_vsi

Henry Tieman (1):
  ice: Destroy scheduler tree in reset path

Lev Faerman (1):
  ice: Fix NVM mask defines

Md Fahad Iqbal Polash (1):
  ice: Remove ICE_MAX_TXQ_PER_TXQG check when configuring Tx queue

Piotr Raczynski (1):
  ice: Increase Rx queue disable timeout

Tony Nguyen (1):
  ice: Check for q_vector when stopping rings

Usha Ketineni (1):
  ice: Fix to make VLAN priority tagged traffic to appear on all TCs

 drivers/net/ethernet/intel/ice/ice.h  |   5 +-
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |   7 +-
 drivers/net/ethernet/intel/ice/ice_common.c   |  31 +++-
 .../net/ethernet/intel/ice/ice_hw_autogen.h   |   3 +
 drivers/net/ethernet/intel/ice/ice_lib.c  | 136 ++
 drivers/net/ethernet/intel/ice/ice_main.c |  37 +++--
 drivers/net/ethernet/intel/ice/ice_sched.c| 110 +-
 drivers/net/ethernet/intel/ice/ice_sched.h|   3 +
 drivers/net/ethernet/intel/ice/ice_type.h |   4 +-
 .../net/ethernet/intel/ice/ice_virtchnl_pf.c  |  18 +--
 10 files changed, 265 insertions(+), 89 deletions(-)

-- 
2.19.1



[net-next 08/11] ice: Call pci_disable_sriov before stopping queues for VF

2018-11-13 Thread Jeff Kirsher
From: Brett Creeley 

Previous to this commit the driver was immediately stopping Tx/Rx
queues when doing the following "echo 0 > sriov_numvfs" and then it was
calling pci_disable_sriov if the VFs are not assigned. This was causing
the VIRTCHNL_OP_DISABLE_QUEUES to fail because it was trying to stop
the queues for a second time.

Fix this by calling pci_disable_sriov before stopping the Tx/Rx queues.
This allows the VIRTCHNL_OP_DISABLE_QUEUES to get processed before the
driver tries to stop the Rx/Tx queues in ice_free_vfs.

Signed-off-by: Brett Creeley 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/ice/ice_virtchnl_pf.c   | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c 
b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
index e71065f9d391..20b94dee0036 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c
@@ -215,6 +215,15 @@ void ice_free_vfs(struct ice_pf *pf)
while (test_and_set_bit(__ICE_VF_DIS, pf->state))
usleep_range(1000, 2000);
 
+   /* Disable IOV before freeing resources. This lets any VF drivers
+* running in the host get themselves cleaned up before we yank
+* the carpet out from underneath their feet.
+*/
+   if (!pci_vfs_assigned(pf->pdev))
+   pci_disable_sriov(pf->pdev);
+   else
+   dev_warn(>pdev->dev, "VFs are assigned - not disabling 
SR-IOV\n");
+
/* Avoid wait time by stopping all VFs at the same time */
for (i = 0; i < pf->num_alloc_vfs; i++) {
if (!test_bit(ICE_VF_STATE_ENA, pf->vf[i].vf_states))
@@ -228,15 +237,6 @@ void ice_free_vfs(struct ice_pf *pf)
clear_bit(ICE_VF_STATE_ENA, pf->vf[i].vf_states);
}
 
-   /* Disable IOV before freeing resources. This lets any VF drivers
-* running in the host get themselves cleaned up before we yank
-* the carpet out from underneath their feet.
-*/
-   if (!pci_vfs_assigned(pf->pdev))
-   pci_disable_sriov(pf->pdev);
-   else
-   dev_warn(>pdev->dev, "VFs are assigned - not disabling 
SR-IOV\n");
-
tmp = pf->num_alloc_vfs;
pf->num_vf_qps = 0;
pf->num_alloc_vfs = 0;
-- 
2.19.1



[net-next 01/11] ice: Fix debug print in ice_tx_timeout

2018-11-13 Thread Jeff Kirsher
From: Brett Creeley 

Currently the debug print in ice_tx_timeout is printing useless and
duplicate values. First, head is being assigned to tx_ring->next_to_clean
and we are printing both of those values, but naming them HWB and NTC
respectively. Also, reading tail always returns 0 so remove that as well.

Instead of assigning the SW head (NTC) read to head, use the actual head
register and change the debug print to note that this is HW_HEAD. Also
reduce the scope of a couple variables.

Signed-off-by: Brett Creeley 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  3 +++
 drivers/net/ethernet/intel/ice/ice_main.c   | 15 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h 
b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index 596b9fb1c510..5507928c8fbe 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -7,6 +7,9 @@
 #define _ICE_HW_AUTOGEN_H_
 
 #define QTX_COMM_DBELL(_DBQM)  (0x002C + ((_DBQM) * 4))
+#define QTX_COMM_HEAD(_DBQM)   (0x000E + ((_DBQM) * 4))
+#define QTX_COMM_HEAD_HEAD_S   0
+#define QTX_COMM_HEAD_HEAD_M   ICE_M(0x1FFF, 0)
 #define PF_FW_ARQBAH   0x00080180
 #define PF_FW_ARQBAL   0x00080080
 #define PF_FW_ARQH 0x00080380
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 12a1d595..8584061e1bc6 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3691,8 +3691,8 @@ static void ice_tx_timeout(struct net_device *netdev)
struct ice_ring *tx_ring = NULL;
struct ice_vsi *vsi = np->vsi;
struct ice_pf *pf = vsi->back;
-   u32 head, val = 0, i;
int hung_queue = -1;
+   u32 i;
 
pf->tx_timeout_count++;
 
@@ -3736,17 +3736,20 @@ static void ice_tx_timeout(struct net_device *netdev)
return;
 
if (tx_ring) {
-   head = tx_ring->next_to_clean;
+   struct ice_hw *hw = >hw;
+   u32 head, val = 0;
+
+   head = (rd32(hw, QTX_COMM_HEAD(vsi->txq_map[hung_queue])) &
+   QTX_COMM_HEAD_HEAD_M) >> QTX_COMM_HEAD_HEAD_S;
/* Read interrupt register */
if (test_bit(ICE_FLAG_MSIX_ENA, pf->flags))
-   val = rd32(>hw,
+   val = rd32(hw,
   GLINT_DYN_CTL(tx_ring->q_vector->v_idx +
tx_ring->vsi->hw_base_vector));
 
-   netdev_info(netdev, "tx_timeout: VSI_num: %d, Q %d, NTC: 0x%x, 
HWB: 0x%x, NTU: 0x%x, TAIL: 0x%x, INT: 0x%x\n",
+   netdev_info(netdev, "tx_timeout: VSI_num: %d, Q %d, NTC: 0x%x, 
HW_HEAD: 0x%x, NTU: 0x%x, INT: 0x%x\n",
vsi->vsi_num, hung_queue, tx_ring->next_to_clean,
-   head, tx_ring->next_to_use,
-   readl(tx_ring->tail), val);
+   head, tx_ring->next_to_use, val);
}
 
pf->tx_timeout_last_recovery = jiffies;
-- 
2.19.1



[net-next 09/11] ice: Fix to make VLAN priority tagged traffic to appear on all TCs

2018-11-13 Thread Jeff Kirsher
From: Usha Ketineni 

This patch includes below changes to resolve the issue of ETS bandwidth
shaping to work.

1. Allocation of Tx queues is accounted for based on the enabled TC's
   in ice_vsi_setup_q_map() and enabled the Tx queues on those TC's via
   ice_vsi_cfg_txqs()

2. Get the mapped netdev TC # for the user priority and set the priority
   to TC mapping for the VSI.

Signed-off-by: Usha Ketineni 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice.h   |   4 +-
 drivers/net/ethernet/intel/ice/ice_lib.c   | 121 +
 drivers/net/ethernet/intel/ice/ice_main.c  |   4 +-
 drivers/net/ethernet/intel/ice/ice_sched.c |   2 +-
 drivers/net/ethernet/intel/ice/ice_sched.h |   1 +
 5 files changed, 81 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h 
b/drivers/net/ethernet/intel/ice/ice.h
index ba03cbd3638e..7d8575d11786 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -112,7 +112,9 @@ extern const char ice_drv_ver[];
 
 struct ice_tc_info {
u16 qoffset;
-   u16 qcount;
+   u16 qcount_tx;
+   u16 qcount_rx;
+   u8 netdev_tc;
 };
 
 struct ice_tc_cfg {
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index f6e21363c8d6..597005f39919 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -774,11 +774,13 @@ static void ice_set_dflt_vsi_ctx(struct ice_vsi_ctx *ctxt)
  */
 static void ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
 {
-   u16 offset = 0, qmap = 0, numq_tc;
-   u16 pow = 0, max_rss = 0, qcount;
+   u16 offset = 0, qmap = 0, tx_count = 0;
u16 qcount_tx = vsi->alloc_txq;
u16 qcount_rx = vsi->alloc_rxq;
+   u16 tx_numq_tc, rx_numq_tc;
+   u16 pow = 0, max_rss = 0;
bool ena_tc0 = false;
+   u8 netdev_tc = 0;
int i;
 
/* at least TC0 should be enabled by default */
@@ -794,7 +796,12 @@ static void ice_vsi_setup_q_map(struct ice_vsi *vsi, 
struct ice_vsi_ctx *ctxt)
vsi->tc_cfg.ena_tc |= 1;
}
 
-   numq_tc = qcount_rx / vsi->tc_cfg.numtc;
+   rx_numq_tc = qcount_rx / vsi->tc_cfg.numtc;
+   if (!rx_numq_tc)
+   rx_numq_tc = 1;
+   tx_numq_tc = qcount_tx / vsi->tc_cfg.numtc;
+   if (!tx_numq_tc)
+   tx_numq_tc = 1;
 
/* TC mapping is a function of the number of Rx queues assigned to the
 * VSI for each traffic class and the offset of these queues.
@@ -808,7 +815,8 @@ static void ice_vsi_setup_q_map(struct ice_vsi *vsi, struct 
ice_vsi_ctx *ctxt)
 * Setup number and offset of Rx queues for all TCs for the VSI
 */
 
-   qcount = numq_tc;
+   qcount_rx = rx_numq_tc;
+
/* qcount will change if RSS is enabled */
if (test_bit(ICE_FLAG_RSS_ENA, vsi->back->flags)) {
if (vsi->type == ICE_VSI_PF || vsi->type == ICE_VSI_VF) {
@@ -816,37 +824,41 @@ static void ice_vsi_setup_q_map(struct ice_vsi *vsi, 
struct ice_vsi_ctx *ctxt)
max_rss = ICE_MAX_LG_RSS_QS;
else
max_rss = ICE_MAX_SMALL_RSS_QS;
-   qcount = min_t(int, numq_tc, max_rss);
-   qcount = min_t(int, qcount, vsi->rss_size);
+   qcount_rx = min_t(int, rx_numq_tc, max_rss);
+   qcount_rx = min_t(int, qcount_rx, vsi->rss_size);
}
}
 
/* find the (rounded up) power-of-2 of qcount */
-   pow = order_base_2(qcount);
+   pow = order_base_2(qcount_rx);
 
for (i = 0; i < ICE_MAX_TRAFFIC_CLASS; i++) {
if (!(vsi->tc_cfg.ena_tc & BIT(i))) {
/* TC is not enabled */
vsi->tc_cfg.tc_info[i].qoffset = 0;
-   vsi->tc_cfg.tc_info[i].qcount = 1;
+   vsi->tc_cfg.tc_info[i].qcount_rx = 1;
+   vsi->tc_cfg.tc_info[i].qcount_tx = 1;
+   vsi->tc_cfg.tc_info[i].netdev_tc = 0;
ctxt->info.tc_mapping[i] = 0;
continue;
}
 
/* TC is enabled */
vsi->tc_cfg.tc_info[i].qoffset = offset;
-   vsi->tc_cfg.tc_info[i].qcount = qcount;
+   vsi->tc_cfg.tc_info[i].qcount_rx = qcount_rx;
+   vsi->tc_cfg.tc_info[i].qcount_tx = tx_numq_tc;
+   vsi->tc_cfg.tc_info[i].netdev_tc = netdev_tc++;
 
qmap = ((offset << ICE_AQ_VSI_TC_Q_OFFSET_S) &
ICE_AQ_VSI_TC_Q_OFFSET_M) |
((pow << ICE_AQ_VSI_TC_Q_NUM_S) &
 ICE_AQ_VSI_TC_Q_NUM_M);
-   offset += qcount;
+   offset += qcount_rx;
+

[net-next 11/11] ice: Remove ICE_MAX_TXQ_PER_TXQG check when configuring Tx queue

2018-11-13 Thread Jeff Kirsher
From: Md Fahad Iqbal Polash 

This patch removes the condition checking of VSI TX queue number to
ICE_MAX_TXQ_PER_TXQG. This is an unnecessary check and causes a driver
load error on hosts that have more than 128 cores.

Signed-off-by: Md Fahad Iqbal Polash 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 53685a66125b..a5961a8fe73c 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1633,10 +1633,6 @@ int ice_vsi_cfg_txqs(struct ice_vsi *vsi)
if (!qg_buf)
return -ENOMEM;
 
-   if (vsi->num_txq > ICE_MAX_TXQ_PER_TXQG) {
-   err = -EINVAL;
-   goto err_cfg_txqs;
-   }
qg_buf->num_txqs = 1;
num_q_grps = 1;
 
-- 
2.19.1



[net-next 10/11] ice: Destroy scheduler tree in reset path

2018-11-13 Thread Jeff Kirsher
From: Henry Tieman 

The scheduler tree is is always rebuilt during reset. The existing code
adds new scheduler nodes for queues but may not clean up earlier nodes.
This patch removed the old scheduler tree during reset before it is
rebuilt.

Signed-off-by: Henry Tieman 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 597005f39919..53685a66125b 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2551,6 +2551,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi)
return -EINVAL;
 
pf = vsi->back;
+   ice_rm_vsi_lan_cfg(vsi->port_info, vsi->idx);
ice_vsi_free_q_vectors(vsi);
ice_free_res(vsi->back->sw_irq_tracker, vsi->sw_base_vector, vsi->idx);
ice_free_res(vsi->back->hw_irq_tracker, vsi->hw_base_vector, vsi->idx);
-- 
2.19.1



[net-next 06/11] ice: Fix NVM mask defines

2018-11-13 Thread Jeff Kirsher
From: Lev Faerman 

Fixes bad masks that would break compilation when evaluated.

Signed-off-by: Lev Faerman 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h 
b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 602f02a0a2d1..4078070881ce 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1066,10 +1066,10 @@ struct ice_aqc_nvm {
 #define ICE_AQC_NVM_LAST_CMD   BIT(0)
 #define ICE_AQC_NVM_PCIR_REQ   BIT(0)  /* Used by NVM Update reply */
 #define ICE_AQC_NVM_PRESERVATION_S 1
-#define ICE_AQC_NVM_PRESERVATION_M (3 << CSR_AQ_NVM_PRESERVATION_S)
-#define ICE_AQC_NVM_NO_PRESERVATION(0 << CSR_AQ_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_PRESERVATION_M (3 << ICE_AQC_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_NO_PRESERVATION(0 << ICE_AQC_NVM_PRESERVATION_S)
 #define ICE_AQC_NVM_PRESERVE_ALL   BIT(1)
-#define ICE_AQC_NVM_PRESERVE_SELECTED  (3 << CSR_AQ_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_PRESERVE_SELECTED  (3 << ICE_AQC_NVM_PRESERVATION_S)
 #define ICE_AQC_NVM_FLASH_ONLY BIT(7)
__le16 module_typeid;
__le16 length;
-- 
2.19.1



[net-next 07/11] ice: Increase Rx queue disable timeout

2018-11-13 Thread Jeff Kirsher
From: Piotr Raczynski 

With much traffic coming into the port, Rx queue disable
procedure can take more time until all pending queue
requests on PCIe finish. Reuse ICE_Q_WAIT_MAX_RETRY macro
and increase the delay itself.

Signed-off-by: Piotr Raczynski 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 1efd760debc2..f6e21363c8d6 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -174,15 +174,15 @@ static int ice_pf_rxq_wait(struct ice_pf *pf, int pf_q, 
bool ena)
 {
int i;
 
-   for (i = 0; i < ICE_Q_WAIT_RETRY_LIMIT; i++) {
+   for (i = 0; i < ICE_Q_WAIT_MAX_RETRY; i++) {
u32 rx_reg = rd32(>hw, QRX_CTRL(pf_q));
 
if (ena == !!(rx_reg & QRX_CTRL_QENA_STAT_M))
break;
 
-   usleep_range(10, 20);
+   usleep_range(20, 40);
}
-   if (i >= ICE_Q_WAIT_RETRY_LIMIT)
+   if (i >= ICE_Q_WAIT_MAX_RETRY)
return -ETIMEDOUT;
 
return 0;
-- 
2.19.1



[net-next 03/11] ice: Remove node before releasing VSI

2018-11-13 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

Before releasing the VSI, remove the VSI scheduler node. If not,
the node is left in the scheduler tree and, on subsequent load, the
scheduler tree contains the node so it does not set it in vsi_ctx.
This, later, causes the node to not be found in ice_sched_get_free_qparent
which leads to a "Failed to set LAN Tx queue context, error: -1".

To remove the scheduler node, this patch introduces ice_rm_vsi_lan_cfg
and related helpers.

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_lib.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_sched.c | 108 -
 drivers/net/ethernet/intel/ice/ice_sched.h |   2 +
 3 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c 
b/drivers/net/ethernet/intel/ice/ice_lib.c
index 11d0ab185dd2..1efd760debc2 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2492,6 +2492,7 @@ int ice_vsi_release(struct ice_vsi *vsi)
}
 
ice_remove_vsi_fltr(>hw, vsi->idx);
+   ice_rm_vsi_lan_cfg(vsi->port_info, vsi->idx);
ice_vsi_delete(vsi);
ice_vsi_free_q_vectors(vsi);
ice_vsi_clear_rings(vsi);
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c 
b/drivers/net/ethernet/intel/ice/ice_sched.c
index 7cc8aa18a22b..7e807b0e7514 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -1527,7 +1527,7 @@ ice_sched_update_vsi_child_nodes(struct ice_port_info 
*pi, u16 vsi_handle,
 }
 
 /**
- * ice_sched_cfg_vsi - configure the new/exisiting VSI
+ * ice_sched_cfg_vsi - configure the new/existing VSI
  * @pi: port information structure
  * @vsi_handle: software VSI handle
  * @tc: TC number
@@ -1605,3 +1605,109 @@ ice_sched_cfg_vsi(struct ice_port_info *pi, u16 
vsi_handle, u8 tc, u16 maxqs,
 
return status;
 }
+
+/**
+ * ice_sched_rm_agg_vsi_entry - remove agg related VSI info entry
+ * @pi: port information structure
+ * @vsi_handle: software VSI handle
+ *
+ * This function removes single aggregator VSI info entry from
+ * aggregator list.
+ */
+static void
+ice_sched_rm_agg_vsi_info(struct ice_port_info *pi, u16 vsi_handle)
+{
+   struct ice_sched_agg_info *agg_info;
+   struct ice_sched_agg_info *atmp;
+
+   list_for_each_entry_safe(agg_info, atmp, >agg_list, list_entry) {
+   struct ice_sched_agg_vsi_info *agg_vsi_info;
+   struct ice_sched_agg_vsi_info *vtmp;
+
+   list_for_each_entry_safe(agg_vsi_info, vtmp,
+_info->agg_vsi_list, list_entry)
+   if (agg_vsi_info->vsi_handle == vsi_handle) {
+   list_del(_vsi_info->list_entry);
+   devm_kfree(ice_hw_to_dev(pi->hw),
+  agg_vsi_info);
+   return;
+   }
+   }
+}
+
+/**
+ * ice_sched_rm_vsi_cfg - remove the VSI and its children nodes
+ * @pi: port information structure
+ * @vsi_handle: software VSI handle
+ * @owner: LAN or RDMA
+ *
+ * This function removes the VSI and its LAN or RDMA children nodes from the
+ * scheduler tree.
+ */
+static enum ice_status
+ice_sched_rm_vsi_cfg(struct ice_port_info *pi, u16 vsi_handle, u8 owner)
+{
+   enum ice_status status = ICE_ERR_PARAM;
+   struct ice_vsi_ctx *vsi_ctx;
+   u8 i, j = 0;
+
+   if (!ice_is_vsi_valid(pi->hw, vsi_handle))
+   return status;
+   mutex_lock(>sched_lock);
+   vsi_ctx = ice_get_vsi_ctx(pi->hw, vsi_handle);
+   if (!vsi_ctx)
+   goto exit_sched_rm_vsi_cfg;
+
+   for (i = 0; i < ICE_MAX_TRAFFIC_CLASS; i++) {
+   struct ice_sched_node *vsi_node, *tc_node;
+
+   tc_node = ice_sched_get_tc_node(pi, i);
+   if (!tc_node)
+   continue;
+
+   vsi_node = ice_sched_get_vsi_node(pi->hw, tc_node, vsi_handle);
+   if (!vsi_node)
+   continue;
+
+   while (j < vsi_node->num_children) {
+   if (vsi_node->children[j]->owner == owner) {
+   ice_free_sched_node(pi, vsi_node->children[j]);
+
+   /* reset the counter again since the num
+* children will be updated after node removal
+*/
+   j = 0;
+   } else {
+   j++;
+   }
+   }
+   /* remove the VSI if it has no children */
+   if (!vsi_node->num_children) {
+   ice_free_sched_node(pi, vsi_node);
+   vsi_ctx->sched.vsi_node[i] = NULL;
+
+   /* clean up agg related vsi info if any 

[net-next 04/11] ice: Calculate guaranteed VSIs per function and use it

2018-11-13 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

Currently we are setting the guar_num_vsi to equal to ICE_MAX_VSI
which is the device limit of 768. This is incorrect and could have
unintended consequences. To fix this use the valid_function's 8-bit
bitmap returned from discovering device capabilities to determine the
guar_num_vsi per function. guar_num_vsi value is then passed on to
pf->num_alloc_vsi.

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice.h  |  1 -
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  1 +
 drivers/net/ethernet/intel/ice/ice_common.c   | 31 +--
 drivers/net/ethernet/intel/ice/ice_main.c |  3 +-
 drivers/net/ethernet/intel/ice/ice_type.h |  4 ++-
 5 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h 
b/drivers/net/ethernet/intel/ice/ice.h
index b8548370f1c7..ba03cbd3638e 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -52,7 +52,6 @@ extern const char ice_drv_ver[];
 #define ICE_MBXQ_LEN   64
 #define ICE_MIN_MSIX   2
 #define ICE_NO_VSI 0x
-#define ICE_MAX_VSI_ALLOC  130
 #define ICE_MAX_TXQS   2048
 #define ICE_MAX_RXQS   2048
 #define ICE_VSI_MAP_CONTIG 0
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h 
b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 6653555f55dd..602f02a0a2d1 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -87,6 +87,7 @@ struct ice_aqc_list_caps {
 /* Device/Function buffer entry, repeated per reported capability */
 struct ice_aqc_list_caps_elem {
__le16 cap;
+#define ICE_AQC_CAPS_VALID_FUNCTIONS   0x0005
 #define ICE_AQC_CAPS_SRIOV 0x0012
 #define ICE_AQC_CAPS_VF0x0013
 #define ICE_AQC_CAPS_VSI   0x0017
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 554fd707a6d6..9de5a3aac77d 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1386,6 +1386,27 @@ void ice_release_res(struct ice_hw *hw, enum 
ice_aq_res_ids res)
}
 }
 
+/**
+ * ice_get_guar_num_vsi - determine number of guar VSI for a PF
+ * @hw: pointer to the hw structure
+ *
+ * Determine the number of valid functions by going through the bitmap returned
+ * from parsing capabilities and use this to calculate the number of VSI per 
PF.
+ */
+static u32 ice_get_guar_num_vsi(struct ice_hw *hw)
+{
+   u8 funcs;
+
+#define ICE_CAPS_VALID_FUNCS_M 0xFF
+   funcs = hweight8(hw->dev_caps.common_cap.valid_functions &
+ICE_CAPS_VALID_FUNCS_M);
+
+   if (!funcs)
+   return 0;
+
+   return ICE_MAX_VSI / funcs;
+}
+
 /**
  * ice_parse_caps - parse function/device capabilities
  * @hw: pointer to the hw struct
@@ -1428,6 +1449,12 @@ ice_parse_caps(struct ice_hw *hw, void *buf, u32 
cap_count,
u16 cap = le16_to_cpu(cap_resp->cap);
 
switch (cap) {
+   case ICE_AQC_CAPS_VALID_FUNCTIONS:
+   caps->valid_functions = number;
+   ice_debug(hw, ICE_DBG_INIT,
+ "HW caps: Valid Functions = %d\n",
+ caps->valid_functions);
+   break;
case ICE_AQC_CAPS_SRIOV:
caps->sr_iov_1_1 = (number == 1);
ice_debug(hw, ICE_DBG_INIT,
@@ -1457,10 +1484,10 @@ ice_parse_caps(struct ice_hw *hw, void *buf, u32 
cap_count,
  "HW caps: Dev.VSI cnt = %d\n",
  dev_p->num_vsi_allocd_to_host);
} else if (func_p) {
-   func_p->guaranteed_num_vsi = number;
+   func_p->guar_num_vsi = ice_get_guar_num_vsi(hw);
ice_debug(hw, ICE_DBG_INIT,
  "HW caps: Func.VSI cnt = %d\n",
- func_p->guaranteed_num_vsi);
+ number);
}
break;
case ICE_AQC_CAPS_RSS:
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 8584061e1bc6..ea79e5e1f589 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2091,8 +2091,7 @@ static int ice_probe(struct pci_dev *pdev,
 
ice_determine_q_usage(pf);
 
-   pf->num_alloc_vsi = min_t(u16, ICE_MAX_VSI_ALLOC,
- hw->func_caps.guaranteed_num_vsi);
+   pf->num_alloc_vsi = 

[net-next 05/11] ice: Avoid nested RTNL locking in ice_dis_vsi

2018-11-13 Thread Jeff Kirsher
From: Dave Ertman 

ice_dis_vsi() performs an rtnl_lock() if it detects a netdev that is
running on the VSI. In cases where the RTNL lock has already been
acquired, a deadlock results. Add a boolean to pass to ice_dis_vsi to
tell it if the RTNL lock is already held.

Signed-off-by: Dave Ertman 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_main.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index ea79e5e1f589..089b0f0b2e71 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3137,8 +3137,9 @@ static void ice_vsi_release_all(struct ice_pf *pf)
 /**
  * ice_dis_vsi - pause a VSI
  * @vsi: the VSI being paused
+ * @locked: is the rtnl_lock already held
  */
-static void ice_dis_vsi(struct ice_vsi *vsi)
+static void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
 {
if (test_bit(__ICE_DOWN, vsi->state))
return;
@@ -3147,9 +3148,13 @@ static void ice_dis_vsi(struct ice_vsi *vsi)
 
if (vsi->type == ICE_VSI_PF && vsi->netdev) {
if (netif_running(vsi->netdev)) {
-   rtnl_lock();
-   vsi->netdev->netdev_ops->ndo_stop(vsi->netdev);
-   rtnl_unlock();
+   if (!locked) {
+   rtnl_lock();
+   vsi->netdev->netdev_ops->ndo_stop(vsi->netdev);
+   rtnl_unlock();
+   } else {
+   vsi->netdev->netdev_ops->ndo_stop(vsi->netdev);
+   }
} else {
ice_vsi_close(vsi);
}
@@ -3188,7 +3193,7 @@ static void ice_pf_dis_all_vsi(struct ice_pf *pf)
 
ice_for_each_vsi(pf, v)
if (pf->vsi[v])
-   ice_dis_vsi(pf->vsi[v]);
+   ice_dis_vsi(pf->vsi[v], false);
 }
 
 /**
-- 
2.19.1



Re: VETH & AF_PACKET problem

2018-11-13 Thread Willem de Bruijn
On Sun, Nov 11, 2018 at 11:15 PM Anand H. Krishnan
 wrote:
>
> Hello,
>
> We are seeing a problem with AF_PACKET when used along with the
> veth interfaces. SCP complains that message authentication code is
> incorrect.
>
> I was browsing the code and I see that veth_xmit calls dev_forward_skb
> which does a skb_scrub_packet, which in turn calls the skb destructor 
> function.

Where does it call this?

On the other hand, it is likely that skb_orphan() is called somewhere on the
receive path, which indeed will release the tx_ring slot.

> skb_orphan_frags, called by dev_forward_skb, seems to do the right thing,
> but it probably does not get called for packets from AF_PACKET socket, since 
> the
> skb is not a zero copy skb (SKBTX_DEV_ZEROCOPY is not set).

Yes, an skb_copy_ubufs may be needed if the orphan is unavoidable.


Re: [Patch net-next v2] net: dump more useful information in netdev_rx_csum_fault()

2018-11-13 Thread Cong Wang
On Tue, Nov 13, 2018 at 10:01 AM Willem de Bruijn
 wrote:
>
> On Mon, Nov 12, 2018 at 2:49 PM Cong Wang  wrote:
> >
> > Currently netdev_rx_csum_fault() only shows a device name,
> > we need more information about the skb for debugging csum
> > failures.
> >
> > Sample output:
> >
> >  ens3: hw csum failure
> >  dev features: 0x00014b89
> >  skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 
> > ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0
>
> Recent issues were protocol dependent, including whether vlan headers
> were present. Perhaps also print skb vlan fields and even the first N
> bytes of data to inspect protocol headers? Also skb_iif, esp. if this
> differs from dev->ifindex.

Pawel's case seems to be vlan related, however, as I mentioned,
my case is neither vlan nor RXFCS related.

Ideally, we should dump the whole packet in order to verify the
correctness of the checksum. :) It is not easy to do so given
how complex an skb is now. This is why I only select a few skb
fields to dump. I am pretty sure this can't cover all cases, you
can always add more for your need in the future.

Thanks.


Re: [Patch net-next v2] net: dump more useful information in netdev_rx_csum_fault()

2018-11-13 Thread Willem de Bruijn
On Mon, Nov 12, 2018 at 2:49 PM Cong Wang  wrote:
>
> Currently netdev_rx_csum_fault() only shows a device name,
> we need more information about the skb for debugging csum
> failures.
>
> Sample output:
>
>  ens3: hw csum failure
>  dev features: 0x00014b89
>  skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 
> ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0

Recent issues were protocol dependent, including whether vlan headers
were present. Perhaps also print skb vlan fields and even the first N
bytes of data to inspect protocol headers? Also skb_iif, esp. if this
differs from dev->ifindex.


RE: [PATCH net-next v3 6/6] net/ncsi: Configure multi-package, multi-channel modes with failover

2018-11-13 Thread Justin.Lee1
Hi Samuel,

I have tested your new patch. The failover function works as expected.

Thanks,
Justin


> On Fri, 2018-11-09 at 21:58 +, justin.l...@dell.com wrote:
> > Hi Samuel,
> > 
> > After running more testing, I notice that the extra patch causing failover 
> > function
> > to fail. Also, occasionally, I will see two channels having TX enabled if I
> > send netlink command back-to-back.
> > 
> > cat /sys/kernel/debug/ncsi_protocol/ncsi_device_
> > IFIDX IFNAME NAME   PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> > =
> >   2   eth2   ncsi0  000 000 1  1  1  1  1  1  1  2  1  1  1  1  0  1
> >   2   eth2   ncsi1  000 001 1  1  1  1  1  1  0  2  1  1  1  1  0  1
> >   2   eth2   ncsi2  001 000 0  0  1  1  0  0  0  1  0  1  1  1  0  1
> >   2   eth2   ncsi3  001 001 0  0  1  1  0  0  0  1  0  1  1  1  0  1
> > =
> > MP: Multi-mode Package  WP: Whitelist Package
> > MC: Multi-mode Channel  WC: Whitelist Channel
> > PC: Primary Channel CS: Channel State IA/A/IV 1/2/3
> > PS: Poll Status LS: Link Status
> > RU: Running CR: Carrier OK
> > NQ: Queue Stopped   HA: Hardware Arbitration
> > 
> > Thanks,
> > Justin
> > 
> > 
> > > From  Mon Sep 17 00:00:00 2001
> > > From: Samuel Mendoza-Jonas 
> > > Date: Fri, 9 Nov 2018 13:11:03 +1100
> > > Subject: [PATCH] net/ncsi: Reset state fixes, single-channel LSC
> > > 
> > > ---
> > >  net/ncsi/ncsi-aen.c|  8 +---
> > >  net/ncsi/ncsi-manage.c | 19 +++
> > >  2 files changed, 20 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
> > > index 39c2e9eea2ba..034cb1dc5566 100644
> > > --- a/net/ncsi/ncsi-aen.c
> > > +++ b/net/ncsi/ncsi-aen.c
> > > @@ -93,14 +93,16 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv 
> > > *ndp,
> > >   if ((had_link == has_link) || chained)
> > >   return 0;
> > >  
> > > - if (!ndp->multi_package && !nc->package->multi_channel) {
> > > - if (had_link)
> > > - ndp->flags |= NCSI_DEV_RESHUFFLE;
> > > + if (!ndp->multi_package && !nc->package->multi_channel && had_link) {
> > > + ndp->flags |= NCSI_DEV_RESHUFFLE;
> > >   ncsi_stop_channel_monitor(nc);
> > >   spin_lock_irqsave(&ndp->lock, flags);
> > >   list_add_tail_rcu(&nc->link, &ndp->channel_queue);
> > >   spin_unlock_irqrestore(>lock, flags);
> > >   return ncsi_process_next_channel(ndp);
> > > + } else {
> > > + /* Configured channel came up */
> > > + return 0;
> > 
> > It always goes to the else statement if multi_package and/or multi_channel
> > is enabled.
> >  
> > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - pkg 0 ch 0 
> > state down
> > npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - had_link 1, 
> > has_link 0, chained 0
> > 
> > This code has no chance to run.
> > if (had_link) {
> > ncm = &nc->modes[NCSI_MODE_TX_ENABLE];
> > if (ncsi_channel_is_last(ndp, nc)) {
> > /* No channels left, reconfigure */
> > return ncsi_reset_dev(&ndp->ndev);
> > } else if (ncm->enable) {
> > /* Need to failover Tx channel */
> > ncsi_update_tx_channel(ndp, nc->package, nc, NULL);
> > }
> > 
> > >   }
> > > 
> 
> Oops, wrote that patch a little fast it seems. This may also affect the
> double-Tx above since ncsi_update_tx_channel() won't be reached.
> I've attached a fixed up version below.
> 
> For the failover issue you're seeing in your previous email the issue is
> likely in the ncsi_dev_state_suspend_gls state. This should send a GLS
> command to all available channels, but it only does it for the current
> package. Since not all packages are necessarily enabled in single-channel 
> mode I'll need to have a think about the neatest way to handle that, but
> it's a separate issue from this patch.
> 
> Cheers,
> Sam
> 
> 
> From  Mon Sep 17 00:00:00 2001
> From: Samuel Mendoza-Jonas 
> Date: Fri, 9 Nov 2018 13:11:03 +1100
> Subject: [PATCH] net/ncsi: Reset state fixes, single-channel LSC
> 
> ---
>  net/ncsi/ncsi-aen.c| 16 ++--
>  net/ncsi/ncsi-manage.c | 19 +++
>  2 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
> index 39c2e9eea2ba..76559d0eeea8 100644
> --- a/net/ncsi/ncsi-aen.c
> +++ b/net/ncsi/ncsi-aen.c
> @@ -94,13 +94,17 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
>   return 0;
>  
>   if (!ndp->multi_package && !nc->package->multi_channel) {
> - if (had_link)
> + if (had_link) {
>   ndp->flags |= NCSI_DEV_RESHUFFLE;
> - 

Re: BUG: sleeping function called from invalid context at mm/slab.h:421

2018-11-13 Thread Roman Gushchin
On Tue, Nov 13, 2018 at 10:03:38PM +0530, Naresh Kamboju wrote:
> While running the kernel selftests bpf test_cgroup_storage test, this
> kernel BUG is reported every time on all devices running Linux -next
> 4.20.0-rc2-next-20181113 (from 4.19.0-rc5-next-20180928).
> This kernel BUG log is from an x86_64 machine.
> 
> Do you see it at your end?
> 
> [   73.047526] BUG: sleeping function called from invalid context at
> /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
> [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name:
> test_cgroup_sto
> [   73.068342] INFO: lockdep is turned off.
> [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
> 4.20.0-rc2-next-20181113 #1
> [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.0b 07/27/2017
> [   73.088018] Call Trace:
> [   73.090463]  dump_stack+0x70/0xa5
> [   73.093783]  ___might_sleep+0x152/0x240
> [   73.097619]  __might_sleep+0x4a/0x80
> [   73.101191]  __kmalloc_node+0x1cf/0x2f0
> [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
> [   73.109909]  cgroup_storage_update_elem+0x46/0x90
> [   73.114608]  map_update_elem+0x4a1/0x4c0
> [   73.118534]  __x64_sys_bpf+0x124/0x280
> [   73.122286]  do_syscall_64+0x4f/0x190
> [   73.125952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   73.131004] RIP: 0033:0x7f46b93ea7f9
> [   73.134581] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f e6 2b 00 f7 d8 64 89
> 01 48
> [   73.153318] RSP: 002b:7fffc6595858 EFLAGS: 0206 ORIG_RAX:
> 0141
> [   73.160876] RAX: ffda RBX: 014a0260 RCX: 
> 7f46b93ea7f9
> [   73.167999] RDX: 0048 RSI: 7fffc65958a0 RDI: 
> 0002
> [   73.175124] RBP: 7fffc6595870 R08: 7fffc65958a0 R09: 
> 7fffc65958a0
> [   73.182246] R10: 7fffc65958a0 R11: 0206 R12: 
> 0003
> [   73.189369] R13: 0004 R14: 0005 R15: 
> 0006
> selftests: bpf: test_cgroup_storage

Hi Naresh!

Thank you for the report! Can you please try the following patch?

Thanks!

--

diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index c97a8f968638..d91710fb8360 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -139,8 +139,8 @@ static int cgroup_storage_update_elem(struct bpf_map *map, 
void *_key,
return -ENOENT;
 
new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
-  map->value_size, __GFP_ZERO | GFP_USER,
-  map->numa_node);
+  map->value_size, __GFP_ZERO | GFP_ATOMIC |
+  __GFP_NOWARN, map->numa_node);
if (!new)
return -ENOMEM;
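
For context, the allocation is reached under rcu_read_lock() taken by the
generic map update path, roughly (a simplified sketch of kernel/bpf/syscall.c,
not the exact code):

	rcu_read_lock();	/* atomic context: sleeping allocations are invalid */
	err = map->ops->map_update_elem(map, key, value, attr->flags);
	/* -> cgroup_storage_update_elem() -> kmalloc_node() */
	rcu_read_unlock();

GFP_USER may sleep, hence the might_sleep() splat above; GFP_ATOMIC is safe
in this context.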
 


Re: [PATCHv2 net-next 1/4] sctp: define subscribe in sctp_sock as __u16

2018-11-13 Thread Neil Horman
On Tue, Nov 13, 2018 at 02:24:53PM +0800, Xin Long wrote:
>  
>   /* Default Peer Address Parameters.  These defaults can
>* be modified via SCTP_PEER_ADDR_PARAMS
> @@ -5267,14 +5274,24 @@ static int sctp_getsockopt_disable_fragments(struct 
> sock *sk, int len,
>  static int sctp_getsockopt_events(struct sock *sk, int len, char __user 
> *optval,
> int __user *optlen)
>  {
> + struct sctp_event_subscribe subscribe;
> + __u8 *sn_type = (__u8 *)&subscribe;
> + int i;
> +
>   if (len == 0)
>   return -EINVAL;
>   if (len > sizeof(struct sctp_event_subscribe))
>   len = sizeof(struct sctp_event_subscribe);
>   if (put_user(len, optlen))
>   return -EFAULT;
> - if (copy_to_user(optval, &sctp_sk(sk)->subscribe, len))
> +
> + for (i = 0; i <= len; i++)
> + sn_type[i] = sctp_ulpevent_type_enabled(sctp_sk(sk)->subscribe,
> + SCTP_SN_TYPE_BASE + i);
> +
This seems like an off-by-one error.  sctp_event_subscribe has N bytes in it (1
byte for each event), meaning that events 0-(N-1) are subscribable.
Iterating this loop implies that you are going to check N+1 events, overrunning
the sctp_event_subscribe struct.
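
Presumably the intended loop uses an exclusive bound, along these lines (a
sketch of the corrected loop, not the actual respin):

	for (i = 0; i < len; i++)
		sn_type[i] = sctp_ulpevent_type_enabled(sctp_sk(sk)->subscribe,
							SCTP_SN_TYPE_BASE + i);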

Neil

> 


Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio
On Tue, 13 Nov 2018 16:53:07 +0100
Stefano Brivio  wrote:

> But to make that effective, you would need to protect the read too, and
> that makes your optimisation not really overzealous I think.
> 
> I'd rather go with an additional comment, if that doesn't become
> unreadable.

Oh, and of course, this whole thing makes sense only if for some reason
you end up keeping this function. Maybe ignore my comments on this patch
at least for the moment being ;)

-- 
Stefano


Re: [PATCH net v2 0/4] qed: Miscellaneous bug fixes

2018-11-13 Thread David Miller
From: Denis Bolotin 
Date: Mon, 12 Nov 2018 12:50:19 +0200

> This patch series fixes several unrelated bugs across the driver.
> Please consider applying to net.
> 
> V1->V2:
> ---
> Use dma_rmb() instead of rmb().

Series applied, thank you.


Re: pull-request: can 2018-11-09

2018-11-13 Thread David Miller
From: Marc Kleine-Budde 
Date: Mon, 12 Nov 2018 12:57:08 +0100

> this is a pull request of 20 patches for net/master.

Pulled, thanks Marc.


[PATCH net] ipv6: fix a dst leak when removing its exception

2018-11-13 Thread Xin Long
There is no need to hold dst before calling rt6_remove_exception_rt().
The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(),
which has been removed in Commit 93531c674315 ("net/ipv6: separate
handling of FIB entries from dst based routes"). Otherwise, it will
cause a dst leak.

This patch is to simply remove the dst_hold_safe() call before calling
rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt().
It's safe, because the removal of the exception that holds its dst's
refcnt is protected by rt6_exception_lock.

Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based 
routes")
Fixes: 23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling")
Reported-by: Li Shuang 
Signed-off-by: Xin Long 
---
 net/ipv6/route.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 2a7423c..14b422f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2232,8 +2232,7 @@ static void ip6_link_failure(struct sk_buff *skb)
if (rt) {
rcu_read_lock();
if (rt->rt6i_flags & RTF_CACHE) {
-   if (dst_hold_safe(&rt->dst))
-   rt6_remove_exception_rt(rt);
+   rt6_remove_exception_rt(rt);
} else {
struct fib6_info *from;
struct fib6_node *fn;
@@ -3214,8 +3213,8 @@ static int ip6_del_cached_rt(struct rt6_info *rt, struct 
fib6_config *cfg)
if (cfg->fc_flags & RTF_GATEWAY &&
!ipv6_addr_equal(&cfg->fc_gateway, &rt->rt6i_gateway))
goto out;
-   if (dst_hold_safe(&rt->dst))
-   rc = rt6_remove_exception_rt(rt);
+
+   rc = rt6_remove_exception_rt(rt);
 out:
return rc;
 }
-- 
2.1.0
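
To illustrate the leak the patch removes (an annotated sketch of the old
ip6_link_failure() path, per the description above):

	if (dst_hold_safe(&rt->dst))		/* takes an extra dst refcnt...    */
		rt6_remove_exception_rt(rt);	/* ...which drops only the         */
						/* exception's own reference, so   */
						/* the extra one is never released */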



[PATCH bpf-next v2] filter: add BPF_ADJ_ROOM_DATA mode to bpf_skb_adjust_room()

2018-11-13 Thread Nicolas Dichtel
This new mode makes it possible to add or remove an l2 header in a
programmatic way with cls_bpf. For example, it can be used to push and pop
mpls headers.

Signed-off-by: Nicolas Dichtel 
Acked-by: Martin KaFai Lau 
---

v2: use skb_set_network_header()

 include/uapi/linux/bpf.h   |  3 ++
 net/core/filter.c  | 53 ++
 tools/include/uapi/linux/bpf.h |  3 ++
 3 files changed, 59 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 47d606d744cc..9762ef3a536c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1467,6 +1467,8 @@ union bpf_attr {
  *
  * * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer
  *   (room space is added or removed below the layer 3 header).
+ * * **BPF_ADJ_ROOM_DATA**: Adjust room at the beginning of the
+ *   packet (room space is added or removed below skb->data).
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2412,6 +2414,7 @@ enum bpf_func_id {
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
+   BPF_ADJ_ROOM_DATA,
 };
 
 /* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
diff --git a/net/core/filter.c b/net/core/filter.c
index 53d50fb75ea1..e56da3077dca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2884,6 +2884,57 @@ static int bpf_skb_adjust_net(struct sk_buff *skb, s32 
len_diff)
return ret;
 }
 
+static int bpf_skb_data_shrink(struct sk_buff *skb, u32 len)
+{
+   unsigned short hhlen = skb->dev->header_ops ?
+  skb->dev->hard_header_len : 0;
+   int ret;
+
+   ret = skb_unclone(skb, GFP_ATOMIC);
+   if (unlikely(ret < 0))
+   return ret;
+
+   __skb_pull(skb, len);
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, hhlen);
+   skb_reset_transport_header(skb);
+   return 0;
+}
+
+static int bpf_skb_data_grow(struct sk_buff *skb, u32 len)
+{
+   unsigned short hhlen = skb->dev->header_ops ?
+  skb->dev->hard_header_len : 0;
+   int ret;
+
+   ret = skb_cow(skb, len);
+   if (unlikely(ret < 0))
+   return ret;
+
+   skb_push(skb, len);
+   skb_reset_mac_header(skb);
+   return 0;
+}
+
+static int bpf_skb_adjust_data(struct sk_buff *skb, s32 len_diff)
+{
+   u32 len_diff_abs = abs(len_diff);
+   bool shrink = len_diff < 0;
+   int ret;
+
+   if (unlikely(len_diff_abs > 0xfffU))
+   return -EFAULT;
+
+   if (shrink && len_diff_abs >= skb_headlen(skb))
+   return -EFAULT;
+
+   ret = shrink ? bpf_skb_data_shrink(skb, len_diff_abs) :
+  bpf_skb_data_grow(skb, len_diff_abs);
+
+   bpf_compute_data_pointers(skb);
+   return ret;
+}
+
 BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
   u32, mode, u64, flags)
 {
@@ -2891,6 +2942,8 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, 
s32, len_diff,
return -EINVAL;
if (likely(mode == BPF_ADJ_ROOM_NET))
return bpf_skb_adjust_net(skb, len_diff);
+   if (likely(mode == BPF_ADJ_ROOM_DATA))
+   return bpf_skb_adjust_data(skb, len_diff);
 
return -ENOTSUPP;
 }
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 852dc17ab47a..47407fd5162b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1467,6 +1467,8 @@ union bpf_attr {
  *
  * * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer
  *   (room space is added or removed below the layer 3 header).
+ * * **BPF_ADJ_ROOM_DATA**: Adjust room at the beginning of the
+ *   packet (room space is added or removed below skb->data).
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2408,6 +2410,7 @@ enum bpf_func_id {
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
+   BPF_ADJ_ROOM_DATA,
 };
 
 /* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
-- 
2.18.0
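
For illustration, a minimal cls_bpf program using the new mode (a hypothetical
example assuming the patch above is applied; the helper headers, label value
and TTL are illustrative):

	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include "bpf_helpers.h"
	#include "bpf_endian.h"

	SEC("classifier")
	int push_mpls(struct __sk_buff *skb)
	{
		/* label 20, S=1, TTL=64 */
		__u32 hdr = bpf_htonl((20 << 12) | (1 << 8) | 64);

		/* grow 4 bytes of room below skb->data for the MPLS header */
		if (bpf_skb_adjust_room(skb, 4, BPF_ADJ_ROOM_DATA, 0))
			return TC_ACT_SHOT;
		if (bpf_skb_store_bytes(skb, 0, &hdr, sizeof(hdr), 0))
			return TC_ACT_SHOT;
		return TC_ACT_OK;
	}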



BUG: sleeping function called from invalid context at mm/slab.h:421

2018-11-13 Thread Naresh Kamboju
While running kernel selftests bpf test_cgroup_storage test this
kernel BUG reported every time on all devices running Linux -next
4.20.0-rc2-next-20181113 (from 4.19.0-rc5-next-20180928).
This kernel BUG log is from x86_64 machine.

Do you see at your end ?

[   73.047526] BUG: sleeping function called from invalid context at
/srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
[   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name:
test_cgroup_sto
[   73.068342] INFO: lockdep is turned off.
[   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
4.20.0-rc2-next-20181113 #1
[   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[   73.088018] Call Trace:
[   73.090463]  dump_stack+0x70/0xa5
[   73.093783]  ___might_sleep+0x152/0x240
[   73.097619]  __might_sleep+0x4a/0x80
[   73.101191]  __kmalloc_node+0x1cf/0x2f0
[   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
[   73.109909]  cgroup_storage_update_elem+0x46/0x90
[   73.114608]  map_update_elem+0x4a1/0x4c0
[   73.118534]  __x64_sys_bpf+0x124/0x280
[   73.122286]  do_syscall_64+0x4f/0x190
[   73.125952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   73.131004] RIP: 0033:0x7f46b93ea7f9
[   73.134581] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f e6 2b 00 f7 d8 64 89
01 48
[   73.153318] RSP: 002b:7fffc6595858 EFLAGS: 0206 ORIG_RAX:
0141
[   73.160876] RAX: ffda RBX: 014a0260 RCX: 7f46b93ea7f9
[   73.167999] RDX: 0048 RSI: 7fffc65958a0 RDI: 0002
[   73.175124] RBP: 7fffc6595870 R08: 7fffc65958a0 R09: 7fffc65958a0
[   73.182246] R10: 7fffc65958a0 R11: 0206 R12: 0003
[   73.189369] R13: 0004 R14: 0005 R15: 0006
selftests: bpf: test_cgroup_storage

Please find full test log,
https://lkft.validation.linaro.org/scheduler/job/506640#L2874

Thank you.
Best regards
Naresh Kamboju


Re: [PATCH net-next 01/17] net: sched: refactor mini_qdisc_pair_swap() to use workqueue

2018-11-13 Thread David Miller
From: Vlad Buslov 
Date: Tue, 13 Nov 2018 13:13:19 +

> 
> On Mon 12 Nov 2018 at 17:28, David Miller  wrote:
>> From: Vlad Buslov 
>> Date: Mon, 12 Nov 2018 09:55:30 +0200
>>
>>> +void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
>>> + struct tcf_proto *tp_head)
>>> +{
>>> +   xchg(&miniqp->tp_head, tp_head);
>>
>> If you are not checking the return value of xchg(), then this is
>> simply a store with optionally a memory barrier of some sort
>> either before or after.
> 
> That was my intention. What would be a better way to atomically
> reset a pointer? Should I just change this line to explicit
> assignment+barrier?

We have all kinds of helpers in the kernel for doing things like
this, grep for things like "smp_load_acquire", "smp_store_release()"
etc.
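
For instance, a store/load pairing along those lines (a sketch only; whether
acquire/release semantics are actually needed depends on what readers of
tp_head assume):

	/* update side */
	smp_store_release(&miniqp->tp_head, tp_head);

	/* read side */
	struct tcf_proto *tp_head = smp_load_acquire(&miniqp->tp_head);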



[iproute PATCH v2] man: ip-route.8: Document nexthop limit

2018-11-13 Thread Phil Sutter
Add a note to 'nexthop' description stating the maximum number of
nexthops per command and pointing at 'append' command as a workaround.

Signed-off-by: Phil Sutter 
---
Changes since v1:
- Reviewed text.
- 'route append' trick works with IPv6 only.
---
 man/man8/ip-route.8.in | 9 +
 1 file changed, 9 insertions(+)

diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index a33ce1f0f4006..11dd4ca7abf68 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -589,6 +589,15 @@ argument lists:
 route reflecting its relative bandwidth or quality.
 .in -8
 
+The internal buffer used in iproute2 limits the maximum number of nexthops that
+may be specified in one go. If only
+.I ADDRESS
+is given, the current buffer size allows for 144 IPv6 nexthops and 253 IPv4
+ones. For IPv4, this effectively limits the number of nexthops possible per
+route. With IPv6, further nexthops may be appended to the same route via
+.B "ip route append"
+command.
+
 .TP
 .BI scope " SCOPE_VAL"
 the scope of the destinations covered by the route prefix.
-- 
2.19.0



Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio
On Tue, 13 Nov 2018 13:58:05 +
Vlad Buslov  wrote:

> On Tue 13 Nov 2018 at 13:40, Stefano Brivio  wrote:
> > On Tue, 13 Nov 2018 13:25:52 +
> > Vlad Buslov  wrote:
> >  
> >> On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:  
> >> > Hi Vlad,
> >> >
> >> > On Mon, 12 Nov 2018 09:55:45 +0200
> >> > Vlad Buslov  wrote:
> >> >
> >> >> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct 
> >> >> work_struct *work)
> >> >> rtnl_unlock();
> >> >>  }
> >> >>  
> >> >> +/* Helper function to lock rtnl mutex when specified condition is true 
> >> >> and mutex
> >> >> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking 
> >> >> rtnl lock.
> >> >> + * Note that this function does nothing if rtnl is already held. This 
> >> >> is
> >> >> + * intended to be used by cls API rules update API when multiple 
> >> >> conditions
> >> >> + * could require rtnl lock and its state needs to be tracked to 
> >> >> prevent trying
> >> >> + * to obtain lock multiple times.
> >> >> + */
> >> >> +
> >> >> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> >> >> +{
> >> >> +   if (!*rtnl_held && cond) {
> >> >> +   *rtnl_held = true;
> >> >> +   rtnl_lock();
> >> >> +   }
> >> >> +}
> >> >
> >> > I guess calls to this function are supposed to be serialised. If that's
> >> > the case (which is my tentative understanding so far), I would indicate
> >> > that in the comment.
> >> >
> >> > If that's not the case, you would be introducing a race I guess.
> >> >
> >> > Same applies to tcf_block_release() from 17/17.
> >> 
> >> Hi Stefano,
> >> 
> >> Thank you for reviewing my code!
> >> 
> >> I did not intend for this function to be serialized. First argument to
> >> tcf_require_rtnl() is passed by value, and second argument is always a
> >> pointer to local stack-allocated value of the caller.  
> >
> > Yes, sorry, I haven't been terribly clear, that's what I meant by
> > serialised: it won't be called concurrently with the same *rtnl_held.
> >
> > Perhaps the risk that somebody uses it that way is close to zero, so
> > I'm not even too sure this is worth a comment, but if you can come up
> > with a concise way of saying this, that would be nice.  
> 
> I considered my comment that function "Will set rtnl_held to 'true'
> before taking rtnl lock" as a red flag for caller to not pass pointer to
> a variable that can be accessed concurrently. I guess I can add
> additional sentence to explicitly warn potential users. Or I can just
> move rtnl_held assignment in both functions to be performed while
> holding rtnl mutex. I implemented it the way I did as an overzealous
> optimization, but realistically price of an assignment is negligible in
> this case.

But to make that effective, you would need to protect the read too, and
that makes your optimisation not really overzealous I think.

I'd rather go with an additional comment, if that doesn't become
unreadable.

-- 
Stefano


Re: [PATCH bpf-next v2] bpftool: make libbfd optional

2018-11-13 Thread Stanislav Fomichev
On 11/13, Quentin Monnet wrote:
> 2018-11-12 14:02 UTC-0800 ~ Jakub Kicinski 
> > On Mon, 12 Nov 2018 13:44:10 -0800, Stanislav Fomichev wrote:
> >> Make it possible to build bpftool without libbfd. libbfd and libopcodes are
> >> typically provided in dev/dbg packages (binutils-dev in debian) which we
> >> usually don't have installed on the fleet machines and we'd like a way to 
> >> have
> >> bpftool version that works without installing any additional packages.
> >> This excludes support for disassembling jit-ted code and prints an error if
> >> the user tries to use these features.
> >>
> >> Tested by:
> >> cat > FEATURES_DUMP.bpftool <<EOF
> >> feature-libbfd=0
> >> feature-disassembler-four-args=1
> >> feature-reallocarray=0
> >> feature-libelf=1
> >> feature-libelf-mmap=1
> >> feature-bpf=1
> >> EOF
> >> FEATURES_DUMP=$PWD/FEATURES_DUMP.bpftool make
> >> ldd bpftool | grep libbfd
> >>
> >> Signed-off-by: Stanislav Fomichev 
> > 
> > Seems reasonable, thanks!
> > 
> > Acked-by: Jakub Kicinski 
> > 
> 
> Thanks Stanislav!
> 
> There is a problem with this patch on some distributions, Ubuntu at least.
> 
> Feature detection for libbfd has been used for perf before being also
> used with bpftool. Since commit 280e7c48c3b8 the feature needs libz and
> libiberty to be present on the system, otherwise the feature would not
> compile (and be detected) on OpenSuse.
> 
> On Ubuntu, libiberty is not needed (libbfd might be statically linked
> against it, if I remember correctly?), which means that we are able to
> build bpftool as long as binutils-dev has been installed, even if
> libiberty-dev has not been installed. The BFD feature, in that case,
> will appear as “undetected”. It is a bug. But since the Makefile does
> not stop compilation in that case (another bug), in the end we're good.
> 
> With your patch, the problem is that libbpf detection will fail on
> Ubuntu if libiberty-dev is not present, even though all the necessary
> libraries for using the JIT disassembler are available. And in that case
> it _will_ make a difference, since the Makefile will no more compile the
> libbfd-related bits.
> 
> So I'm not against the idea, but we have to fix libbfd detection first.
Yeah, libbfd feature detection looks broken on Ubuntu; this patch just
exercises that brokenness :-). I can take a look sometime this week,
thanks for spotting it!

I wonder, how does it work on opensuse currently if we link only against -lbfd
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool/Makefile#n56)?
It might be broken everywhere...

(Resent with proper formatting and without HTML).
> Thanks,
> Quentin


[iproute PATCH] ip-address: Fix filtering by negated address flags

2018-11-13 Thread Phil Sutter
When disabling a flag, one needs to AND with the inverse not the flag
itself. Otherwise specifying for instance 'home -nodad' will effectively
clear the flags variable.

While at it, simplify the code a bit by merging the common parts of the
negated and non-negated case branches. Also allow the "special cases" to
be inverted, too.

Fixes: f73ac674d0abf ("ip: change flag names to an array")
Signed-off-by: Phil Sutter 
---
 ip/ipaddress.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index cd8cc76a3473f..7b98877085878 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1211,31 +1211,35 @@ static void print_ifa_flags(FILE *fp, const struct 
ifaddrmsg *ifa,
 
 static int get_filter(const char *arg)
 {
+   bool inv = false;
unsigned int i;
 
+   if (arg[0] == '-') {
+   inv = true;
+   arg++;
+   }
/* Special cases */
if (strcmp(arg, "dynamic") == 0) {
-   filter.flags &= ~IFA_F_PERMANENT;
+   if (inv)
+   filter.flags |= IFA_F_PERMANENT;
+   else
+   filter.flags &= ~IFA_F_PERMANENT;
filter.flagmask |= IFA_F_PERMANENT;
} else if (strcmp(arg, "primary") == 0) {
-   filter.flags &= ~IFA_F_SECONDARY;
+   if (inv)
+   filter.flags |= IFA_F_SECONDARY;
+   else
+   filter.flags &= ~IFA_F_SECONDARY;
filter.flagmask |= IFA_F_SECONDARY;
-   } else if (*arg == '-') {
-   for (i = 0; i < ARRAY_SIZE(ifa_flag_names); i++) {
-   if (strcmp(arg + 1, ifa_flag_names[i].name))
-   continue;
-
-   filter.flags &= ifa_flag_names[i].value;
-   filter.flagmask |= ifa_flag_names[i].value;
-   return 0;
-   }
-
-   return -1;
} else {
for (i = 0; i < ARRAY_SIZE(ifa_flag_names); i++) {
if (strcmp(arg, ifa_flag_names[i].name))
continue;
-   filter.flags |= ifa_flag_names[i].value;
+
+   if (inv)
+   filter.flags &= ~ifa_flag_names[i].value;
+   else
+   filter.flags |= ifa_flag_names[i].value;
filter.flagmask |= ifa_flag_names[i].value;
return 0;
}
-- 
2.19.0
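
To illustrate the flag arithmetic the fix restores (flag values from
include/uapi/linux/if_addr.h):

	unsigned int flags = IFA_F_HOMEADDRESS;	/* 0x10, set by 'home'      */

	/* before the fix: AND with the flag itself wipes all other bits */
	/* flags &= IFA_F_NODAD;  ->  flags == 0, 'home' is lost         */

	/* after the fix: AND with the inverse clears only that bit      */
	flags &= ~IFA_F_NODAD;			/* 0x02; 0x10 stays set     */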



Re: [PATCH net] l2tp: fix a sock refcnt leak in l2tp_tunnel_register

2018-11-13 Thread Guillaume Nault
On Tue, Nov 13, 2018 at 01:08:25AM +0800, Xin Long wrote:
> This issue happens when trying to add an existing tunnel. It
> doesn't call sock_put() before returning -EEXIST to release
> the sock refcnt that was held by calling sock_hold() before
> the existence check.
> 
> This patch is to fix it by holding the sock after doing the
> existence check.
> 
Nice fix. Thanks Xin!

Reviewed-by: Guillaume Nault 


Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Vlad Buslov


On Tue 13 Nov 2018 at 13:40, Stefano Brivio  wrote:
> On Tue, 13 Nov 2018 13:25:52 +
> Vlad Buslov  wrote:
>
>> On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:
>> > Hi Vlad,
>> >
>> > On Mon, 12 Nov 2018 09:55:45 +0200
>> > Vlad Buslov  wrote:
>> >  
>> >> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct 
>> >> work_struct *work)
>> >>   rtnl_unlock();
>> >>  }
>> >>  
>> >> +/* Helper function to lock rtnl mutex when specified condition is true 
>> >> and mutex
>> >> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking 
>> >> rtnl lock.
>> >> + * Note that this function does nothing if rtnl is already held. This is
>> >> + * intended to be used by cls API rules update API when multiple 
>> >> conditions
>> >> + * could require rtnl lock and its state needs to be tracked to prevent 
>> >> trying
>> >> + * to obtain lock multiple times.
>> >> + */
>> >> +
>> >> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
>> >> +{
>> >> + if (!*rtnl_held && cond) {
>> >> + *rtnl_held = true;
>> >> + rtnl_lock();
>> >> + }
>> >> +}  
>> >
>> > I guess calls to this function are supposed to be serialised. If that's
>> > the case (which is my tentative understanding so far), I would indicate
>> > that in the comment.
>> >
>> > If that's not the case, you would be introducing a race I guess.
>> >
>> > Same applies to tcf_block_release() from 17/17.  
>> 
>> Hi Stefano,
>> 
>> Thank you for reviewing my code!
>> 
>> I did not intend for this function to be serialized. First argument to
>> tcf_require_rtnl() is passed by value, and second argument is always a
>> pointer to local stack-allocated value of the caller.
>
> Yes, sorry, I haven't been terribly clear, that's what I meant by
> serialised: it won't be called concurrently with the same *rtnl_held.
>
> Perhaps the risk that somebody uses it that way is close to zero, so
> I'm not even too sure this is worth a comment, but if you can come up
> with a concise way of saying this, that would be nice.

I considered my comment that function "Will set rtnl_held to 'true'
before taking rtnl lock" as a red flag for caller to not pass pointer to
a variable that can be accessed concurrently. I guess I can add
additional sentence to explicitly warn potential users. Or I can just
move rtnl_held assignment in both functions to be performed while
holding rtnl mutex. I implemented it the way I did as an overzealous
optimization, but realistically price of an assignment is negligible in
this case. Suggestions are welcome!

>
>> Same applies to tcf_block_release() - its arguments are Qdisc and block
>> which support concurrency-safe reference counting, and pointer to local
>> variable rtnl_held, which is not accessible to concurrent users.
>
> Same there.
>
>> What is the race in these cases? Am I missing something?
>
> No, no race then. My only concern was:
>
> thread A:                          thread B:
> - x = false;
> - tcf_require_rtnl(true, &x);      - tcf_require_rtnl(true, &x);
>   - if (!*x && true)                 - if (!*x && true)
>     - *x = true;
>     - rtnl_lock()                      - *x = true;
>                                        - rtnl_lock()
>
> but this cannot happen as you explained.



Re: [PATCH net-next 17/17] net: sched: unlock rules update API

2018-11-13 Thread Vlad Buslov
On Mon 12 Nov 2018 at 17:30, David Miller  wrote:
> From: Vlad Buslov 
> Date: Mon, 12 Nov 2018 09:55:46 +0200
>
>> Register netlink protocol handlers for message types RTM_NEWTFILTER,
>> RTM_DELTFILTER, RTM_GETTFILTER as unlocked. Set rtnl_held variable that
>> tracks rtnl mutex state to be false by default.
>
> This whole conditional locking mechanism is really not clean and makes
> this code so much harder to understand and audit.
>
> Please improve the code so that this kind of construct is not needed.
>
> Thank you.

Hi David,

I considered several approaches to this problem and decided that this
one is the most straightforward to implement. I understand your concern
and agree that this code is not the easiest to understand. I can suggest
several possible solutions that do not require this kind of elaborate
locking mechanism in the cls API, but they have their own drawbacks:

1. Convert all qdiscs and classifiers to support unlocked execution,
like we did for actions. However, judging by my experience converting
the flower classifier, these require much more code than actions did.
I would estimate it to be more work than the whole current unlocking
effort (a hundred-plus patches). Also, the authors of some of them might
be unhappy with such intrusive changes. I don't think this approach is
realistic.

2. Somehow determine if rtnl is needed at the beginning of the cls API
rule update functions. Currently, this is not possible because locking
requirements are determined by the qdisc_class_ops and tcf_proto_ops
'flags' fields, which requires the code to first do the whole ops lookup
sequence. However, instead of a class field, I could put 'flags' in some
kind of hash table or array that maps the qdisc/classifier type string
to flags, so that locking requirements can be determined by just parsing
the netlink message and looking up flags by qdisc/classifier type. I do
not consider that a pretty solution either, but maybe you have a
different opinion.

3. Anything you can suggest? I might be missing something simple that
you would consider a more elegant solution to this problem.

Thanks,
Vlad



Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio
On Tue, 13 Nov 2018 13:25:52 +
Vlad Buslov  wrote:

> On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:
> > Hi Vlad,
> >
> > On Mon, 12 Nov 2018 09:55:45 +0200
> > Vlad Buslov  wrote:
> >  
> >> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct work_struct 
> >> *work)
> >>rtnl_unlock();
> >>  }
> >>  
> >> +/* Helper function to lock rtnl mutex when specified condition is true 
> >> and mutex
> >> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking 
> >> rtnl lock.
> >> + * Note that this function does nothing if rtnl is already held. This is
> >> + * intended to be used by cls API rules update API when multiple 
> >> conditions
> >> + * could require rtnl lock and its state needs to be tracked to prevent 
> >> trying
> >> + * to obtain lock multiple times.
> >> + */
> >> +
> >> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> >> +{
> >> +  if (!*rtnl_held && cond) {
> >> +  *rtnl_held = true;
> >> +  rtnl_lock();
> >> +  }
> >> +}  
> >
> > I guess calls to this function are supposed to be serialised. If that's
> > the case (which is my tentative understanding so far), I would indicate
> > that in the comment.
> >
> > If that's not the case, you would be introducing a race I guess.
> >
> > Same applies to tcf_block_release() from 17/17.  
> 
> Hi Stefano,
> 
> Thank you for reviewing my code!
> 
> I did not intend for this function to be serialized. First argument to
> tcf_require_rtnl() is passed by value, and second argument is always a
> pointer to local stack-allocated value of the caller.

Yes, sorry, I haven't been terribly clear, that's what I meant by
serialised: it won't be called concurrently with the same *rtnl_held.

Perhaps the risk that somebody uses it that way is close to zero, so
I'm not even too sure this is worth a comment, but if you can come up
with a concise way of saying this, that would be nice.

> Same applies to tcf_block_release() - its arguments are Qdisc and block
> which support concurrency-safe reference counting, and pointer to local
> variable rtnl_held, which is not accessible to concurrent users.

Same there.

> What is the race in these cases? Am I missing something?

No, no race then. My only concern was:

thread A:                          thread B:
- x = false;
- tcf_require_rtnl(true, &x);      - tcf_require_rtnl(true, &x);
  - if (!*x && true)                 - if (!*x && true)
    - *x = true;
    - rtnl_lock()                      - *x = true;
                                       - rtnl_lock()

but this cannot happen as you explained.

-- 
Stefano


Re: [PATCH net-next 02/17] net: sched: protect block state with spinlock

2018-11-13 Thread Vlad Buslov


On Tue 13 Nov 2018 at 10:07, Stefano Brivio  wrote:
> Vlad,
>
> On Mon, 12 Nov 2018 09:28:59 -0800 (PST)
> David Miller  wrote:
>
>> From: Vlad Buslov 
>> Date: Mon, 12 Nov 2018 09:55:31 +0200
>> 
>> > +#define ASSERT_BLOCK_LOCKED(block) \
>> > +  WARN_ONCE(!spin_is_locked(&(block)->lock),  \
>> > +"BLOCK: assertion failed at %s (%d)\n", __FILE__,  __LINE__)  
>> 
>> spin_is_locked() is not usable for assertions.
>
> See also b86077207d0c ("igbvf: Replace spin_is_locked() with
> lockdep").

Stefano,

Thanks for the tip. I will check it out.

Vlad.
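
Following the lockdep approach referenced above, the assertion could become
(a sketch, not the eventual patch):

	#define ASSERT_BLOCK_LOCKED(block) \
		lockdep_assert_held(&(block)->lock)

lockdep_assert_held() also verifies that the current context holds the lock,
which spin_is_locked() cannot, and it compiles away when lockdep is disabled.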



Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Vlad Buslov


On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:
> Hi Vlad,
>
> On Mon, 12 Nov 2018 09:55:45 +0200
> Vlad Buslov  wrote:
>
>> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct work_struct 
>> *work)
>>  rtnl_unlock();
>>  }
>>  
>> +/* Helper function to lock rtnl mutex when specified condition is true and 
>> mutex
>> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking rtnl 
>> lock.
>> + * Note that this function does nothing if rtnl is already held. This is
>> + * intended to be used by cls API rules update API when multiple conditions
>> + * could require rtnl lock and its state needs to be tracked to prevent 
>> trying
>> + * to obtain lock multiple times.
>> + */
>> +
>> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
>> +{
>> +if (!*rtnl_held && cond) {
>> +*rtnl_held = true;
>> +rtnl_lock();
>> +}
>> +}
>
> I guess calls to this function are supposed to be serialised. If that's
> the case (which is my tentative understanding so far), I would indicate
> that in the comment.
>
> If that's not the case, you would be introducing a race I guess.
>
> Same applies to tcf_block_release() from 17/17.

Hi Stefano,

Thank you for reviewing my code!

I did not intend for this function to be serialized. First argument to
tcf_require_rtnl() is passed by value, and second argument is always a
pointer to local stack-allocated value of the caller. Same applies to
tcf_block_release() - its arguments are Qdisc and block which support
concurrency-safe reference counting, and pointer to local variable
rtnl_held, which is not accessible to concurrent users.

What is the race in these cases? Am I missing something?

Vlad


Re: [PATCH net-next 01/17] net: sched: refactor mini_qdisc_pair_swap() to use workqueue

2018-11-13 Thread Vlad Buslov


On Mon 12 Nov 2018 at 17:28, David Miller  wrote:
> From: Vlad Buslov 
> Date: Mon, 12 Nov 2018 09:55:30 +0200
>
>> +void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
>> +  struct tcf_proto *tp_head)
>> +{
>> +xchg(&miniqp->tp_head, tp_head);
>
> If you are not checking the return value of xchg(), then this is
> simply a store with optionally a memory barrier of some sort
> either before or after.

That was my intention. What would be a better way to atomically
reset a pointer? Should I just change this line to explicit
assignment+barrier?


[iproute PATCH] ip-route: Fix nexthop encap parsing

2018-11-13 Thread Phil Sutter
When parsing nexthop parameters, a buffer of 4k bytes is provided. Yet,
in lwt_parse_encap() and some functions called by it, the buffer size was
assumed to be 1k even though the actual size was passed in. This led to
spurious buffer size errors if previous nexthop parameters had already
filled the buffer beyond that 1k boundary.

Fixes: 1e5293056a02c ("lwtunnel: Add encapsulation support to ip route")
Fixes: 5866bddd9aa9e ("ila: Add support for ILA lwtunnels")
Fixes: ed67f83806538 ("ila: Support for checksum neutral translation")
Fixes: 86905c8f057c0 ("ila: support for configuring identifier and hook types")
Fixes: b15f440e78373 ("lwt: BPF support for LWT")
Signed-off-by: Phil Sutter 
---
 ip/iproute_lwtunnel.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 8f49701509d20..85ab13cb31746 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -860,7 +860,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 
argc--; argv++;
 
-   if (rta_addattr64(rta, 1024, ILA_ATTR_LOCATOR, locator))
+   if (rta_addattr64(rta, len, ILA_ATTR_LOCATOR, locator))
return -1;
 
while (argc > 0) {
@@ -874,7 +874,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
invarg("\"csum-mode\" value is invalid\n",
   *argv);
 
-   ret = rta_addattr8(rta, 1024, ILA_ATTR_CSUM_MODE,
+   ret = rta_addattr8(rta, len, ILA_ATTR_CSUM_MODE,
   (__u8)csum_mode);
 
argc--; argv++;
@@ -888,7 +888,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
invarg("\"ident-type\" value is invalid\n",
   *argv);
 
-   ret = rta_addattr8(rta, 1024, ILA_ATTR_IDENT_TYPE,
+   ret = rta_addattr8(rta, len, ILA_ATTR_IDENT_TYPE,
   (__u8)ident_type);
 
argc--; argv++;
@@ -902,7 +902,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
invarg("\"hook-type\" value is invalid\n",
   *argv);
 
-   ret = rta_addattr8(rta, 1024, ILA_ATTR_HOOK_TYPE,
+   ret = rta_addattr8(rta, len, ILA_ATTR_HOOK_TYPE,
   (__u8)hook_type);
 
argc--; argv++;
@@ -1034,7 +1034,7 @@ static int parse_encap_bpf(struct rtattr *rta, size_t 
len, int *argcp,
if (get_unsigned(&headroom, *argv, 0) || headroom == 0)
invarg("headroom is invalid\n", *argv);
if (!headroom_set)
-   rta_addattr32(rta, 1024, LWT_BPF_XMIT_HEADROOM,
+   rta_addattr32(rta, len, LWT_BPF_XMIT_HEADROOM,
  headroom);
headroom_set = 1;
} else if (strcmp(*argv, "help") == 0) {
@@ -1075,7 +1075,7 @@ int lwt_parse_encap(struct rtattr *rta, size_t len, int 
*argcp, char ***argvp)
exit(-1);
}
 
-   nest = rta_nest(rta, 1024, RTA_ENCAP);
+   nest = rta_nest(rta, len, RTA_ENCAP);
switch (type) {
case LWTUNNEL_ENCAP_MPLS:
ret = parse_encap_mpls(rta, len, , );
@@ -1108,7 +1108,7 @@ int lwt_parse_encap(struct rtattr *rta, size_t len, int 
*argcp, char ***argvp)
 
rta_nest_end(rta, nest);
 
-   ret = rta_addattr16(rta, 1024, RTA_ENCAP_TYPE, type);
+   ret = rta_addattr16(rta, len, RTA_ENCAP_TYPE, type);
 
*argcp = argc;
*argvp = argv;
-- 
2.19.0



Re: [iproute PATCH] man: ip-route.8: Document nexthop limit

2018-11-13 Thread Phil Sutter
Hi David,

On Mon, Nov 12, 2018 at 04:37:48PM -0800, David Ahern wrote:
> On 11/12/18 2:21 PM, Phil Sutter wrote:
> > diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
> > index a33ce1f0f4006..383178c11331e 100644
> > --- a/man/man8/ip-route.8.in
> > +++ b/man/man8/ip-route.8.in
> > @@ -589,6 +589,13 @@ argument lists:
> >  route reflecting its relative bandwidth or quality.
> >  .in -8
> >  
> > +The internal buffer used in iproute2 limits the maximum number of nexthops 
> > to
> > +be specified in one go. If only a gateway address is given, the current 
> > buffer
> > +size allows for 144 IPv6 nexthops and 253 IPv4 ones. If more are required, 
> > they
> > +may be added to the existing route using
> > +.B "ip route append"
> > +command.
> > +
> 
> That is not true for IPv4. 'ip ro append' adds a new route after the
> existing route - an entry that can not be hit unless all of the nexthops
> in the first route are down. 'ip ro prepend' adds a new entry before the
> existing one meaning it takes precedence over the existing entries.

Oh, thanks for clarifying. I'll follow-up with a fixed version.

> For IPv6, 'append' and 'prepend' both add new nexthops to the existing
> entry.

'ip route prepend' is not even documented. :(

Thanks, Phil


Re: VETH & AF_PACKET problem

2018-11-13 Thread Anand H. Krishnan
[bump]

Hello,

Any ideas? AF_PACKET and veth seem to be incompatible. I am using DPDK to bind
af_packet to one end of a VETH pair, and the other end of the pair is
in a container.

Thanks,
Anand
On Mon, Nov 12, 2018 at 12:42 PM Anand H. Krishnan
 wrote:
>
> Hello,
>
> We are seeing a problem with AF_PACKET when used along with the
> veth interfaces. SCP complains that message authentication code is
> incorrect.
>
> I was browsing the code and I see that veth_xmit calls dev_forward_skb
> which does a skb_scrub_packet, which in turn calls the skb destructor 
> function.
>
> In the case of packets coming from the AF_PACKET socket, the destructor
> function seems to set all the mmap-ed pages to be available for user space to
> copy any new packet it wants. Isn't this a problem?
>
> skb_orphan_frags, called by dev_forward_skb, seems to do the right thing,
> but it probably does not get called for packets from AF_PACKET socket, since 
> the
> skb is not a zero copy skb (SKBTX_DEV_ZEROCOPY is not set).
>
> Did I miss something basic here?
>
> (Please cc me, since I am not part of this list)
>
> Thanks,
> Anand


Re: [PATCH net-next 02/17] net: sched: protect block state with spinlock

2018-11-13 Thread Stefano Brivio
Vlad,

On Mon, 12 Nov 2018 09:28:59 -0800 (PST)
David Miller  wrote:

> From: Vlad Buslov 
> Date: Mon, 12 Nov 2018 09:55:31 +0200
> 
> > +#define ASSERT_BLOCK_LOCKED(block) \
> > +   WARN_ONCE(!spin_is_locked(&(block)->lock),  \
> > + "BLOCK: assertion failed at %s (%d)\n", __FILE__,  __LINE__)  
> 
> spin_is_locked() is not usable for assertions.

See also b86077207d0c ("igbvf: Replace spin_is_locked() with lockdep").

-- 
Stefano


Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio
Hi Vlad,

On Mon, 12 Nov 2018 09:55:45 +0200
Vlad Buslov  wrote:

> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct work_struct 
> *work)
>   rtnl_unlock();
>  }
>  
> +/* Helper function to lock rtnl mutex when specified condition is true and 
> mutex
> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking rtnl 
> lock.
> + * Note that this function does nothing if rtnl is already held. This is
> + * intended to be used by cls API rules update API when multiple conditions
> + * could require rtnl lock and its state needs to be tracked to prevent 
> trying
> + * to obtain lock multiple times.
> + */
> +
> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> +{
> + if (!*rtnl_held && cond) {
> + *rtnl_held = true;
> + rtnl_lock();
> + }
> +}

I guess calls to this function are supposed to be serialised. If that's
the case (which is my tentative understanding so far), I would indicate
that in the comment.

If that's not the case, you would be introducing a race I guess.

Same applies to tcf_block_release() from 17/17.

-- 
Stefano