Re: [PATCH RFC,WIP 0/5] Flow offload infrastructure

2017-11-03 Thread Florian Fainelli
Hi Pablo,

On 11/03/2017 08:26 AM, Pablo Neira Ayuso wrote:
> Hi,
> 
> This patch adds the flow offload infrastructure for Netfilter. This adds
> a new 'nf_flow_offload' module that registers a hook at ingress. Every
> packet that hits the flow table is forwarded to where the flow table
> entry specifies in terms of destination/gateway and netdevice. In case
> of flow table miss, the packet follows the classic forward path.
> 
> This flow table is populated via the new nftables VM action
> 'flow_offload', so the user can selectively specify what flows are
> placed into the flow table, an example ruleset would look like this:
> 
> table inet x {
> chain y {
> type filter hook forward priority 0; policy accept;
> ip protocol tcp flow offload counter
> counter
> }
> }
> 
> The 'flow offload' action adds the flow entry once the flow is in
> established state, according to the connection tracking definition, i.e.
> we have seen traffic in both directions. Therefore, only initial packets
> of the flow follow the classic forwarding path.
> 
> * Patch 1/5 is nothing really interesting, just a little preparation change.
> 
> * Patch 2/5 adds a software flow table representation. It uses an
>   rhashtable and provides an API to operate on it; it also introduces
>   the 'struct flow_offload' that represents a flow table entry. There's
>   a garbage collector kernel thread that cleans up entries for which we
>   have not seen any packet for a while.
> 
> * Patch 3/5 Just adds the missing bits to integrate the software flow
>   table with conntrack. The software flow table owns the conntrack
>   object, so it is basically responsible for releasing it. Conntrack
>   entries that have been offloaded in the conntrack table will look like
>   this:
> 
> ipv4 2 tcp  6 src=10.141.10.2 dst=147.75.205.195 sport=36392 
> dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 
> [OFFLOAD] use=2
> 
> * Patch 4/5 adds the extension for nf_tables that can be used to select
>   what flows are offloaded through policy.
> 
> * Patch 5/5 Switches and NICs come with built-in flow tables. I've been
>   observing out-of-tree patches in OpenWRT/LEDE to integrate this into
>   Netfilter for a little while. This patch adds the ndo hooks to
>   populate the hardware flow table. This patch adds a workqueue to
>   configure it from user context - we need to hold the mdio mutex for
>   this. There will be a little delay until packets follow the hardware
>   path, so packets will follow the software flow table path for a
>   little while until they start going through hardware.
> 
> I'm measuring here that the software flow table forwarding path is 2.5
> times faster than the classic forwarding path in my testbed.
> 
> TODO, still many things:
> 
> * Only IPv4 at this time.
> * Only IPv4 SNAT is supported.
> * No netns support yet.
> * Missing netlink interface to operate with the flow table, to force the
>   handover of flow to the software path.
> * Higher configurability: instead of registering the flow table
>   unconditionally, add an interface to specify software flow table
>   properties.
> * No flow counters at this time.
> 
> This should serve a number of use cases where we can rely on this kernel
> bypass. Packets that need fragmentation / PMTU / IP option handling /
> ... or any other specific handling should be passed up to the classic
> forwarding path.
> 
> Comments welcome,

A lot of us have been waiting for this for some time, so thanks a lot
for posting the patches. At first glance this seems to cover most of the
HW that I know about out there, and it does so without much added code,
which is great. Did you have a particular platform you experimented
with, and if so, should we expect patches to be posted showing how it
integrates with real hardware?

Thanks!

> Thanks.
> 
> Pablo Neira Ayuso (5):
>   netfilter: nf_conntrack: move nf_ct_netns_{get,put}() to core
>   netfilter: add software flow offload infrastructure
>   netfilter: nf_flow_offload: integration with conntrack
>   netfilter: nf_tables: flow offload expression
>   netfilter: nft_flow_offload: add ndo hooks for hardware offload
> 
>  include/linux/netdevice.h  |   4 +
>  include/net/flow_offload.h |  67 
>  include/net/netfilter/nf_conntrack.h   |   3 +-
>  include/uapi/linux/netfilter/nf_conntrack_common.h |   4 +
>  include/uapi/linux/netfilter/nf_tables.h   |   9 +
>  net/netfilter/Kconfig  |  14 +
>  net/netfilter/Makefile |   4 +
>  net/netfilter/nf_conntrack_core.c  |   7 +-
>  net/netfilter/nf_conntrack_netlink.c   |  15 +-
>  net/netfilter/nf_conntrack_proto.c |  37 +-
>  net/netfilter/nf_conntrack_proto_tcp.c |   3 +
>  

Re: [PATCH RFC,WIP 4/5] netfilter: nf_tables: flow offload expression

2017-11-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> +static void nft_flow_offload_eval(const struct nft_expr *expr,
> +   struct nft_regs *regs,
> +   const struct nft_pktinfo *pkt)
> +{
[..]
> +	if (test_bit(IPS_HELPER_BIT, &ct->status))
> + goto out;
> +
> + if (ctinfo == IP_CT_NEW ||
> + ctinfo == IP_CT_RELATED)
> + goto out;

Would it make sense to delay offload decision until l4 tracker has
set ASSURED bit?
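The gating the quoted hunk implements - skip flows with a conntrack helper attached, skip flows not yet established, and (per the suggestion above) optionally also wait for the L4 tracker's ASSURED bit - can be modeled in a few lines. The enum values and struct below are simplified stand-ins, not the kernel's conntrack types.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for conntrack state; illustrative only. */
enum ct_state { CT_NEW, CT_RELATED, CT_ESTABLISHED };

struct conn {
	enum ct_state state;
	bool has_helper;   /* conntrack helper needs the slow path */
	bool assured;      /* L4 tracker saw sane two-way traffic */
};

static bool should_offload(const struct conn *ct, bool wait_for_assured)
{
	if (ct->has_helper)
		return false;            /* the IPS_HELPER_BIT check */
	if (ct->state == CT_NEW || ct->state == CT_RELATED)
		return false;            /* only established flows */
	if (wait_for_assured && !ct->assured)
		return false;            /* the suggested extra gate */
	return true;
}
```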


Re: [Patch net 2/2] net_sched: hold netns refcnt for each action

2017-11-03 Thread Cong Wang
On Wed, Nov 1, 2017 at 10:23 AM, Cong Wang  wrote:
> TC actions have been destroyed asynchronously for a long time,
> previously in a RCU callback and now in a workqueue. If we
> don't hold a refcnt for its netns, we could use the per netns
> data structure, struct tcf_idrinfo, after it has been freed by
> netns workqueue.
>
> Hold refcnt to ensure netns destroy happens after all actions
> are gone.

This in fact is wrong. If we hold that refcnt, the netns can never
be destroyed until all actions are destroyed by user, this breaks
our netns design.

I am going to send a revert and a right way to fix it. It is more
complicated than I thought due to all of these flying RCU callbacks
and workqueues again, sigh...

Sorry about it.


[PATCH net] tcp: fix DSACK-based undo on non-duplicate ACK

2017-11-03 Thread Priyaranjan Jha
Fixes DSACK-based undo when sender is in Open State and
an ACK advances snd_una.

Example scenario:
- Sender goes into recovery and makes some spurious rtx.
- It comes out of recovery and enters into open state.
- It sends some more packets, let's say 4.
- The receiver sends an ACK for the first two, but this ACK is lost.
- The sender later receives an ACK covering all four packets, along
  with a DSACK for the previous spurious rtx.
- Because this ACK advances snd_una, it is not a duplicate ACK, so the
  DSACK never reached tcp_fastretrans_alert() and the undo was missed.

Signed-off-by: Priyaranjan Jha 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Acked-by: Yousuk Seung 
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7eec3383702b..bf69bfbe593b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -114,7 +114,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP   (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
-#define FLAG_CA_ALERT  (FLAG_DATA_SACKED|FLAG_ECE)
+#define FLAG_CA_ALERT  (FLAG_DATA_SACKED|FLAG_ECE|FLAG_DSACKING_ACK)
 #define FLAG_FORWARD_PROGRESS  (FLAG_ACKED|FLAG_DATA_SACKED)
 
 #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
-- 
2.15.0.403.gc27cc4dac6-goog
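The one-line change above widens FLAG_CA_ALERT so that an ACK whose only "alert" condition is a DSACK still reaches the congestion-state machinery even when it is a non-duplicate ACK. The flag values below are copied from net/ipv4/tcp_input.c of this vintage; the tiny check is a userspace sketch of the before/after behavior.

```c
#include <assert.h>

/* Flag values as defined in net/ipv4/tcp_input.c around this patch. */
#define FLAG_DATA		0x01
#define FLAG_WIN_UPDATE		0x02
#define FLAG_DATA_ACKED		0x04
#define FLAG_SYN_ACKED		0x10
#define FLAG_DATA_SACKED	0x20
#define FLAG_ECE		0x40
#define FLAG_DSACKING_ACK	0x800

#define FLAG_ACKED	(FLAG_DATA_ACKED | FLAG_SYN_ACKED)
#define FLAG_NOT_DUP	(FLAG_DATA | FLAG_WIN_UPDATE | FLAG_ACKED)

/* Before the patch: a DSACK on a non-duplicate ACK raised no alert. */
#define FLAG_CA_ALERT_OLD	(FLAG_DATA_SACKED | FLAG_ECE)
/* After the patch: it does, so DSACK-based undo can run. */
#define FLAG_CA_ALERT_NEW	(FLAG_DATA_SACKED | FLAG_ECE | FLAG_DSACKING_ACK)

static int ca_alert(int flag, int mask)
{
	return (flag & mask) != 0;
}
```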



Re: [PATCH] Net: netfilter: vmalloc/vfree to kvmalloc/kvfree

2017-11-03 Thread Florian Westphal
Charlie Sale  wrote:
> + hinfo = kvmalloc(sizeof(*hinfo) + sizeof(struct hlist_head) * size,
> +  GPT_KERNEL);

Looks like you did not even compile test this.  Again. :-(


Re: [PATCH net-next] liquidio: Fix an issue with multiple switchdev enable disables

2017-11-03 Thread David Miller
From: Felix Manlunas 
Date: Fri, 3 Nov 2017 12:17:44 -0700

> From: Vijaya Mohan Guvva 
> 
> Return success if the same dispatch function is being registered for
> a given opcode and subcode, thereby allowing multiple switchdev
> enables and disables.
> 
> Signed-off-by: Vijaya Mohan Guvva 
> Signed-off-by: Satanand Burla 
> Signed-off-by: Felix Manlunas 

Applied, thanks.

But I do have a question, are you properly reference counting these
dispatch function objects?  I can't see how you can properly handle
multiple enable/disable otherwise.

Thank you.
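One way to make repeated registration safe - which is what the reference-counting question above is probing - is to count registrations per opcode: re-registering the same function bumps a use count, and the entry only disappears once every registration has been matched by an unregistration. This is a hedged userspace sketch, not the liquidio code; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative refcounted dispatch registry; not the liquidio code. */
typedef int (*dispatch_fn)(void *pkt);

struct dispatch_entry {
	int opcode;
	dispatch_fn fn;
	int refcnt;      /* outstanding registrations */
};

#define MAX_ENTRIES 8
static struct dispatch_entry entries[MAX_ENTRIES];

static int dispatch_register(int opcode, dispatch_fn fn)
{
	struct dispatch_entry *free_slot = NULL;

	for (int i = 0; i < MAX_ENTRIES; i++) {
		if (entries[i].refcnt && entries[i].opcode == opcode) {
			if (entries[i].fn != fn)
				return -1;   /* conflicting handler */
			entries[i].refcnt++; /* same fn: count it, succeed */
			return 0;
		}
		if (!entries[i].refcnt && !free_slot)
			free_slot = &entries[i];
	}
	if (!free_slot)
		return -1;
	*free_slot = (struct dispatch_entry){ opcode, fn, 1 };
	return 0;
}

static void dispatch_unregister(int opcode)
{
	for (int i = 0; i < MAX_ENTRIES; i++)
		if (entries[i].refcnt && entries[i].opcode == opcode) {
			entries[i].refcnt--;   /* entry dies at zero */
			return;
		}
}
```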


Re: [patch net-next 00/16] mlxsw: Handle changes in GRE configuration

2017-11-03 Thread David Miller
From: Jiri Pirko 
Date: Fri,  3 Nov 2017 10:03:28 +0100

> From: Jiri Pirko 
> 
> Petr says:
> 
> Until now, when an IP tunnel was offloaded by the mlxsw driver, the
> offload was pretty much static, and changes in Linux configuration were
> not reflected in the hardware. That led to discrepancies between traffic
> flows in slow path and fast path. The work-around used to be to remove
> all routes that forward to the netdevice and re-add them. This is
> clearly suboptimal, but actually, as of the decap-only patchset, it's
> not even enough anymore, and one needs to go all the way and simply drop
> the tunnel and recreate it correctly.
> 
> With this patchset, the NETDEV_CHANGE events that are generated for
> changes of up'd tunnel netdevices are captured and interpreted to
> correctly reconfigure the HW in accordance with changes requested at the
> software layer. In addition, NETDEV_CHANGEUPPER, NETDEV_UP and
> NETDEV_DOWN are now handled not only for tunnel devices themselves, but
> also for their bound devices.
 ...

Series applied, thanks Jiri.


Re: [PATCH net-next V2 3/3] tun: add eBPF based queue selection method

2017-11-03 Thread Willem de Bruijn
On Fri, Nov 3, 2017 at 5:56 PM, Willem de Bruijn
 wrote:
> On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
>> This patch introduces an eBPF based queue selection method based on
>> the flow steering policy ops. Userspace could load an eBPF program
>> through TUNSETSTEERINGEBPF. This gives much more flexibility compared
>> to the simple but hard-coded policy in the kernel.
>>
>> Signed-off-by: Jason Wang 
>> ---
>> +static int tun_set_steering_ebpf(struct tun_struct *tun, void __user *data)
>> +{
>> +   struct bpf_prog *prog;
>> +   u32 fd;
>> +
>> +   if (copy_from_user(&fd, data, sizeof(fd)))
>> +   return -EFAULT;
>> +
>> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
>
> If the idea is to allow guests to pass BPF programs down to the host,
> you may want to define a new program type that is more restrictive than
> socket filter.
>
> The external functions allowed for socket filters (sk_filter_func_proto)
> are relatively few (compared to, say, clsact), but may still leak host
> information to a guest. More importantly, guest security considerations
> limit how we can extend socket filters later.

Unless the idea is for the hypervisor to prepare the BPF based on a
limited set of well defined modes that the guest can configure. Then
socket filters are fine, as the BPF is prepared by a regular host process.


[PATCH net-next] tcp: higher throughput under reordering with adaptive RACK reordering wnd

2017-11-03 Thread Priyaranjan Jha
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4). Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is possibility that spurious retransmission was
  due to reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- reo_wnd is incremented at most once per rtt, so that the new DSACK
  we are reacting to is (approximately) due to a spurious retx sent
  after the last reo_wnd update.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
  absolute value to account for change in rtt.

In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).
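The adaptation rules listed above can be sketched in userspace C: the window grows in steps of min_rtt/4 on observed DSACKs (upper bounded by srtt, as noted), persists for TCP_RACK_RECOVERY_THRESH recoveries, and then falls back to the static default. This is an illustrative model, not the tcp_recovery.c implementation.

```c
#include <assert.h>

/* Userspace model of the adaptive RACK reordering window; illustrative
 * only. In the real patch a DSACK widens the window at most once per
 * RTT; that rate limit is omitted here for brevity. */
#define TCP_RACK_RECOVERY_THRESH 16

struct rack_state {
	unsigned reo_wnd_steps;    /* multiples of min_rtt/4, starts at 1 */
	unsigned reo_wnd_persist;  /* recoveries left before reset */
};

static unsigned rack_reo_wnd(const struct rack_state *r,
			     unsigned min_rtt_us, unsigned srtt_us)
{
	unsigned wnd = r->reo_wnd_steps * (min_rtt_us / 4);

	return wnd < srtt_us ? wnd : srtt_us;   /* upper bounded by srtt */
}

/* On a DSACK: widen the window one step and keep it for the next
 * TCP_RACK_RECOVERY_THRESH recoveries. */
static void rack_on_dsack(struct rack_state *r)
{
	r->reo_wnd_steps += 1;
	r->reo_wnd_persist = TCP_RACK_RECOVERY_THRESH;
}

/* At the end of a loss recovery: decay persistence, and fall back to
 * the static default (one step = min_rtt/4) once it runs out. */
static void rack_on_recovery(struct rack_state *r)
{
	if (r->reo_wnd_persist > 0 && --r->reo_wnd_persist == 0)
		r->reo_wnd_steps = 1;
}
```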

Signed-off-by: Priyaranjan Jha 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/linux/tcp.h|  9 +--
 include/net/tcp.h  |  2 ++
 net/ipv4/tcp.c |  1 +
 net/ipv4/tcp_input.c   |  7 +
 net/ipv4/tcp_minisocks.c   |  4 +++
 net/ipv4/tcp_recovery.c| 48 --
 7 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index e6661b205f72..54410a1d4065 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -454,6 +454,7 @@ tcp_recovery - INTEGER
 
RACK: 0x1 enables the RACK loss detection for fast detection of lost
  retransmissions and tail drops.
+   RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
 
Default: 0x1
 
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8c431385b272..22f40c96a15b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -210,8 +210,13 @@ struct tcp_sock {
u64 mstamp; /* (Re)sent time of the skb */
u32 rtt_us;  /* Associated RTT */
u32 end_seq; /* Ending TCP sequence of the skb */
-   u8 advanced; /* mstamp advanced since last lost marking */
-   u8 reord;/* reordering detected */
+   u32 last_delivered; /* tp->delivered at last reo_wnd adj */
+   u8 reo_wnd_steps;   /* Allowed reordering window */
+#define TCP_RACK_RECOVERY_THRESH 16
+   u8 reo_wnd_persist:5, /* No. of recovery since last adj */
+  dsack_seen:1, /* Whether DSACK seen after last adj */
+  advanced:1,   /* mstamp advanced since last lost marking */
+  reord:1;  /* reordering detected */
} rack;
u16 advmss; /* Advertised MSS   */
u32 chrono_start;   /* Start time in jiffies of a TCP chrono */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c2bf2a822b10..babfd4da1515 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -246,6 +246,7 @@ extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 
 #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
+#define TCP_RACK_STATIC_REO_WND  0x2 /* Use static RACK reo wnd */
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -1901,6 +1902,7 @@ extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
 extern void tcp_rack_reo_timeout(struct sock *sk);
+extern void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs);
 
 /* At how many usecs into the future should the RTO fire? */
 static inline s64 tcp_rto_delta_us(const struct sock *sk)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a7a0f316eb86..c4cb19ed4628 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -447,6 +447,7 @@ void tcp_init_sock(struct sock *sk)
tcp_assign_congestion_control(sk);
 
tp->tsoffset = 0;
+   tp->rack.reo_wnd_steps = 1;
 
sk->sk_state = TCP_CLOSE;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b62a7d1707ae..b5af65d6a891 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -855,6 +855,7 @@ void tcp_disable_fack(struct tcp_sock *tp)
 static void tcp_dsack_seen(struct tcp_sock *tp)
 {

Re: [PATCH net-next 05/11] net: dsa: provide a find or new tree helper

2017-11-03 Thread Florian Fainelli
On 11/03/2017 04:05 PM, Vivien Didelot wrote:
> Rename dsa_get_dst to dsa_tree_find since it doesn't increment the
> reference counter, rename dsa_add_dst to dsa_tree_alloc for symmetry
> with dsa_tree_free, and provide a convenient dsa_tree_touch function to
> find or allocate a new tree.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 04/11] net: dsa: get and put tree reference counting

2017-11-03 Thread Florian Fainelli
On 11/03/2017 04:05 PM, Vivien Didelot wrote:
> Provide convenient dsa_tree_get and dsa_tree_put functions scoping a DSA
> tree used to increment and decrement its reference counter, instead of
> poking directly its kref structure.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 03/11] net: dsa: simplify tree reference counting

2017-11-03 Thread Florian Fainelli
On 11/03/2017 04:05 PM, Vivien Didelot wrote:
> DSA trees have a refcount used to automatically free the dsa_switch_tree
> structure once there are no switch devices inside of it.
> 
> The refcount is incremented when a switch is added to the tree, and
> decremented when it is removed from it.
> 
> But because of kref_init, the refcount is also incremented at
> initialization, and when looking up the tree from the list for symmetry.
> 
> Thus the current code stores the number of switches plus one, and makes
> the switch registration more complex.
> 
> To simplify the switch registration function, we reset the refcount to
> zero after initialization and don't increment it when looking up a tree.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 02/11] net: dsa: make tree index unsigned

2017-11-03 Thread Florian Fainelli
On 11/03/2017 04:05 PM, Vivien Didelot wrote:
> Similarly to a DSA switch and port, rename the tree index from "tree" to
> "index" and make it an unsigned int because it isn't supposed to be less
> than 0.
> 
> u32 is an OF-specific type used to retrieve the value and there is no
> need to propagate it up to the tree index.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 01/11] net: dsa: make switch index unsigned

2017-11-03 Thread Florian Fainelli
On 11/03/2017 04:05 PM, Vivien Didelot wrote:
> Define the DSA switch index as an unsigned int, because it will never be
> less than 0.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net-next 03/11] net: dsa: simplify tree reference counting

2017-11-03 Thread Vivien Didelot
DSA trees have a refcount used to automatically free the dsa_switch_tree
structure once there are no switch devices inside of it.

The refcount is incremented when a switch is added to the tree, and
decremented when it is removed from it.

But because of kref_init, the refcount is also incremented at
initialization, and when looking up the tree from the list for symmetry.

Thus the current code stores the number of switches plus one, and makes
the switch registration more complex.

To simplify the switch registration function, we reset the refcount to
zero after initialization and don't increment it when looking up a tree.
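The lifetime rule this patch establishes - the refcount stores exactly the number of switches, and the tree is released when the last one is removed - can be modeled with a plain counter. This is a userspace illustration, not the DSA code; the `released` field stands in for the kref release callback.

```c
#include <assert.h>

/* Userspace model of the simplified tree lifetime; illustrative only. */
struct tree {
	int refcount;   /* number of switches, no longer "switches + 1" */
	int released;   /* stands in for the kref release callback */
};

static void tree_init(struct tree *t)
{
	/* kref_init() would set this to 1; the patch resets it to 0 */
	t->refcount = 0;
	t->released = 0;
}

static void tree_add_switch(struct tree *t)
{
	t->refcount++;           /* taken when a switch is added */
}

static void tree_remove_switch(struct tree *t)
{
	if (--t->refcount == 0)
		t->released = 1; /* dsa_free_dst() would kfree() here */
}
```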

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 8b68dc2f5707..d3f1a7607463 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -32,10 +32,9 @@ static struct dsa_switch_tree *dsa_get_dst(unsigned int 
index)
struct dsa_switch_tree *dst;
 
	list_for_each_entry(dst, &dsa_switch_trees, list)
-   if (dst->index == index) {
-			kref_get(&dst->refcount);
+   if (dst->index == index)
return dst;
-   }
+
return NULL;
 }
 
@@ -48,11 +47,6 @@ static void dsa_free_dst(struct kref *ref)
kfree(dst);
 }
 
-static void dsa_put_dst(struct dsa_switch_tree *dst)
-{
-	kref_put(&dst->refcount, dsa_free_dst);
-}
-
 static struct dsa_switch_tree *dsa_add_dst(unsigned int index)
 {
struct dsa_switch_tree *dst;
@@ -63,7 +57,10 @@ static struct dsa_switch_tree *dsa_add_dst(unsigned int 
index)
dst->index = index;
	INIT_LIST_HEAD(&dst->list);
	list_add_tail(&dsa_switch_trees, &dst->list);
+
+   /* Initialize the reference counter to the number of switches, not 1 */
	kref_init(&dst->refcount);
+	refcount_set(&dst->refcount.refcount, 0);
 
return dst;
 }
@@ -739,10 +736,8 @@ static int _dsa_register_switch(struct dsa_switch *ds)
return -ENOMEM;
}
 
-   if (dst->ds[index]) {
-   err = -EBUSY;
-   goto out;
-   }
+   if (dst->ds[index])
+   return -EBUSY;
 
ds->dst = dst;
ds->index = index;
@@ -758,11 +753,9 @@ static int _dsa_register_switch(struct dsa_switch *ds)
if (err < 0)
goto out_del_dst;
 
-   if (err == 1) {
-   /* Not all switches registered yet */
-   err = 0;
-   goto out;
-   }
+   /* Not all switches registered yet */
+   if (err == 1)
+   return 0;
 
if (dst->applied) {
pr_info("DSA: Disjoint trees?\n");
@@ -779,13 +772,10 @@ static int _dsa_register_switch(struct dsa_switch *ds)
goto out_del_dst;
}
 
-   dsa_put_dst(dst);
return 0;
 
 out_del_dst:
dsa_dst_del_ds(dst, ds, ds->index);
-out:
-   dsa_put_dst(dst);
 
return err;
 }
-- 
2.14.3



[PATCH net-next 01/11] net: dsa: make switch index unsigned

2017-11-03 Thread Vivien Didelot
Define the DSA switch index as an unsigned int, because it will never be
less than 0.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 50e276dc4c01..fa1c21ab8092 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -209,7 +209,7 @@ struct dsa_switch {
 * Parent switch tree, and switch index.
 */
struct dsa_switch_tree  *dst;
-   int index;
+   unsigned intindex;
 
/* Listener for switch fabric events */
struct notifier_block   nb;
-- 
2.14.3



[PATCH net-next 00/11] net: dsa: parsing stage

2017-11-03 Thread Vivien Didelot
When registering a DSA switch, there are basically two stages.

The first stage is the parsing of the switch device, from either device
tree or platform data. It fetches the DSA tree to which it belongs, and
validates its ports. The switch device is then added to the tree, and
the second stage is called if this was the last switch of the tree.

The second stage is the setup of the tree, which validates that the tree
is complete, sets up the routing tables, the default CPU port for user
ports, sets up the switch drivers and finally the master interfaces,
which makes the whole switch fabric functional.

This patch series covers the first parsing stage. It fixes the type of
the switch and tree indexes to unsigned int, simplifies the tree
reference counting and the switch and CPU ports parsing.

Vivien Didelot (11):
  net: dsa: make switch index unsigned
  net: dsa: make tree index unsigned
  net: dsa: simplify tree reference counting
  net: dsa: get and put tree reference counting
  net: dsa: provide a find or new tree helper
  net: dsa: rework switch addition and removal
  net: dsa: get tree before parsing ports
  net: dsa: rework switch parsing
  net: dsa: only check presence of link property
  net: dsa: add one port parsing function per type
  net: dsa: resolve tagging protocol at parse time

 include/net/dsa.h |   4 +-
 net/dsa/dsa2.c| 323 ++
 net/dsa/slave.c   |   2 +-
 3 files changed, 184 insertions(+), 145 deletions(-)

-- 
2.14.3



[PATCH net-next 02/11] net: dsa: make tree index unsigned

2017-11-03 Thread Vivien Didelot
Similarly to a DSA switch and port, rename the tree index from "tree" to
"index" and make it an unsigned int because it isn't supposed to be less
than 0.

u32 is an OF-specific type used to retrieve the value and there is no
need to propagate it up to the tree index.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h |  2 +-
 net/dsa/dsa2.c| 14 +++---
 net/dsa/slave.c   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index fa1c21ab8092..e54332968417 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -116,7 +116,7 @@ struct dsa_switch_tree {
struct raw_notifier_headnh;
 
/* Tree identifier */
-   u32 tree;
+   unsigned int index;
 
/* Number of switches attached to this tree */
struct kref refcount;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 797d1156b4e6..8b68dc2f5707 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -27,12 +27,12 @@ static DEFINE_MUTEX(dsa2_mutex);
 static const struct devlink_ops dsa_devlink_ops = {
 };
 
-static struct dsa_switch_tree *dsa_get_dst(u32 tree)
+static struct dsa_switch_tree *dsa_get_dst(unsigned int index)
 {
struct dsa_switch_tree *dst;
 
	list_for_each_entry(dst, &dsa_switch_trees, list)
-   if (dst->tree == tree) {
+   if (dst->index == index) {
			kref_get(&dst->refcount);
return dst;
}
@@ -53,14 +53,14 @@ static void dsa_put_dst(struct dsa_switch_tree *dst)
	kref_put(&dst->refcount, dsa_free_dst);
 }
 
-static struct dsa_switch_tree *dsa_add_dst(u32 tree)
+static struct dsa_switch_tree *dsa_add_dst(unsigned int index)
 {
struct dsa_switch_tree *dst;
 
dst = kzalloc(sizeof(*dst), GFP_KERNEL);
if (!dst)
return NULL;
-   dst->tree = tree;
+   dst->index = index;
	INIT_LIST_HEAD(&dst->list);
	list_add_tail(&dsa_switch_trees, &dst->list);
	kref_init(&dst->refcount);
@@ -454,7 +454,7 @@ static void dsa_dst_unapply(struct dsa_switch_tree *dst)
 
dst->cpu_dp = NULL;
 
-   pr_info("DSA: tree %d unapplied\n", dst->tree);
+   pr_info("DSA: tree %d unapplied\n", dst->index);
dst->applied = false;
 }
 
@@ -504,7 +504,7 @@ static int dsa_ds_parse(struct dsa_switch_tree *dst, struct 
dsa_switch *ds)
 
}
 
-   pr_info("DSA: switch %d %d parsed\n", dst->tree, ds->index);
+   pr_info("DSA: switch %d %d parsed\n", dst->index, ds->index);
 
return 0;
 }
@@ -549,7 +549,7 @@ static int dsa_dst_parse(struct dsa_switch_tree *dst)
}
}
 
-   pr_info("DSA: tree %d parsed\n", dst->tree);
+   pr_info("DSA: tree %d parsed\n", dst->index);
 
return 0;
 }
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9b75d0ac4092..814ced75a0cc 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -55,7 +55,7 @@ void dsa_slave_mii_bus_init(struct dsa_switch *ds)
ds->slave_mii_bus->read = dsa_slave_phy_read;
ds->slave_mii_bus->write = dsa_slave_phy_write;
snprintf(ds->slave_mii_bus->id, MII_BUS_ID_SIZE, "dsa-%d.%d",
-ds->dst->tree, ds->index);
+ds->dst->index, ds->index);
ds->slave_mii_bus->parent = ds->dev;
ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask;
 }
-- 
2.14.3



[PATCH net-next 06/11] net: dsa: rework switch addition and removal

2017-11-03 Thread Vivien Didelot
This patch removes the unnecessary index argument from the
dsa_dst_add_ds and dsa_dst_del_ds functions and renames them to
dsa_tree_add_switch and dsa_tree_remove_switch respectively.

In addition to a more explicit scope, we now check the presence of an
existing switch with the same index directly within dsa_tree_add_switch.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 47 +++
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index bda222cfc02c..5b6a3dad8015 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -94,20 +94,6 @@ static void dsa_tree_put(struct dsa_switch_tree *dst)
	kref_put(&dst->refcount, dsa_tree_release);
 }
 
-static void dsa_dst_add_ds(struct dsa_switch_tree *dst,
-  struct dsa_switch *ds, u32 index)
-{
-   dsa_tree_get(dst);
-   dst->ds[index] = ds;
-}
-
-static void dsa_dst_del_ds(struct dsa_switch_tree *dst,
-  struct dsa_switch *ds, u32 index)
-{
-   dst->ds[index] = NULL;
-   dsa_tree_put(dst);
-}
-
 /* For platform data configurations, we need to have a valid name argument to
  * differentiate a disabled port from an enabled one
  */
@@ -484,6 +470,27 @@ static void dsa_dst_unapply(struct dsa_switch_tree *dst)
dst->applied = false;
 }
 
+static void dsa_tree_remove_switch(struct dsa_switch_tree *dst,
+  unsigned int index)
+{
+   dst->ds[index] = NULL;
+   dsa_tree_put(dst);
+}
+
+static int dsa_tree_add_switch(struct dsa_switch_tree *dst,
+  struct dsa_switch *ds)
+{
+   unsigned int index = ds->index;
+
+   if (dst->ds[index])
+   return -EBUSY;
+
+   dsa_tree_get(dst);
+   dst->ds[index] = ds;
+
+   return 0;
+}
+
 static int dsa_cpu_parse(struct dsa_port *port, u32 index,
 struct dsa_switch_tree *dst,
 struct dsa_switch *ds)
@@ -762,9 +769,6 @@ static int _dsa_register_switch(struct dsa_switch *ds)
if (!dst)
return -ENOMEM;
 
-   if (dst->ds[index])
-   return -EBUSY;
-
ds->dst = dst;
ds->index = index;
ds->cd = pdata;
@@ -773,7 +777,9 @@ static int _dsa_register_switch(struct dsa_switch *ds)
for (i = 0; i < DSA_MAX_SWITCHES; ++i)
ds->rtable[i] = DSA_RTABLE_NONE;
 
-   dsa_dst_add_ds(dst, ds, index);
+   err = dsa_tree_add_switch(dst, ds);
+   if (err)
+   return err;
 
err = dsa_dst_complete(dst);
if (err < 0)
@@ -801,7 +807,7 @@ static int _dsa_register_switch(struct dsa_switch *ds)
return 0;
 
 out_del_dst:
-   dsa_dst_del_ds(dst, ds, ds->index);
+   dsa_tree_remove_switch(dst, index);
 
return err;
 }
@@ -843,10 +849,11 @@ EXPORT_SYMBOL_GPL(dsa_register_switch);
 static void _dsa_unregister_switch(struct dsa_switch *ds)
 {
struct dsa_switch_tree *dst = ds->dst;
+   unsigned int index = ds->index;
 
dsa_dst_unapply(dst);
 
-   dsa_dst_del_ds(dst, ds, ds->index);
+   dsa_tree_remove_switch(dst, index);
 }
 
 void dsa_unregister_switch(struct dsa_switch *ds)
-- 
2.14.3



[PATCH net-next 05/11] net: dsa: provide a find or new tree helper

2017-11-03 Thread Vivien Didelot
Rename dsa_get_dst to dsa_tree_find since it doesn't increment the
reference counter, rename dsa_add_dst to dsa_tree_alloc for symmetry
with dsa_tree_free, and provide a convenient dsa_tree_touch function to
find or allocate a new tree.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 609d92684505..bda222cfc02c 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -21,33 +21,35 @@
 
 #include "dsa_priv.h"
 
-static LIST_HEAD(dsa_switch_trees);
+static LIST_HEAD(dsa_tree_list);
 static DEFINE_MUTEX(dsa2_mutex);
 
 static const struct devlink_ops dsa_devlink_ops = {
 };
 
-static struct dsa_switch_tree *dsa_get_dst(unsigned int index)
+static struct dsa_switch_tree *dsa_tree_find(int index)
 {
struct dsa_switch_tree *dst;
 
-	list_for_each_entry(dst, &dsa_switch_trees, list)
+	list_for_each_entry(dst, &dsa_tree_list, list)
if (dst->index == index)
return dst;
 
return NULL;
 }
 
-static struct dsa_switch_tree *dsa_add_dst(unsigned int index)
+static struct dsa_switch_tree *dsa_tree_alloc(int index)
 {
struct dsa_switch_tree *dst;
 
dst = kzalloc(sizeof(*dst), GFP_KERNEL);
if (!dst)
return NULL;
+
dst->index = index;
+
	INIT_LIST_HEAD(&dst->list);
-	list_add_tail(&dsa_switch_trees, &dst->list);
+	list_add_tail(&dsa_tree_list, &dst->list);
 
/* Initialize the reference counter to the number of switches, not 1 */
	kref_init(&dst->refcount);
@@ -62,6 +64,17 @@ static void dsa_tree_free(struct dsa_switch_tree *dst)
kfree(dst);
 }
 
+static struct dsa_switch_tree *dsa_tree_touch(int index)
+{
+   struct dsa_switch_tree *dst;
+
+   dst = dsa_tree_find(index);
+   if (!dst)
+   dst = dsa_tree_alloc(index);
+
+   return dst;
+}
+
 static void dsa_tree_get(struct dsa_switch_tree *dst)
 {
	kref_get(&dst->refcount);
@@ -745,12 +758,9 @@ static int _dsa_register_switch(struct dsa_switch *ds)
return err;
}
 
-   dst = dsa_get_dst(tree);
-   if (!dst) {
-   dst = dsa_add_dst(tree);
-   if (!dst)
-   return -ENOMEM;
-   }
+   dst = dsa_tree_touch(tree);
+   if (!dst)
+   return -ENOMEM;
 
if (dst->ds[index])
return -EBUSY;
-- 
2.14.3



[PATCH net-next 10/11] net: dsa: add one port parsing function per type

2017-11-03 Thread Vivien Didelot
Add dsa_port_parse_user, dsa_port_parse_dsa and dsa_port_parse_cpu
functions to factorize the code shared by both OF and pdata parsing.

They don't do much for the moment but will be extended later to support
tagging protocol resolution for example.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 56 
 1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 06bcdb6bc796..271a97ef5bf6 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -491,6 +491,32 @@ static int dsa_tree_add_switch(struct dsa_switch_tree *dst,
return 0;
 }
 
+static int dsa_port_parse_user(struct dsa_port *dp, const char *name)
+{
+   if (!name)
+   name = "eth%d";
+
+   dp->type = DSA_PORT_TYPE_USER;
+   dp->name = name;
+
+   return 0;
+}
+
+static int dsa_port_parse_dsa(struct dsa_port *dp)
+{
+   dp->type = DSA_PORT_TYPE_DSA;
+
+   return 0;
+}
+
+static int dsa_port_parse_cpu(struct dsa_port *dp, struct net_device *master)
+{
+   dp->type = DSA_PORT_TYPE_CPU;
+   dp->master = master;
+
+   return 0;
+}
+
 static int dsa_cpu_parse(struct dsa_port *port, u32 index,
 struct dsa_switch_tree *dst,
 struct dsa_switch *ds)
@@ -593,6 +619,8 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
const char *name = of_get_property(dn, "label", NULL);
bool link = of_property_read_bool(dn, "link");
 
+   dp->dn = dn;
+
if (ethernet) {
struct net_device *master;
 
@@ -600,21 +628,13 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
if (!master)
return -EPROBE_DEFER;
 
-   dp->type = DSA_PORT_TYPE_CPU;
-   dp->master = master;
-   } else if (link) {
-   dp->type = DSA_PORT_TYPE_DSA;
-   } else {
-   if (!name)
-   name = "eth%d";
-
-   dp->type = DSA_PORT_TYPE_USER;
-   dp->name = name;
+   return dsa_port_parse_cpu(dp, master);
}
 
-   dp->dn = dn;
+   if (link)
+   return dsa_port_parse_dsa(dp);
 
-   return 0;
+   return dsa_port_parse_user(dp, name);
 }
 
 static int dsa_switch_parse_ports_of(struct dsa_switch *ds,
@@ -694,17 +714,13 @@ static int dsa_port_parse(struct dsa_port *dp, const char *name,
 
dev_put(master);
 
-   dp->type = DSA_PORT_TYPE_CPU;
-   dp->master = master;
-   } else if (!strcmp(name, "dsa")) {
-   dp->type = DSA_PORT_TYPE_DSA;
-   } else {
-   dp->type = DSA_PORT_TYPE_USER;
+   return dsa_port_parse_cpu(dp, master);
}
 
-   dp->name = name;
+   if (!strcmp(name, "dsa"))
+   return dsa_port_parse_dsa(dp);
 
-   return 0;
+   return dsa_port_parse_user(dp, name);
 }
 
 static int dsa_switch_parse_ports(struct dsa_switch *ds,
-- 
2.14.3
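
The helpers introduced in this patch are small per-type setters plus a dispatch on the port description. A userspace C sketch of the same split, assuming a simplified `port` struct and the name-based dispatch used by the pdata path:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for the DSA port types and struct dsa_port. */
enum port_type { PORT_TYPE_UNUSED, PORT_TYPE_CPU, PORT_TYPE_DSA, PORT_TYPE_USER };

struct port {
    enum port_type type;
    const char *name;
};

/* One small parse function per port type, as in the patch above. */
static int port_parse_user(struct port *p, const char *name)
{
    if (!name)
        name = "eth%d";    /* fall back to a default interface name */
    p->type = PORT_TYPE_USER;
    p->name = name;
    return 0;
}

static int port_parse_dsa(struct port *p)
{
    p->type = PORT_TYPE_DSA;
    return 0;
}

static int port_parse_cpu(struct port *p)
{
    p->type = PORT_TYPE_CPU;
    return 0;
}

/* Name-based dispatch, mirroring the pdata path of dsa_port_parse(). */
static int port_parse(struct port *p, const char *name)
{
    if (!strcmp(name, "cpu"))
        return port_parse_cpu(p);
    if (!strcmp(name, "dsa"))
        return port_parse_dsa(p);
    return port_parse_user(p, name);
}
```

Factoring the setters out like this is what later lets the CPU variant grow tagging-protocol resolution without touching either caller.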



[PATCH net-next 07/11] net: dsa: get tree before parsing ports

2017-11-03 Thread Vivien Didelot
We will need a reference to the dsa_switch_tree when parsing a CPU port,
so fetch it right after parsing the member and before parsing ports.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 5b6a3dad8015..5918fbddb0ab 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -751,18 +751,10 @@ static int _dsa_register_switch(struct dsa_switch *ds)
err = dsa_parse_member_dn(np, &tree, &index);
if (err)
return err;
-
-   err = dsa_parse_ports_of(np, ds);
-   if (err)
-   return err;
} else {
err = dsa_parse_member(pdata, &tree, &index);
if (err)
return err;
-
-   err = dsa_parse_ports(pdata, ds);
-   if (err)
-   return err;
}
 
dst = dsa_tree_touch(tree);
@@ -773,6 +765,16 @@ static int _dsa_register_switch(struct dsa_switch *ds)
ds->index = index;
ds->cd = pdata;
 
+   if (np) {
+   err = dsa_parse_ports_of(np, ds);
+   if (err)
+   return err;
+   } else {
+   err = dsa_parse_ports(pdata, ds);
+   if (err)
+   return err;
+   }
+
/* Initialize the routing table */
for (i = 0; i < DSA_MAX_SWITCHES; ++i)
ds->rtable[i] = DSA_RTABLE_NONE;
-- 
2.14.3



[PATCH net-next 11/11] net: dsa: resolve tagging protocol at parse time

2017-11-03 Thread Vivien Didelot
Extend the dsa_port_parse_cpu() function to resolve the tagging protocol
at port parsing time, instead of waiting for the whole tree to be
complete.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 33 -
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 271a97ef5bf6..283104e5ca6a 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -511,22 +511,11 @@ static int dsa_port_parse_dsa(struct dsa_port *dp)
 
 static int dsa_port_parse_cpu(struct dsa_port *dp, struct net_device *master)
 {
-   dp->type = DSA_PORT_TYPE_CPU;
-   dp->master = master;
-
-   return 0;
-}
-
-static int dsa_cpu_parse(struct dsa_port *port, u32 index,
-struct dsa_switch_tree *dst,
-struct dsa_switch *ds)
-{
+   struct dsa_switch *ds = dp->ds;
+   struct dsa_switch_tree *dst = ds->dst;
const struct dsa_device_ops *tag_ops;
enum dsa_tag_protocol tag_protocol;
 
-   if (!dst->cpu_dp)
-   dst->cpu_dp = port;
-
tag_protocol = ds->ops->get_tag_protocol(ds);
tag_ops = dsa_resolve_tag_protocol(tag_protocol);
if (IS_ERR(tag_ops)) {
@@ -534,11 +523,21 @@ static int dsa_cpu_parse(struct dsa_port *port, u32 index,
return PTR_ERR(tag_ops);
}
 
-   dst->cpu_dp->tag_ops = tag_ops;
+   dp->type = DSA_PORT_TYPE_CPU;
+   dp->rcv = tag_ops->rcv;
+   dp->tag_ops = tag_ops;
+   dp->master = master;
+   dp->dst = dst;
 
-   /* Make a few copies for faster access in master receive hot path */
-   dst->cpu_dp->rcv = dst->cpu_dp->tag_ops->rcv;
-   dst->cpu_dp->dst = dst;
+   return 0;
+}
+
+static int dsa_cpu_parse(struct dsa_port *port, u32 index,
+struct dsa_switch_tree *dst,
+struct dsa_switch *ds)
+{
+   if (!dst->cpu_dp)
+   dst->cpu_dp = port;
 
return 0;
 }
-- 
2.14.3



[PATCH net-next 09/11] net: dsa: only check presence of link property

2017-11-03 Thread Vivien Didelot
When parsing a port, simply use of_property_read_bool which checks the
presence of a given property, instead of parsing the link phandle.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index dfcb6247f2f2..06bcdb6bc796 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -590,8 +590,8 @@ static int dsa_dst_parse(struct dsa_switch_tree *dst)
 static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
 {
struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0);
-   struct device_node *link = of_parse_phandle(dn, "link", 0);
const char *name = of_get_property(dn, "label", NULL);
+   bool link = of_property_read_bool(dn, "link");
 
if (ethernet) {
struct net_device *master;
-- 
2.14.3



[PATCH net-next 04/11] net: dsa: get and put tree reference counting

2017-11-03 Thread Vivien Didelot
Provide convenient dsa_tree_get and dsa_tree_put functions, scoped to a
DSA tree, to increment and decrement its reference counter, instead of
poking its kref structure directly.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 40 
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index d3f1a7607463..609d92684505 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -38,15 +38,6 @@ static struct dsa_switch_tree *dsa_get_dst(unsigned int index)
return NULL;
 }
 
-static void dsa_free_dst(struct kref *ref)
-{
-   struct dsa_switch_tree *dst = container_of(ref, struct dsa_switch_tree,
-  refcount);
-
-   list_del(&dst->list);
-   kfree(dst);
-}
-
 static struct dsa_switch_tree *dsa_add_dst(unsigned int index)
 {
struct dsa_switch_tree *dst;
@@ -65,10 +56,35 @@ static struct dsa_switch_tree *dsa_add_dst(unsigned int index)
return dst;
 }
 
-static void dsa_dst_add_ds(struct dsa_switch_tree *dst,
-  struct dsa_switch *ds, u32 index)
+static void dsa_tree_free(struct dsa_switch_tree *dst)
+{
+   list_del(&dst->list);
+   kfree(dst);
+}
+
+static void dsa_tree_get(struct dsa_switch_tree *dst)
 {
kref_get(&dst->refcount);
+}
+
+static void dsa_tree_release(struct kref *ref)
+{
+   struct dsa_switch_tree *dst;
+
+   dst = container_of(ref, struct dsa_switch_tree, refcount);
+
+   dsa_tree_free(dst);
+}
+
+static void dsa_tree_put(struct dsa_switch_tree *dst)
+{
+   kref_put(&dst->refcount, dsa_tree_release);
+}
+
+static void dsa_dst_add_ds(struct dsa_switch_tree *dst,
+  struct dsa_switch *ds, u32 index)
+{
+   dsa_tree_get(dst);
dst->ds[index] = ds;
 }
 
@@ -76,7 +92,7 @@ static void dsa_dst_del_ds(struct dsa_switch_tree *dst,
   struct dsa_switch *ds, u32 index)
 {
dst->ds[index] = NULL;
-   kref_put(&dst->refcount, dsa_free_dst);
+   dsa_tree_put(dst);
 }
 
 /* For platform data configurations, we need to have a valid name argument to
-- 
2.14.3
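
The get/put wrappers above are a thin layer over the kernel's kref API. A self-contained userspace sketch of the same pattern, with a bare counter standing in for struct kref and a hypothetical `tree` struct standing in for struct dsa_switch_tree:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct dsa_switch_tree. */
struct tree {
    int index;
    unsigned int refcount;
};

static int trees_freed;  /* lets the caller observe the release path */

static struct tree *tree_alloc(int index)
{
    struct tree *t = calloc(1, sizeof(*t));

    if (!t)
        return NULL;
    t->index = index;
    t->refcount = 1;     /* like kref_init() */
    return t;
}

static void tree_get(struct tree *t)
{
    t->refcount++;       /* like kref_get() */
}

static void tree_put(struct tree *t)
{
    /* like kref_put(): run the release callback only when the last
     * reference is dropped */
    if (--t->refcount == 0) {
        trees_freed++;
        free(t);
    }
}
```

Hiding kref behind tree_get/tree_put means callers such as dsa_dst_add_ds never touch the refcount field or the release callback directly, which is exactly what the patch achieves.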



[PATCH net-next 08/11] net: dsa: rework switch parsing

2017-11-03 Thread Vivien Didelot
When parsing a switch, we have to identify to which tree it belongs and
parse its ports. Provide two functions to separate the OF and platform
data specific paths.

Also use the of_property_read_variable_u32_array function to parse the
OF member array instead of calling of_property_read_u32_index twice.

Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 117 -
 1 file changed, 58 insertions(+), 59 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 5918fbddb0ab..dfcb6247f2f2 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -617,7 +617,8 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct 
device_node *dn)
return 0;
 }
 
-static int dsa_parse_ports_of(struct device_node *dn, struct dsa_switch *ds)
+static int dsa_switch_parse_ports_of(struct dsa_switch *ds,
+struct device_node *dn)
 {
struct device_node *ports, *port;
struct dsa_port *dp;
@@ -648,6 +649,39 @@ static int dsa_parse_ports_of(struct device_node *dn, struct dsa_switch *ds)
return 0;
 }
 
+static int dsa_switch_parse_member_of(struct dsa_switch *ds,
+ struct device_node *dn)
+{
+   u32 m[2] = { 0, 0 };
+   int sz;
+
+   /* Don't error out if this optional property isn't found */
+   sz = of_property_read_variable_u32_array(dn, "dsa,member", m, 2, 2);
+   if (sz < 0 && sz != -EINVAL)
+   return sz;
+
+   ds->index = m[1];
+   if (ds->index >= DSA_MAX_SWITCHES)
+   return -EINVAL;
+
+   ds->dst = dsa_tree_touch(m[0]);
+   if (!ds->dst)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static int dsa_switch_parse_of(struct dsa_switch *ds, struct device_node *dn)
+{
+   int err;
+
+   err = dsa_switch_parse_member_of(ds, dn);
+   if (err)
+   return err;
+
+   return dsa_switch_parse_ports_of(ds, dn);
+}
+
 static int dsa_port_parse(struct dsa_port *dp, const char *name,
  struct device *dev)
 {
@@ -673,7 +707,8 @@ static int dsa_port_parse(struct dsa_port *dp, const char *name,
return 0;
 }
 
-static int dsa_parse_ports(struct dsa_chip_data *cd, struct dsa_switch *ds)
+static int dsa_switch_parse_ports(struct dsa_switch *ds,
+ struct dsa_chip_data *cd)
 {
bool valid_name_found = false;
struct dsa_port *dp;
@@ -703,40 +738,19 @@ static int dsa_parse_ports(struct dsa_chip_data *cd, struct dsa_switch *ds)
return 0;
 }
 
-static int dsa_parse_member_dn(struct device_node *np, u32 *tree, u32 *index)
+static int dsa_switch_parse(struct dsa_switch *ds, struct dsa_chip_data *cd)
 {
-   int err;
+   ds->cd = cd;
 
-   *tree = *index = 0;
+   /* We don't support interconnected switches nor multiple trees via
+* platform data, so this is the unique switch of the tree.
+*/
+   ds->index = 0;
+   ds->dst = dsa_tree_touch(0);
+   if (!ds->dst)
+   return -ENOMEM;
 
-   err = of_property_read_u32_index(np, "dsa,member", 0, tree);
-   if (err) {
-   /* Does not exist, but it is optional */
-   if (err == -EINVAL)
-   return 0;
-   return err;
-   }
-
-   err = of_property_read_u32_index(np, "dsa,member", 1, index);
-   if (err)
-   return err;
-
-   if (*index >= DSA_MAX_SWITCHES)
-   return -EINVAL;
-
-   return 0;
-}
-
-static int dsa_parse_member(struct dsa_chip_data *pd, u32 *tree, u32 *index)
-{
-   if (!pd)
-   return -ENODEV;
-
-   /* We do not support complex trees with dsa_chip_data */
-   *tree = 0;
-   *index = 0;
-
-   return 0;
+   return dsa_switch_parse_ports(ds, cd);
 }
 
 static int _dsa_register_switch(struct dsa_switch *ds)
@@ -744,36 +758,21 @@ static int _dsa_register_switch(struct dsa_switch *ds)
struct dsa_chip_data *pdata = ds->dev->platform_data;
struct device_node *np = ds->dev->of_node;
struct dsa_switch_tree *dst;
-   u32 tree, index;
+   unsigned int index;
int i, err;
 
-   if (np) {
-   err = dsa_parse_member_dn(np, &tree, &index);
-   if (err)
-   return err;
-   } else {
-   err = dsa_parse_member(pdata, &tree, &index);
-   if (err)
-   return err;
-   }
+   if (np)
+   err = dsa_switch_parse_of(ds, np);
+   else if (pdata)
+   err = dsa_switch_parse(ds, pdata);
+   else
+   err = -ENODEV;
 
-   dst = dsa_tree_touch(tree);
-   if (!dst)
-   return -ENOMEM;
+   if (err)
+   return err;
 
-   ds->dst = dst;
-   ds->index = index;
-   ds->cd = pdata;
-
-   if (np) {
-   err = dsa_parse_ports_of(np, ds);
-   
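
The member parsing rewritten above reads an optional two-cell "dsa,member" property with { 0, 0 } defaults. A userspace C sketch of the same logic, where `read_u32_array` is a hypothetical stand-in for of_property_read_variable_u32_array():

```c
#include <assert.h>
#include <errno.h>

#define DSA_MAX_SWITCHES 4

/* Hypothetical stand-in for of_property_read_variable_u32_array():
 * returns the element count, -EINVAL when the property is absent,
 * or -EOVERFLOW when it is shorter than requested. */
static int read_u32_array(const unsigned int *prop, int prop_sz,
                          unsigned int *out, int sz)
{
    int i;

    if (!prop)
        return -EINVAL;
    if (prop_sz < sz)
        return -EOVERFLOW;
    for (i = 0; i < sz; i++)
        out[i] = prop[i];
    return sz;
}

/* Mirrors dsa_switch_parse_member_of(): { tree, index } defaults to
 * { 0, 0 } when the optional property is missing, and the switch
 * index is bounds-checked. */
static int parse_member(const unsigned int *prop, int prop_sz,
                        unsigned int *tree, unsigned int *index)
{
    unsigned int m[2] = { 0, 0 };
    int sz;

    sz = read_u32_array(prop, prop_sz, m, 2);
    if (sz < 0 && sz != -EINVAL)
        return sz;            /* real error; -EINVAL just means absent */

    *tree = m[0];
    *index = m[1];
    if (*index >= DSA_MAX_SWITCHES)
        return -EINVAL;
    return 0;
}
```

Reading the whole array in one call is what lets the patch drop the two separate of_property_read_u32_index calls while keeping the "absent property is not an error" behavior.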

Re: [PATCH net-next] tcp: tcp_mtu_probing() cleanup

2017-11-03 Thread Neal Cardwell
On Fri, Nov 3, 2017 at 9:09 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> Reduce one indentation level to make code more readable.
> tcp_sync_mss() can be factorized.
>
> Signed-off-by: Eric Dumazet 
> ---
>  net/ipv4/tcp_timer.c |   31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)

Acked-by: Neal Cardwell 

Thanks, Eric!

neal


Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-03 Thread Jarod Wilson

On 2017-11-03 3:30 PM, Alex Sidorenko wrote:
Indeed, we do not print slave's ->link_new_state on each entry - so it 
is quite possible that we are at stage 6.


It is even possible that this has something to do with how NM initially
created the bonds. The customer says that the problem occurs only once,
after host reboot; after that, failover works fine no matter how many
times he changes the state of the VirtualConnect modules.

Jarod,

could you please add printing slave->link_new_state for both slaves at each
entry to bond_miimon_inspect?

(and instead of nudging slave->new_link like I suggested, use Jay's patch).


Will do, test build is just about ready here.

--
Jarod Wilson
ja...@redhat.com


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 14:28 -0400, Vitaly Davidovich wrote:

> So Eric, while I still have your interest here (although I know it's
> waning :)), any code pointers to where I might look to see if a
> specific small-ish rcv buf size may interact poorly with the rest of
> the stack? Is it possible some buffer was starved in the client stack
> which prevented it from sending any segments to the server? Maybe the
> incoming retrans were actually dropped somewhere in the ingress pkt
> processing and so the stack doesn't know it needs to react to
> something? Pulling at straws here but clearly the recv buf size, and a
> somewhat small one at that, has some play.
> 
> I checked dmesg (just in case something would pop up there) but didn't
> observe any warnings or anything interesting.

I believe you could reproduce the issue with packetdrill.

If you can provide a packetdrill file demonstrating the issue, that
would be awesome ;)





[PATCH net-next] liquidio: do not consider packets dropped by network stack as driver Rx dropped

2017-11-03 Thread Felix Manlunas
From: Intiyaz Basha 

netdev->rx_dropped was including packets dropped by napi_gro_receive.
If a packet is dropped by the network stack, it should not be counted
as a driver Rx drop.

Made the necessary changes so that network stack drops are no longer
counted in netdev->rx_dropped.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 89b7820..32ae63b 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -467,7 +467,6 @@ static int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
if (netdev) {
struct lio *lio = GET_LIO(netdev);
struct octeon_device *oct = lio->oct_dev;
-   int packet_was_received;
 
/* Do not proceed if the interface is not in RUNNING state. */
if (!ifstate_check(lio, LIO_IFSTATE_RUNNING)) {
@@ -570,18 +569,10 @@ static int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vtag);
}
 
-   packet_was_received = (napi_gro_receive(napi, skb) != GRO_DROP);
-
-   if (packet_was_received) {
-   droq->stats.rx_bytes_received += len;
-   droq->stats.rx_pkts_received++;
-   } else {
-   droq->stats.rx_dropped++;
-   netif_info(lio, rx_err, lio->netdev,
-  "droq:%d  error rx_dropped:%llu\n",
-  droq->q_no, droq->stats.rx_dropped);
-   }
+   napi_gro_receive(napi, skb);
 
+   droq->stats.rx_bytes_received += len;
+   droq->stats.rx_pkts_received++;
} else {
recv_buffer_free(skb);
}
-- 
1.8.3.1



Re: [PATCH net] tcp: fix tcp_mtu_probe() vs highest_sack

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 19:22 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> Thanks for the fix.
> 
> However, tcp_fastretrans_alert() warning case still remains open even with 
> this patch. Do I understand correctly that these are 2 different issues?
> 
> Currently, I use latest 4.13 stable kernel + this patch and still get:
> 
> WARNING: CPU: 1 PID: 736 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert
> +0x7c8/


My patch only fixed the panics that you guys reported.

The warning issue in fastretrans is a separate problem,
we are still working on it, but at least the effects are not
catastrophic.





Re: [PATCH net-next 09/12] tools: bpftool: turn err() and info() macros into functions

2017-11-03 Thread Quentin Monnet
2017-11-02 17:59 UTC-0700 ~ Joe Perches 
> On Mon, 2017-10-23 at 09:24 -0700, Jakub Kicinski wrote:
>> From: Quentin Monnet 
>>
>> Turn err() and info() macros into functions.
>>
>> In order to avoid naming conflicts with variables in the code, rename
>> them as p_err() and p_info() respectively.
>>
>> The behavior of these functions is similar to that of the macros for
>> plain output. However, when JSON output is requested, these macros
>> return a JSON-formatted "error" object instead of printing a message to
>> stderr.
>>
>> To handle error messages correctly with JSON, a modification was brought
>> to their behavior nonetheless: the functions now append a end-of-line
>> character at the end of the message. This way, we can remove end-of-line
>> characters at the end of the argument strings, and not have them in the
>> JSON output.
>>
>> All error messages are formatted to hold in a single call to p_err(), in
>> order to produce a single JSON field.
>> Signed-off-by: Quentin Monnet 
>> Acked-by: Jakub Kicinski 
> []
>> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> []
>> @@ -97,4 +93,35 @@ int prog_parse_fd(int *argc, char ***argv);
>>  void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes);
>>  void print_hex_data_json(uint8_t *data, size_t len);
>>  
>> +static inline void p_err(const char *fmt, ...)
>> +{
>> +va_list ap;
>> +
>> +va_start(ap, fmt);
>> +if (json_output) {
>> +jsonw_start_object(json_wtr);
>> +jsonw_name(json_wtr, "error");
>> +jsonw_vprintf_enquote(json_wtr, fmt, ap);
>> +jsonw_end_object(json_wtr);
>> +} else {
>> +fprintf(stderr, "Error: ");
>> +vfprintf(stderr, fmt, ap);
>> +fprintf(stderr, "\n");
>> +}
>> +va_end(ap);
>> +}
> inline seems very wasteful.
>
> Why not move p_err and p_info to common.c ?

Hi Joe,
That's a good point. I wrote a patch to change that, Jakub posted it
some minutes ago. Thanks for your feedback!

Quentin


Re: next-20171103 build: 3 failures 22 warnings (next-20171103)

2017-11-03 Thread Masami Hiramatsu
On Fri, 3 Nov 2017 21:16:21 +0100
Arnd Bergmann  wrote:

> On Fri, Nov 3, 2017 at 8:27 PM, Masami Hiramatsu  wrote:
> > On Fri, 3 Nov 2017 15:44:53 +0100 Arnd Bergmann  wrote:
> >> On Fri, Nov 3, 2017 at 1:44 PM, Build bot for Mark Brown 
> >>  wrote:
> >>
> >> > Warnings Summary: 22
> >> >   2 ../net/sctp/probe.c:240:2: warning: 'unregister_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/sctp/probe.c:194:3: warning: 'register_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/sctp/probe.c:189:2: warning: 'register_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/ipv4/tcp_probe.c:298:2: warning: 'unregister_jprobe' 
> >> > is deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/ipv4/tcp_probe.c:280:2: warning: 'register_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/dccp/probe.c:190:2: warning: 'unregister_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/dccp/probe.c:170:4: warning: 'register_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   2 ../net/dccp/probe.c:166:2: warning: 'register_jprobe' is 
> >> > deprecated [-Wdeprecated-declarations]
> >> >   1 ../arch/arm/probes/kprobes/test-core.c:398:2: warning: 
> >> > 'unregister_jprobe' is deprecated [-Wdeprecated-declarations]
> >> >   1 ../arch/arm/probes/kprobes/test-core.c:390:2: warning: 
> >> > 'register_jprobe' is deprecated [-Wdeprecated-declarations]
> >>
> >> I need a little help from Masami Hiramatsu to understand what the plan is 
> >> here:
> >> Do we just need to remove those files now that jprobes are gone, or do
> >> we actually want to restore the functionality using some other replacement 
> >> code?
> >>
> >> I'm asking because the __deprecated warning seems unhelpful if there
> >> isn't an easy way to address the warning.
> >
> > It seems that the arm/probes case is just for testing; I'll just remove it
> > because its functionality is gone.
> 
> Makes sense, thanks!
> 
> > Others should be decided by network maintainers. Maybe those are not used 
> > anymore,
> > or should be rewritten by kprobes or ftrace.
> 
> Added a few people to cc that worked on {tcp,dccp,sctp}_probe in the past
> for clarification. Do you already have plans to deal with this?

As far as I can see, those features cannot simply be fixed using kprobes,
because they use function arguments. Fortunately, we already have
trace-events nowadays. So my suggestion is to add trace-events to the
target functions with the required arguments, rewrite the user-space
tools to use ftrace, and remove the modules.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-03 Thread Alexei Starovoitov
On Fri, Nov 03, 2017 at 05:52:22PM +0100, Daniel Borkmann wrote:
> On 11/03/2017 03:31 PM, Josef Bacik wrote:
> > On Fri, Nov 03, 2017 at 12:12:13AM +0100, Daniel Borkmann wrote:
> > > Hi Josef,
> > > 
> > > one more issue I just noticed, see comment below:
> > > 
> > > On 11/02/2017 03:37 PM, Josef Bacik wrote:
> > > [...]
> > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > index cdd78a7beaae..dfa44fd74bae 100644
> > > > --- a/include/linux/filter.h
> > > > +++ b/include/linux/filter.h
> > > > @@ -458,7 +458,8 @@ struct bpf_prog {
> > > > locked:1,   /* Program image 
> > > > locked? */
> > > > gpl_compatible:1, /* Is filter GPL 
> > > > compatible? */
> > > > cb_access:1,/* Is control block 
> > > > accessed? */
> > > > -   dst_needed:1;   /* Do we need dst 
> > > > entry? */
> > > > +   dst_needed:1,   /* Do we need dst 
> > > > entry? */
> > > > +   kprobe_override:1; /* Do we override a 
> > > > kprobe? */
> > > > kmemcheck_bitfield_end(meta);
> > > > enum bpf_prog_type  type;   /* Type of BPF program 
> > > > */
> > > > u32 len;/* Number of filter 
> > > > blocks */
> > > [...]
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index d906775e12c1..f8f7927a9152 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -4189,6 +4189,8 @@ static int fixup_bpf_calls(struct 
> > > > bpf_verifier_env *env)
> > > > prog->dst_needed = 1;
> > > > if (insn->imm == BPF_FUNC_get_prandom_u32)
> > > > bpf_user_rnd_init_once();
> > > > +   if (insn->imm == BPF_FUNC_override_return)
> > > > +   prog->kprobe_override = 1;
> > > > if (insn->imm == BPF_FUNC_tail_call) {
> > > > /* If we tail call into other programs, we
> > > >  * cannot make any assumptions since they can
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 9660ee65fbef..0d7fce52391d 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -8169,6 +8169,13 @@ static int perf_event_set_bpf_prog(struct 
> > > > perf_event *event, u32 prog_fd)
> > > > return -EINVAL;
> > > > }
> > > > 
> > > > +   /* Kprobe override only works for kprobes, not uprobes. */
> > > > +   if (prog->kprobe_override &&
> > > > +   !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) {
> > > > +   bpf_prog_put(prog);
> > > > +   return -EINVAL;
> > > > +   }
> > > 
> > > Can we somehow avoid the prog->kprobe_override flag here completely
> > > and also same in the perf_event_attach_bpf_prog() handler?
> > > 
> > > Reason is that it's not reliable for bailing out this way: Think of
> > > the main program you're attaching doesn't use bpf_override_return()
> > > helper, but it tail-calls into other BPF progs that make use of it
> > > instead. So above check would be useless and will fail and we continue
> > > to attach the prog for probes where it's not intended to be used.
> > > 
> > > We've had similar issues in the past e.g. c2002f983767 ("bpf: fix
> > > checking xdp_adjust_head on tail calls") is just one of those. Thus,
> > > can we avoid the flag altogether and handle such error case differently?
> > 
> > So if I'm reading this right there's no way to know what we'll tail call at 
> > any
> > given point, so I need to go back to my previous iteration of this patch and
> > always save the state of the kprobe in the per-cpu variable to make sure we
> > don't use bpf_override_return in the wrong case?
> 
> Yeah.
> 
> > The tail call functions won't be in the BPF_PROG_ARRAY right?  It'll be just
> > some other arbitrary function?  If that's the case then we really need 
> > something
> > like this
> 
> With BPF_PROG_ARRAY you mean BPF_MAP_TYPE_PROG_ARRAY or the prog array
> for the tracing/multiprog attach point? The program you're calling into
> is inside the BPF_MAP_TYPE_PROG_ARRAY map, but can change at any time
> and can have nesting as well.
> 
> > https://patchwork.kernel.org/patch/10034815/
> > 
> > and I need to bring that back right?  Thanks,
> 
> I'm afraid so. The thing with skb cb_access which was brought up there is
> that once you have a tail call in the prog you cannot make any assumptions
> anymore, therefore the cb_access flag is set to 1 so we save/restore for
> those cases precautionary since it could be accessed or not later on. In
> your case I think this wouldn't work since legitimate bpf kprobes progs could
> use tail calls today, so setting prog->kprobe_override there would prevent
> attaching for non-kprobes due to subsequent flags & 

[PATCH net-next] tools: bpftool: move p_err() and p_info() from main.h to common.c

2017-11-03 Thread Jakub Kicinski
From: Quentin Monnet 

The two functions were declared as static inline in a header file. There
is no particular reason why they should be inlined; they just happened to
remain in the same header file when they were turned from macros into
functions in a previous commit.

Make them non-inlined functions and move them to the common.c file
instead.

Suggested-by: Joe Perches 
Signed-off-by: Quentin Monnet 
Signed-off-by: Jakub Kicinski 
---
 tools/bpf/bpftool/common.c | 31 +++
 tools/bpf/bpftool/main.h   | 34 +++---
 2 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index f0288269dae8..aa7017098b2a 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -50,6 +50,37 @@
 
 #include "main.h"
 
+void p_err(const char *fmt, ...)
+{
+   va_list ap;
+
+   va_start(ap, fmt);
+   if (json_output) {
+   jsonw_start_object(json_wtr);
+   jsonw_name(json_wtr, "error");
+   jsonw_vprintf_enquote(json_wtr, fmt, ap);
+   jsonw_end_object(json_wtr);
+   } else {
+   fprintf(stderr, "Error: ");
+   vfprintf(stderr, fmt, ap);
+   fprintf(stderr, "\n");
+   }
+   va_end(ap);
+}
+
+void p_info(const char *fmt, ...)
+{
+   va_list ap;
+
+   if (json_output)
+   return;
+
+   va_start(ap, fmt);
+   vfprintf(stderr, fmt, ap);
+   fprintf(stderr, "\n");
+   va_end(ap);
+}
+
 static bool is_bpffs(char *path)
 {
struct statfs st_fs;
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index d315d01be645..ff5ad05b137b 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -71,6 +71,9 @@ extern const char *bin_name;
 extern json_writer_t *json_wtr;
 extern bool json_output;
 
+void p_err(const char *fmt, ...);
+void p_info(const char *fmt, ...);
+
 bool is_prefix(const char *pfx, const char *str);
 void fprint_hex(FILE *f, void *arg, unsigned int n, const char *sep);
 void usage(void) __attribute__((noreturn));
@@ -97,35 +100,4 @@ int prog_parse_fd(int *argc, char ***argv);
 void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes);
 void print_hex_data_json(uint8_t *data, size_t len);
 
-static inline void p_err(const char *fmt, ...)
-{
-   va_list ap;
-
-   va_start(ap, fmt);
-   if (json_output) {
-   jsonw_start_object(json_wtr);
-   jsonw_name(json_wtr, "error");
-   jsonw_vprintf_enquote(json_wtr, fmt, ap);
-   jsonw_end_object(json_wtr);
-   } else {
-   fprintf(stderr, "Error: ");
-   vfprintf(stderr, fmt, ap);
-   fprintf(stderr, "\n");
-   }
-   va_end(ap);
-}
-
-static inline void p_info(const char *fmt, ...)
-{
-   va_list ap;
-
-   if (json_output)
-   return;
-
-   va_start(ap, fmt);
-   vfprintf(stderr, fmt, ap);
-   fprintf(stderr, "\n");
-   va_end(ap);
-}
-
 #endif
-- 
2.14.1
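
The p_err() helper moved above switches between a plain "Error: ..." line on stderr and a JSON "error" object depending on a global flag. A self-contained userspace sketch of that duality; writing into a caller-supplied buffer instead of stderr/json_wtr is an assumption made here to keep the sketch testable, and the hand-rolled JSON does none of the escaping the real jsonw writer performs:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Mirrors bpftool's global of the same name. */
static int json_output;

/* Sketch of p_err(): one variadic helper, two output shapes. */
static void p_err(char *buf, size_t len, const char *fmt, ...)
{
    char msg[256];
    va_list ap;

    /* Format the message once, then wrap it per output mode. */
    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    if (json_output)
        snprintf(buf, len, "{\"error\":\"%s\"}", msg);  /* no escaping here */
    else
        snprintf(buf, len, "Error: %s\n", msg);
}
```

Formatting the whole message in a single call is also why the commit insists on one p_err() per error: each call produces exactly one JSON object.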



[PATCH net-next v2 01/15] net: bpf: rename ndo_xdp to ndo_bpf

2017-11-03 Thread Jakub Kicinski
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h  |  2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |  4 +--
 drivers/net/ethernet/intel/i40e/i40e_main.c|  6 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  4 +--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  6 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +--
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  4 +--
 drivers/net/ethernet/qlogic/qede/qede.h|  2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c   |  4 +--
 drivers/net/tun.c  |  4 +--
 drivers/net/virtio_net.c   |  4 +--
 include/linux/netdevice.h  | 23 ---
 net/core/dev.c | 34 +++---
 net/core/rtnetlink.c   |  4 +--
 17 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4e3d569bf32e..96416f5d97f3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7775,7 +7775,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
 #endif
.ndo_udp_tunnel_add = bnxt_udp_tunnel_add,
.ndo_udp_tunnel_del = bnxt_udp_tunnel_del,
-   .ndo_xdp= bnxt_xdp,
+   .ndo_bpf= bnxt_xdp,
.ndo_bridge_getlink = bnxt_bridge_getlink,
.ndo_bridge_setlink = bnxt_bridge_setlink,
.ndo_get_phys_port_name = bnxt_get_phys_port_name
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 06ce63c00821..261e5847557a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -208,7 +208,7 @@ static int bnxt_xdp_set(struct bnxt *bp, struct bpf_prog *prog)
return 0;
 }
 
-int bnxt_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
struct bnxt *bp = netdev_priv(dev);
int rc;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
index 12a5ad66b564..414b748038ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
@@ -16,6 +16,6 @@ void bnxt_tx_int_xdp(struct bnxt *bp, struct bnxt_napi 
*bnapi, int nr_pkts);
 bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 struct page *page, u8 **data_ptr, unsigned int *len,
 u8 *event);
-int bnxt_xdp(struct net_device *dev, struct netdev_xdp *xdp);
+int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp);
 
 #endif
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 71989e180289..a063c36c4c58 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1741,7 +1741,7 @@ static int nicvf_xdp_setup(struct nicvf *nic, struct 
bpf_prog *prog)
return 0;
 }
 
-static int nicvf_xdp(struct net_device *netdev, struct netdev_xdp *xdp)
+static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp)
 {
struct nicvf *nic = netdev_priv(netdev);
 
@@ -1774,7 +1774,7 @@ static const struct net_device_ops nicvf_netdev_ops = {
.ndo_tx_timeout = nicvf_tx_timeout,
.ndo_fix_features   = nicvf_fix_features,
.ndo_set_features   = nicvf_set_features,
-   .ndo_xdp= nicvf_xdp,
+   .ndo_bpf= nicvf_xdp,
 };
 
 static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dfecaeda0654..05b94d87a6c3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11648,12 +11648,12 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 }
 
 /**
- * i40e_xdp - implements ndo_xdp for i40e
+ * i40e_xdp - implements ndo_bpf for i40e
  * @dev: netdevice
  * @xdp: XDP command
  **/
 static int i40e_xdp(struct net_device *dev,
-   struct netdev_xdp *xdp)
+   struct netdev_bpf 

[PATCH net-next v2 03/15] bpf: report offload info to user space

2017-11-03 Thread Jakub Kicinski
Extend struct bpf_prog_info to carry information about the program
being bound to a device.  Since the netdev may be destroyed while
the program still exists, we need a flag to indicate the program is
loaded for a device, even if the device is gone.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf.h  |  1 +
 include/uapi/linux/bpf.h |  6 ++
 kernel/bpf/offload.c | 12 
 kernel/bpf/syscall.c |  5 +
 4 files changed, 24 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e45d43f9ec92..98bacd0fa5cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -506,6 +506,7 @@ static inline int cpu_map_enqueue(struct bpf_cpu_map_entry 
*rcpu,
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
 void bpf_prog_offload_destroy(struct bpf_prog *prog);
+u32 bpf_prog_offload_ifindex(struct bpf_prog *prog);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 727a3dba13e6..e92f62cf933a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -894,6 +894,10 @@ enum sk_action {
 
 #define BPF_TAG_SIZE   8
 
+enum bpf_prog_status {
+   BPF_PROG_STATUS_DEV_BOUND   = (1 << 0),
+};
+
 struct bpf_prog_info {
__u32 type;
__u32 id;
@@ -907,6 +911,8 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u32 status;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 5553e0e2f8b1..2816feb38be1 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -144,6 +144,18 @@ int bpf_prog_offload_compile(struct bpf_prog *prog)
return bpf_prog_offload_translate(prog);
 }
 
+u32 bpf_prog_offload_ifindex(struct bpf_prog *prog)
+{
+   struct bpf_dev_offload *offload = prog->aux->offload;
+   u32 ifindex;
+
+   rtnl_lock();
+   ifindex = offload->netdev ? offload->netdev->ifindex : 0;
+   rtnl_unlock();
+
+   return ifindex;
+}
+
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1574b9f0f24e..3217c20ea91b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1592,6 +1592,11 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
return -EFAULT;
}
 
+   if (bpf_prog_is_dev_bound(prog->aux)) {
+   info.status |= BPF_PROG_STATUS_DEV_BOUND;
+   info.ifindex = bpf_prog_offload_ifindex(prog);
+   }
+
 done:
if (copy_to_user(uinfo, &info, info_len) ||
    put_user(info_len, &uattr->info.info_len))
-- 
2.14.1



[PATCH net-next v2 05/15] xdp: allow attaching programs loaded for specific device

2017-11-03 Thread Jakub Kicinski
Pass the netdev pointer to bpf_prog_get_type().  This way
the BPF code can decide whether the device matches what the
program was loaded/translated for.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf.h  | 10 ++
 kernel/bpf/syscall.c | 33 +
 net/core/dev.c   |  6 +-
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 98bacd0fa5cc..c397934f91dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -335,6 +335,8 @@ extern const struct bpf_verifier_ops xdp_analyzer_ops;
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
+struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
+  struct net_device *netdev);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -428,6 +430,14 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 {
return ERR_PTR(-EOPNOTSUPP);
 }
+
+static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
+enum bpf_prog_type type,
+struct net_device *netdev)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog 
*prog,
  int i)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3217c20ea91b..68f9123acd39 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1057,7 +1057,22 @@ struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog 
*prog)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_inc_not_zero);
 
-static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type 
*attach_type)
+static bool bpf_prog_can_attach(struct bpf_prog *prog,
+   enum bpf_prog_type *attach_type,
+   struct net_device *netdev)
+{
+   struct bpf_dev_offload *offload = prog->aux->offload;
+
+   if (prog->type != *attach_type)
+   return false;
+   if (offload && offload->netdev != netdev)
+   return false;
+
+   return true;
+}
+
+static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type 
*attach_type,
+  struct net_device *netdev)
 {
struct fd f = fdget(ufd);
struct bpf_prog *prog;
@@ -1065,7 +1080,7 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum 
bpf_prog_type *attach_type)
prog = ____bpf_prog_get(f);
if (IS_ERR(prog))
return prog;
-   if (attach_type && (prog->type != *attach_type || prog->aux->offload)) {
+   if (attach_type && !bpf_prog_can_attach(prog, attach_type, netdev)) {
prog = ERR_PTR(-EINVAL);
goto out;
}
@@ -1078,12 +1093,12 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum 
bpf_prog_type *attach_type)
 
 struct bpf_prog *bpf_prog_get(u32 ufd)
 {
-   return __bpf_prog_get(ufd, NULL);
+   return __bpf_prog_get(ufd, NULL, NULL);
 }
 
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type)
 {
-   struct bpf_prog *prog = __bpf_prog_get(ufd, &type);
+   struct bpf_prog *prog = __bpf_prog_get(ufd, &type, NULL);
 
if (!IS_ERR(prog))
trace_bpf_prog_get_type(prog);
@@ -1091,6 +1106,16 @@ struct bpf_prog *bpf_prog_get_type(u32 ufd, enum 
bpf_prog_type type)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get_type);
 
+struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
+  struct net_device *netdev)
+{
+   struct bpf_prog *prog = __bpf_prog_get(ufd, &type, netdev);
+
+   if (!IS_ERR(prog))
+   trace_bpf_prog_get_type(prog);
+   return prog;
+}
+
 /* last field in 'union bpf_attr' used by this command */
#define BPF_PROG_LOAD_LAST_FIELD prog_target_ifindex
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 10cde58d3275..30b5fe32c525 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7157,7 +7157,11 @@ int dev_change_xdp_fd(struct net_device *dev, struct 
netlink_ext_ack *extack,
__dev_xdp_attached(dev, bpf_op, NULL))
return -EBUSY;
 
-   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+   if (bpf_op == ops->ndo_bpf)
+   prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
+dev);
+   else
+   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
if (IS_ERR(prog))
return PTR_ERR(prog);

[PATCH net-next v2 08/15] nfp: bpf: remove the register renumbering leftovers

2017-11-03 Thread Jakub Kicinski
The register renumbering was removed and will not be coming back
in its old, naive form, given that it would be fundamentally
incompatible with calling functions.  Remove the leftovers.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c |  4 
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  6 --
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 13 -
 3 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index e1907a1d269e..ff150c27f411 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -2314,9 +2314,6 @@ nfp_bpf_jit(struct bpf_prog *filter, void *prog_mem,
if (ret)
goto out;
 
-   nfp_prog->num_regs = MAX_BPF_REG;
-   nfp_prog->regs_per_thread = 32;
-
nfp_prog->prog = prog_mem;
nfp_prog->__prog_alloc_len = prog_sz;
 
@@ -2331,7 +2328,6 @@ nfp_bpf_jit(struct bpf_prog *filter, void *prog_mem,
ret = nfp_bpf_ustore_calc(nfp_prog, (__force __le64 *)prog_mem);
 
res->n_instr = nfp_prog->prog_len;
-   res->dense_mode = false;
 out:
nfp_prog_free(nfp_prog);
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index c5280de2ab14..85b7d9398cda 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -141,8 +141,6 @@ static inline u8 mbpf_mode(const struct nfp_insn_meta *meta)
  * @prog_len: number of valid instructions in @prog array
  * @__prog_alloc_len: alloc size of @prog array
  * @type: BPF program type
- * @num_regs: number of registers used by this program
- * @regs_per_thread: number of basic registers allocated per thread
  * @start_off: address of the first instruction in the memory
  * @tgt_out: jump target for normal exit
  * @tgt_abort: jump target for abort (e.g. access outside of packet buffer)
@@ -159,9 +157,6 @@ struct nfp_prog {
 
enum bpf_prog_type type;
 
-   unsigned int num_regs;
-   unsigned int regs_per_thread;
-
unsigned int start_off;
unsigned int tgt_out;
unsigned int tgt_abort;
@@ -177,7 +172,6 @@ struct nfp_prog {
 
 struct nfp_bpf_result {
unsigned int n_instr;
-   bool dense_mode;
 };
 
 int
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index b9b5d675c4d3..268ba1ba82db 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -98,19 +98,14 @@ nfp_net_bpf_offload_prepare(struct nfp_net *nn,
 static void
 nfp_net_bpf_load_and_start(struct nfp_net *nn, u32 tc_flags,
   void *code, dma_addr_t dma_addr,
-  unsigned int code_sz, unsigned int n_instr,
-  bool dense_mode)
+  unsigned int code_sz, unsigned int n_instr)
 {
-   u64 bpf_addr = dma_addr;
int err;
 
nn->dp.bpf_offload_skip_sw = !!(tc_flags & TCA_CLS_FLAGS_SKIP_SW);
 
-   if (dense_mode)
-   bpf_addr |= NFP_NET_CFG_BPF_CFG_8CTX;
-
nn_writew(nn, NFP_NET_CFG_BPF_SIZE, n_instr);
-   nn_writeq(nn, NFP_NET_CFG_BPF_ADDR, bpf_addr);
+   nn_writeq(nn, NFP_NET_CFG_BPF_ADDR, dma_addr);
 
/* Load up the JITed code */
err = nfp_net_reconfig(nn, NFP_NET_CFG_UPDATE_BPF);
@@ -169,7 +164,7 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct 
tc_cls_bpf_offload *cls_bpf)
nfp_net_bpf_stop(nn);
nfp_net_bpf_load_and_start(nn, cls_bpf->gen_flags, code,
   dma_addr, max_instr * sizeof(u64),
-  res.n_instr, res.dense_mode);
+  res.n_instr);
return 0;
 
case TC_CLSBPF_ADD:
@@ -183,7 +178,7 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct 
tc_cls_bpf_offload *cls_bpf)
 
nfp_net_bpf_load_and_start(nn, cls_bpf->gen_flags, code,
   dma_addr, max_instr * sizeof(u64),
-  res.n_instr, res.dense_mode);
+  res.n_instr);
return 0;
 
case TC_CLSBPF_DESTROY:
-- 
2.14.1



[PATCH net-next v2 09/15] nfp: bpf: remove unnecessary include of nfp_net.h

2017-11-03 Thread Jakub Kicinski
BPF offload's main header does not need to include nfp_net.h.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 85b7d9398cda..9f0df6a9786d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -41,7 +41,6 @@
 #include 
 
 #include "../nfp_asm.h"
-#include "../nfp_net.h"
 
 /* For branch fixup logic use up-most byte of branch instruction as scratch
  * area.  Remember to clear this before sending instructions to HW!
-- 
2.14.1



[PATCH net-next v2 10/15] nfp: bpf: refactor offload logic

2017-11-03 Thread Jakub Kicinski
We currently create a fake cls_bpf offload object when we want
to offload XDP.  Simplify and clarify the code by moving the
TC/XDP-specific logic out of the common offload code.  This is easy
now that we don't support legacy TC actions.  We only need the
BPF program and the state of the skip_sw flag.

Temporarily set @code to NULL in nfp_net_bpf_offload(); compilers
seem to have trouble recognizing it's always initialized.  The next
patches will eliminate that variable.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c| 67 +++---
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  4 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 73 ++--
 3 files changed, 67 insertions(+), 77 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 2ff97f12c160..9e1286346d42 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -54,28 +54,25 @@ static int
 nfp_bpf_xdp_offload(struct nfp_app *app, struct nfp_net *nn,
struct bpf_prog *prog)
 {
-   struct tc_cls_bpf_offload cmd = {
-   .prog = prog,
-   };
+   bool running, xdp_running;
int ret;
 
if (!nfp_net_ebpf_capable(nn))
return -EINVAL;
 
-   if (nn->dp.ctrl & NFP_NET_CFG_CTRL_BPF) {
-   if (!nn->dp.bpf_offload_xdp)
-   return prog ? -EBUSY : 0;
-   cmd.command = prog ? TC_CLSBPF_REPLACE : TC_CLSBPF_DESTROY;
-   } else {
-   if (!prog)
-   return 0;
-   cmd.command = TC_CLSBPF_ADD;
-   }
+   running = nn->dp.ctrl & NFP_NET_CFG_CTRL_BPF;
+   xdp_running = running && nn->dp.bpf_offload_xdp;
+
+   if (!prog && !xdp_running)
+   return 0;
+   if (prog && running && !xdp_running)
+   return -EBUSY;
 
-   ret = nfp_net_bpf_offload(nn, &cmd);
+   ret = nfp_net_bpf_offload(nn, prog, running, true);
/* Stop offload if replace not possible */
-   if (ret && cmd.command == TC_CLSBPF_REPLACE)
+   if (ret && prog)
nfp_bpf_xdp_offload(app, nn, NULL);
+
nn->dp.bpf_offload_xdp = prog && !ret;
return ret;
 }
@@ -96,27 +93,33 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type 
type,
 {
struct tc_cls_bpf_offload *cls_bpf = type_data;
struct nfp_net *nn = cb_priv;
+   bool skip_sw;
+
+   if (type != TC_SETUP_CLSBPF ||
+   !tc_can_offload(nn->dp.netdev) ||
+   !nfp_net_ebpf_capable(nn) ||
+   cls_bpf->common.protocol != htons(ETH_P_ALL) ||
+   cls_bpf->common.chain_index)
+   return -EOPNOTSUPP;
+   if (nn->dp.bpf_offload_xdp)
+   return -EBUSY;
 
-   if (!tc_can_offload(nn->dp.netdev))
+   /* Only support TC direct action */
+   if (!cls_bpf->exts_integrated ||
+   tcf_exts_has_actions(cls_bpf->exts)) {
+   nn_err(nn, "only direct action with no legacy actions supported\n");
return -EOPNOTSUPP;
+   }
 
-   switch (type) {
-   case TC_SETUP_CLSBPF:
-   if (!nfp_net_ebpf_capable(nn) ||
-   cls_bpf->common.protocol != htons(ETH_P_ALL) ||
-   cls_bpf->common.chain_index)
-   return -EOPNOTSUPP;
-   if (nn->dp.bpf_offload_xdp)
-   return -EBUSY;
-
-   /* Only support TC direct action */
-   if (!cls_bpf->exts_integrated ||
-   tcf_exts_has_actions(cls_bpf->exts)) {
-   nn_err(nn, "only direct action with no legacy actions supported\n");
-   return -EOPNOTSUPP;
-   }
-
-   return nfp_net_bpf_offload(nn, cls_bpf);
+   skip_sw = !!(cls_bpf->gen_flags & TCA_CLS_FLAGS_SKIP_SW);
+
+   switch (cls_bpf->command) {
+   case TC_CLSBPF_REPLACE:
+   return nfp_net_bpf_offload(nn, cls_bpf->prog, true, !skip_sw);
+   case TC_CLSBPF_ADD:
+   return nfp_net_bpf_offload(nn, cls_bpf->prog, false, !skip_sw);
+   case TC_CLSBPF_DESTROY:
+   return nfp_net_bpf_offload(nn, NULL, true, !skip_sw);
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 9f0df6a9786d..6dddab95d57a 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -181,8 +181,8 @@ nfp_bpf_jit(struct bpf_prog *filter, void *prog,
 int nfp_prog_verify(struct nfp_prog *nfp_prog, struct bpf_prog *prog);
 
 struct nfp_net;
-struct tc_cls_bpf_offload;
 
-int nfp_net_bpf_offload(struct nfp_net *nn, struct tc_cls_bpf_offload 

[PATCH net-next v2 15/15] bpf: remove old offload/analyzer

2017-11-03 Thread Jakub Kicinski
Thanks to the ability to load a program for a specific device,
running the verifier twice is no longer needed.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf_verifier.h |  5 ---
 kernel/bpf/verifier.c| 75 
 net/core/filter.c| 42 -
 3 files changed, 122 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index e45011dbc02d..07b96aaca256 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -152,9 +152,7 @@ struct bpf_verifier_env {
bool strict_alignment;  /* perform strict pointer alignment 
checks */
struct bpf_verifier_state *cur_state; /* current verifier state */
struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
-   const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer 
ops */
const struct bpf_ext_analyzer_ops *dev_ops; /* device analyzer ops */
-   void *analyzer_priv; /* pointer to external analyzer's private data */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
u32 used_map_cnt;   /* number of used maps */
u32 id_gen; /* used to generate unique reg IDs */
@@ -179,7 +177,4 @@ int bpf_prog_offload_verifier_prep(struct bpf_verifier_env 
*env)
 }
 #endif
 
-int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
-void *priv);
-
 #endif /* _LINUX_BPF_VERIFIER_H */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 51aabb32ad67..add845fe788a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -949,9 +949,6 @@ static int check_ctx_access(struct bpf_verifier_env *env, 
int insn_idx, int off,
 */
*reg_type = info.reg_type;
 
-   if (env->analyzer_ops)
-   return 0;
-
env->insn_aux_data[insn_idx].ctx_field_size = 
info.ctx_field_size;
/* remember the offset of last byte accessed in ctx */
if (env->prog->aux->max_ctx_offset < off + size)
@@ -3736,9 +3733,6 @@ static int is_state_visited(struct bpf_verifier_env *env, 
int insn_idx)
 static int ext_analyzer_insn_hook(struct bpf_verifier_env *env,
  int insn_idx, int prev_insn_idx)
 {
-   if (env->analyzer_ops && env->analyzer_ops->insn_hook)
-   return env->analyzer_ops->insn_hook(env, insn_idx,
-   prev_insn_idx);
if (env->dev_ops && env->dev_ops->insn_hook)
return env->dev_ops->insn_hook(env, insn_idx, prev_insn_idx);
 
@@ -4601,72 +4595,3 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)
kfree(env);
return ret;
 }
-
-static const struct bpf_verifier_ops * const bpf_analyzer_ops[] = {
-#ifdef CONFIG_NET
-   [BPF_PROG_TYPE_XDP] = &xdp_analyzer_ops,
-   [BPF_PROG_TYPE_SCHED_CLS]   = &tc_cls_act_analyzer_ops,
-#endif
-};
-
-int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
-void *priv)
-{
-   struct bpf_verifier_env *env;
-   int ret;
-
-   if (prog->type >= ARRAY_SIZE(bpf_analyzer_ops) ||
-   !bpf_analyzer_ops[prog->type])
-   return -EOPNOTSUPP;
-
-   env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
-   if (!env)
-   return -ENOMEM;
-
-   env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) *
-prog->len);
-   ret = -ENOMEM;
-   if (!env->insn_aux_data)
-   goto err_free_env;
-   env->prog = prog;
-   env->ops = bpf_analyzer_ops[env->prog->type];
-   env->analyzer_ops = ops;
-   env->analyzer_priv = priv;
-
-   /* grab the mutex to protect few globals used by verifier */
-   mutex_lock(&bpf_verifier_lock);
-
-   env->strict_alignment = false;
-   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
-   env->strict_alignment = true;
-
-   env->explored_states = kcalloc(env->prog->len,
-  sizeof(struct bpf_verifier_state_list *),
-  GFP_KERNEL);
-   ret = -ENOMEM;
-   if (!env->explored_states)
-   goto skip_full_check;
-
-   ret = check_cfg(env);
-   if (ret < 0)
-   goto skip_full_check;
-
-   env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
-
-   ret = do_check(env);
-   if (env->cur_state) {
-   free_verifier_state(env->cur_state, true);
-   env->cur_state = NULL;
-   }
-
-skip_full_check:
-   while (!pop_stack(env, NULL, NULL));
-   free_states(env);
-
-   mutex_unlock(&bpf_verifier_lock);
-   

[PATCH net-next v2 13/15] nfp: bpf: move translation prepare to offload.c

2017-11-03 Thread Jakub Kicinski
struct nfp_prog is currently only used internally by the translator.
This means there is a lot of parameter passing going on between
the translator and the different stages of offload.  Simplify things
by allocating nfp_prog in offload.c already.

We will now use kmalloc() to allocate the program area and only
DMA-map it for the time of loading (instead of allocating DMA
coherent memory upfront).
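The allocation pattern described above looks roughly like this (kernel-style sketch, not runnable on its own; error handling is trimmed and every name other than the kmalloc/DMA API calls is illustrative):

```c
/* Allocate the JIT output area with plain kmalloc()... */
code = kmalloc(code_sz, GFP_KERNEL);
if (!code)
	return -ENOMEM;

/* ...and keep it DMA-mapped only for the duration of the load,
 * instead of holding dma_alloc_coherent() memory for the whole
 * lifetime of the program. */
dma_addr = dma_map_single(dev, code, code_sz, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma_addr))
	goto err_free;

err = load_program_to_nic(nn, dma_addr, code_sz);	/* illustrative */

dma_unmap_single(dev, dma_addr, code_sz, DMA_TO_DEVICE);
```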

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c |  43 ++--
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  14 +--
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 128 +++
 3 files changed, 94 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 2eddbb45fd60..eae7a137a7a8 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -2245,58 +2245,27 @@ static int nfp_bpf_ustore_calc(struct nfp_prog 
*nfp_prog, __le64 *ustore)
 
 /**
  * nfp_bpf_jit() - translate BPF code into NFP assembly
+ * @nfp_prog:  nfp_prog prepared based on @filter
  * @filter:kernel BPF filter struct
- * @prog_mem:  memory to store assembler instructions
- * @prog_start:offset of the first instruction when loaded
- * @prog_done: where to jump on exit
- * @prog_sz:   size of @prog_mem in instructions
- * @res:   achieved parameters of translation results
  */
-int
-nfp_bpf_jit(struct bpf_prog *filter, void *prog_mem,
-   unsigned int prog_start, unsigned int prog_done,
-   unsigned int prog_sz, struct nfp_bpf_result *res)
+int nfp_bpf_jit(struct nfp_prog *nfp_prog, struct bpf_prog *filter)
 {
-   struct nfp_prog *nfp_prog;
int ret;
 
-   nfp_prog = kzalloc(sizeof(*nfp_prog), GFP_KERNEL);
-   if (!nfp_prog)
-   return -ENOMEM;
-
-   INIT_LIST_HEAD(&nfp_prog->insns);
-   nfp_prog->type = filter->type;
-   nfp_prog->start_off = prog_start;
-   nfp_prog->tgt_done = prog_done;
-
-   ret = nfp_prog_prepare(nfp_prog, filter->insnsi, filter->len);
-   if (ret)
-   goto out;
-
ret = nfp_prog_verify(nfp_prog, filter);
if (ret)
-   goto out;
+   return ret;
 
ret = nfp_bpf_optimize(nfp_prog);
if (ret)
-   goto out;
-
-   nfp_prog->prog = prog_mem;
-   nfp_prog->__prog_alloc_len = prog_sz;
+   return ret;
 
ret = nfp_translate(nfp_prog);
if (ret) {
pr_err("Translation failed with error %d (translated: %u)\n",
   ret, nfp_prog->n_translated);
-   ret = -EINVAL;
-   goto out;
+   return -EINVAL;
}
 
-   ret = nfp_bpf_ustore_calc(nfp_prog, (__force __le64 *)prog_mem);
-
-   res->n_instr = nfp_prog->prog_len;
-out:
-   nfp_prog_free(nfp_prog);
-
-   return ret;
+   return nfp_bpf_ustore_calc(nfp_prog, (__force __le64 *)nfp_prog->prog);
 }
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index b77231a134b9..36b4eda2d3f8 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -169,19 +169,7 @@ struct nfp_prog {
struct list_head insns;
 };
 
-struct nfp_bpf_result {
-   unsigned int n_instr;
-};
-
-int
-nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
-unsigned int cnt);
-void nfp_prog_free(struct nfp_prog *nfp_prog);
-
-int
-nfp_bpf_jit(struct bpf_prog *filter, void *prog,
-   unsigned int prog_start, unsigned int prog_done,
-   unsigned int prog_sz, struct nfp_bpf_result *res);
+int nfp_bpf_jit(struct nfp_prog *nfp_prog, struct bpf_prog *filter);
 
 int nfp_prog_verify(struct nfp_prog *nfp_prog, struct bpf_prog *prog);
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 3200051e..c5546c0e87d8 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -51,7 +51,7 @@
 #include "../nfp_net_ctrl.h"
 #include "../nfp_net.h"
 
-int
+static int
 nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
 unsigned int cnt)
 {
@@ -73,7 +73,7 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct 
bpf_insn *prog,
return 0;
 }
 
-void nfp_prog_free(struct nfp_prog *nfp_prog)
+static void nfp_prog_free(struct nfp_prog *nfp_prog)
 {
struct nfp_insn_meta *meta, *tmp;
 
@@ -84,25 +84,36 @@ void nfp_prog_free(struct nfp_prog *nfp_prog)
kfree(nfp_prog);
 }
 
-static int
-nfp_net_bpf_offload_prepare(struct nfp_net *nn, struct bpf_prog *prog,
-   struct nfp_bpf_result *res,
-

[PATCH net-next v2 02/15] bpf: offload: add infrastructure for loading programs for a specific netdev

2017-11-03 Thread Jakub Kicinski
The fact that we don't know which device the program is going
to be used on is quite limiting in the current eBPF infrastructure.
We have to reverse or limit the changes the kernel makes to the
loaded bytecode if we want it to be offloaded to a networking
device.  We also have to invent new APIs for debugging and
troubleshooting support.

Make it possible to load programs for a specific netdev.  This
helps us bring the debug information closer to the core
eBPF infrastructure (e.g. we will be able to reuse the verifier
log in device JITs).  It allows device JITs to perform translation
on the original bytecode.

__bpf_prog_get(), when called to get a reference for an attachment
point, will now refuse to give one if the program has a device
assigned.  Following patches will add a version of that function
which passes the expected netdev in.  The @type argument in
__bpf_prog_get() is renamed to attach_type to make it clearer that
it's only set on attachment.

All calls to ndo_bpf are protected by rtnl; only the verifier
callbacks are not.  We need a wait queue to make sure the netdev
doesn't get destroyed while the verifier is still running and
calling into its driver.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 include/linux/bpf.h  |  36 +
 include/linux/bpf_verifier.h |  10 +++
 include/linux/netdevice.h|  14 
 include/uapi/linux/bpf.h |   1 +
 kernel/bpf/Makefile  |   1 +
 kernel/bpf/core.c|  10 ++-
 kernel/bpf/offload.c | 182 +++
 kernel/bpf/syscall.c |  17 +++-
 kernel/bpf/verifier.c|  15 +++-
 9 files changed, 278 insertions(+), 8 deletions(-)
 create mode 100644 kernel/bpf/offload.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 520aeebe0d93..e45d43f9ec92 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct perf_event;
 struct bpf_prog;
@@ -182,6 +183,16 @@ struct bpf_verifier_ops {
  struct bpf_prog *prog, u32 *target_size);
 };
 
+struct bpf_dev_offload {
+   struct bpf_prog *prog;
+   struct net_device   *netdev;
+   void*dev_priv;
+   struct list_headoffloads;
+   booldev_state;
+   boolverifier_running;
+   wait_queue_head_t   verifier_done;
+};
+
 struct bpf_prog_aux {
atomic_t refcnt;
u32 used_map_cnt;
@@ -199,6 +210,7 @@ struct bpf_prog_aux {
 #ifdef CONFIG_SECURITY
void *security;
 #endif
+   struct bpf_dev_offload *offload;
union {
struct work_struct work;
struct rcu_head rcu;
@@ -317,6 +329,7 @@ extern const struct file_operations bpf_prog_fops;
 #undef BPF_PROG_TYPE
 #undef BPF_MAP_TYPE
 
+extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
 
@@ -491,6 +504,29 @@ static inline int cpu_map_enqueue(struct bpf_cpu_map_entry 
*rcpu,
 }
 #endif /* CONFIG_BPF_SYSCALL */
 
+int bpf_prog_offload_compile(struct bpf_prog *prog);
+void bpf_prog_offload_destroy(struct bpf_prog *prog);
+
+#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
+
+static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
+{
+   return aux->offload;
+}
+#else
+static inline int bpf_prog_offload_init(struct bpf_prog *prog,
+   union bpf_attr *attr)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
+{
+   return false;
+}
+#endif /* CONFIG_NET && CONFIG_BPF_SYSCALL */
+
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 3b0976aaac75..e45011dbc02d 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -153,6 +153,7 @@ struct bpf_verifier_env {
struct bpf_verifier_state *cur_state; /* current verifier state */
struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer 
ops */
+   const struct bpf_ext_analyzer_ops *dev_ops; /* device analyzer ops */
void *analyzer_priv; /* pointer to external analyzer's private data */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
u32 used_map_cnt;   /* number of used maps */
@@ -169,6 +170,15 @@ static inline struct bpf_reg_state 

[PATCH net-next v2 06/15] cls_bpf: allow attaching programs loaded for specific device

2017-11-03 Thread Jakub Kicinski
If a TC program is loaded with the skip_sw flag, device-specific
programs should be accepted.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 kernel/bpf/syscall.c |  1 +
 net/sched/cls_bpf.c  | 10 +++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 68f9123acd39..416d70cdfc76 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1115,6 +1115,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum 
bpf_prog_type type,
trace_bpf_prog_get_type(prog);
return prog;
 }
+EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev);
 
 /* last field in 'union bpf_attr' used by this command */
 #defineBPF_PROG_LOAD_LAST_FIELD prog_target_ifindex
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index bc3edde1b9d7..dc9bd9a0070b 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -374,7 +374,7 @@ static int cls_bpf_prog_from_ops(struct nlattr **tb, struct 
cls_bpf_prog *prog)
 }
 
 static int cls_bpf_prog_from_efd(struct nlattr **tb, struct cls_bpf_prog *prog,
-const struct tcf_proto *tp)
+u32 gen_flags, const struct tcf_proto *tp)
 {
struct bpf_prog *fp;
char *name = NULL;
@@ -382,7 +382,11 @@ static int cls_bpf_prog_from_efd(struct nlattr **tb, 
struct cls_bpf_prog *prog,
 
bpf_fd = nla_get_u32(tb[TCA_BPF_FD]);
 
-   fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_CLS);
+   if (gen_flags & TCA_CLS_FLAGS_SKIP_SW)
+   fp = bpf_prog_get_type_dev(bpf_fd, BPF_PROG_TYPE_SCHED_CLS,
+  qdisc_dev(tp->q));
+   else
+   fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_CLS);
if (IS_ERR(fp))
return PTR_ERR(fp);
 
@@ -440,7 +444,7 @@ static int cls_bpf_set_parms(struct net *net, struct 
tcf_proto *tp,
prog->gen_flags = gen_flags;
 
ret = is_bpf ? cls_bpf_prog_from_ops(tb, prog) :
-  cls_bpf_prog_from_efd(tb, prog, tp);
+  cls_bpf_prog_from_efd(tb, prog, gen_flags, tp);
if (ret < 0)
return ret;
 
-- 
2.14.1



[PATCH net-next v2 12/15] nfp: bpf: move program prepare and free into offload.c

2017-11-03 Thread Jakub Kicinski
Most of the offload/translation prepare logic will be moved to
offload.c.  To help git generate more reasonable diffs, move the
nfp_prog_prepare() and nfp_prog_free() functions there as a first
step.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 33 
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  5 
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 33 
 3 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index ff150c27f411..2eddbb45fd60 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -77,17 +77,6 @@ nfp_meta_has_prev(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
	return meta->l.prev != &nfp_prog->insns;
 }
 
-static void nfp_prog_free(struct nfp_prog *nfp_prog)
-{
-   struct nfp_insn_meta *meta, *tmp;
-
-   list_for_each_entry_safe(meta, tmp, &nfp_prog->insns, l) {
-   list_del(&meta->l);
-   kfree(meta);
-   }
-   kfree(nfp_prog);
-}
-
 static void nfp_prog_push(struct nfp_prog *nfp_prog, u64 insn)
 {
if (nfp_prog->__prog_alloc_len == nfp_prog->prog_len) {
@@ -2127,28 +2116,6 @@ static int nfp_translate(struct nfp_prog *nfp_prog)
return nfp_fixup_branches(nfp_prog);
 }
 
-static int
-nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
-unsigned int cnt)
-{
-   unsigned int i;
-
-   for (i = 0; i < cnt; i++) {
-   struct nfp_insn_meta *meta;
-
-   meta = kzalloc(sizeof(*meta), GFP_KERNEL);
-   if (!meta)
-   return -ENOMEM;
-
-   meta->insn = prog[i];
-   meta->n = i;
-
-   list_add_tail(&meta->l, &nfp_prog->insns);
-   }
-
-   return 0;
-}
-
 /* --- Optimizations --- */
 static void nfp_bpf_opt_reg_init(struct nfp_prog *nfp_prog)
 {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index df56f40fea7c..b77231a134b9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -173,6 +173,11 @@ struct nfp_bpf_result {
unsigned int n_instr;
 };
 
+int
+nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
+unsigned int cnt);
+void nfp_prog_free(struct nfp_prog *nfp_prog);
+
 int
 nfp_bpf_jit(struct bpf_prog *filter, void *prog,
unsigned int prog_start, unsigned int prog_done,
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index f4b9a46c844d..3200051e 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -51,6 +51,39 @@
 #include "../nfp_net_ctrl.h"
 #include "../nfp_net.h"
 
+int
+nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct bpf_insn *prog,
+unsigned int cnt)
+{
+   unsigned int i;
+
+   for (i = 0; i < cnt; i++) {
+   struct nfp_insn_meta *meta;
+
+   meta = kzalloc(sizeof(*meta), GFP_KERNEL);
+   if (!meta)
+   return -ENOMEM;
+
+   meta->insn = prog[i];
+   meta->n = i;
+
+   list_add_tail(&meta->l, &nfp_prog->insns);
+   }
+
+   return 0;
+}
+
+void nfp_prog_free(struct nfp_prog *nfp_prog)
+{
+   struct nfp_insn_meta *meta, *tmp;
+
+   list_for_each_entry_safe(meta, tmp, &nfp_prog->insns, l) {
+   list_del(&meta->l);
+   kfree(meta);
+   }
+   kfree(nfp_prog);
+}
+
 static int
 nfp_net_bpf_offload_prepare(struct nfp_net *nn, struct bpf_prog *prog,
struct nfp_bpf_result *res,
-- 
2.14.1



[PATCH net-next v2 07/15] nfp: bpf: drop support for cls_bpf with legacy actions

2017-11-03 Thread Jakub Kicinski
Only support BPF_PROG_TYPE_SCHED_CLS programs in direct
action mode.  This simplifies preparing the offload since
there will now be only one mode of operation for that type
of program.  We need to know the attachment mode type of
cls_bpf programs, because exit codes are interpreted
differently for legacy vs DA mode.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  |  87 ++---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  33 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  30 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c  | 108 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  11 +--
 5 files changed, 22 insertions(+), 247 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 2609a2487100..e1907a1d269e 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -201,47 +201,6 @@ emit_br(struct nfp_prog *nfp_prog, enum br_mask mask, u16 
addr, u8 defer)
  BR_CSS_NONE, addr, defer);
 }
 
-static void
-__emit_br_byte(struct nfp_prog *nfp_prog, u8 areg, u8 breg, bool imm8,
-  u8 byte, bool equal, u16 addr, u8 defer, bool src_lmextn)
-{
-   u16 addr_lo, addr_hi;
-   u64 insn;
-
-   addr_lo = addr & (OP_BB_ADDR_LO >> __bf_shf(OP_BB_ADDR_LO));
-   addr_hi = addr != addr_lo;
-
-   insn = OP_BBYTE_BASE |
-   FIELD_PREP(OP_BB_A_SRC, areg) |
-   FIELD_PREP(OP_BB_BYTE, byte) |
-   FIELD_PREP(OP_BB_B_SRC, breg) |
-   FIELD_PREP(OP_BB_I8, imm8) |
-   FIELD_PREP(OP_BB_EQ, equal) |
-   FIELD_PREP(OP_BB_DEFBR, defer) |
-   FIELD_PREP(OP_BB_ADDR_LO, addr_lo) |
-   FIELD_PREP(OP_BB_ADDR_HI, addr_hi) |
-   FIELD_PREP(OP_BB_SRC_LMEXTN, src_lmextn);
-
-   nfp_prog_push(nfp_prog, insn);
-}
-
-static void
-emit_br_byte_neq(struct nfp_prog *nfp_prog,
-swreg src, u8 imm, u8 byte, u16 addr, u8 defer)
-{
-   struct nfp_insn_re_regs reg;
-   int err;
-
-   err = swreg_to_restricted(reg_none(), src, reg_imm(imm), &reg, true);
-   if (err) {
-   nfp_prog->error = err;
-   return;
-   }
-
-   __emit_br_byte(nfp_prog, reg.areg, reg.breg, reg.i8, byte, false, addr,
-  defer, reg.src_lmextn);
-}
-
 static void
 __emit_immed(struct nfp_prog *nfp_prog, u16 areg, u16 breg, u16 imm_hi,
 enum immed_width width, bool invert,
@@ -1547,7 +1506,7 @@ mem_ldx(struct nfp_prog *nfp_prog, struct nfp_insn_meta 
*meta,
unsigned int size)
 {
if (meta->ptr.type == PTR_TO_CTX) {
-   if (nfp_prog->act == NN_ACT_XDP)
+   if (nfp_prog->type == BPF_PROG_TYPE_XDP)
return mem_ldx_xdp(nfp_prog, meta, size);
else
return mem_ldx_skb(nfp_prog, meta, size);
@@ -2022,34 +1981,6 @@ static void nfp_intro(struct nfp_prog *nfp_prog)
 plen_reg(nfp_prog), ALU_OP_AND, pv_len(nfp_prog));
 }
 
-static void nfp_outro_tc_legacy(struct nfp_prog *nfp_prog)
-{
-   const u8 act2code[] = {
-   [NN_ACT_TC_DROP]  = 0x22,
-   [NN_ACT_TC_REDIR] = 0x24
-   };
-   /* Target for aborts */
-   nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog);
-   wrp_immed(nfp_prog, reg_both(0), 0);
-
-   /* Target for normal exits */
-   nfp_prog->tgt_out = nfp_prog_current_offset(nfp_prog);
-   /* Legacy TC mode:
-*   00x11 -> pass,  count as stat0
-*  -1  drop  0x22 -> drop,  count as stat1
-* redir  0x24 -> redir, count as stat1
-*  ife mark  0x21 -> pass,  count as stat1
-*  ife + tx  0x24 -> redir, count as stat1
-*/
-   emit_br_byte_neq(nfp_prog, reg_b(0), 0xff, 0, nfp_prog->tgt_done, 2);
-   wrp_mov(nfp_prog, reg_a(0), NFP_BPF_ABI_FLAGS);
-   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_imm(0x11), SHF_SC_L_SHF, 16);
-
-   emit_br(nfp_prog, BR_UNC, nfp_prog->tgt_done, 1);
-   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_imm(act2code[nfp_prog->act]),
- SHF_SC_L_SHF, 16);
-}
-
 static void nfp_outro_tc_da(struct nfp_prog *nfp_prog)
 {
/* TC direct-action mode:
@@ -2142,17 +2073,15 @@ static void nfp_outro_xdp(struct nfp_prog *nfp_prog)
 
 static void nfp_outro(struct nfp_prog *nfp_prog)
 {
-   switch (nfp_prog->act) {
-   case NN_ACT_DIRECT:
+   switch (nfp_prog->type) {
+   case BPF_PROG_TYPE_SCHED_CLS:
nfp_outro_tc_da(nfp_prog);
break;
-   case NN_ACT_TC_DROP:
-   case NN_ACT_TC_REDIR:
-   nfp_outro_tc_legacy(nfp_prog);
-   break;
-   case NN_ACT_XDP:
+   case 

[PATCH net-next v2 11/15] nfp: bpf: require seamless reload for program replace

2017-11-03 Thread Jakub Kicinski
Firmware has supported live replacement of programs for quite some
time now.  Remove the software-fallback related logic and
depend on the FW for program replace.  Seamless reload will
become a requirement if maps are present, anyway.

Load and start stages have to be split now, since replace
only needs a load, start has already been done on add.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c| 11 ++---
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  2 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 62 
 drivers/net/ethernet/netronome/nfp/nfp_net.h |  2 -
 4 files changed, 35 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 9e1286346d42..7ae7528cd96b 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -68,7 +68,7 @@ nfp_bpf_xdp_offload(struct nfp_app *app, struct nfp_net *nn,
if (prog && running && !xdp_running)
return -EBUSY;
 
-   ret = nfp_net_bpf_offload(nn, prog, running, true);
+   ret = nfp_net_bpf_offload(nn, prog, running);
/* Stop offload if replace not possible */
if (ret && prog)
nfp_bpf_xdp_offload(app, nn, NULL);
@@ -93,7 +93,6 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type,
 {
struct tc_cls_bpf_offload *cls_bpf = type_data;
struct nfp_net *nn = cb_priv;
-   bool skip_sw;
 
if (type != TC_SETUP_CLSBPF ||
!tc_can_offload(nn->dp.netdev) ||
@@ -111,15 +110,13 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type 
type,
return -EOPNOTSUPP;
}
 
-   skip_sw = !!(cls_bpf->gen_flags & TCA_CLS_FLAGS_SKIP_SW);
-
switch (cls_bpf->command) {
case TC_CLSBPF_REPLACE:
-   return nfp_net_bpf_offload(nn, cls_bpf->prog, true, !skip_sw);
+   return nfp_net_bpf_offload(nn, cls_bpf->prog, true);
case TC_CLSBPF_ADD:
-   return nfp_net_bpf_offload(nn, cls_bpf->prog, false, !skip_sw);
+   return nfp_net_bpf_offload(nn, cls_bpf->prog, false);
case TC_CLSBPF_DESTROY:
-   return nfp_net_bpf_offload(nn, NULL, true, !skip_sw);
+   return nfp_net_bpf_offload(nn, NULL, true);
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 6dddab95d57a..df56f40fea7c 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -183,6 +183,6 @@ int nfp_prog_verify(struct nfp_prog *nfp_prog, struct 
bpf_prog *prog);
 struct nfp_net;
 
 int nfp_net_bpf_offload(struct nfp_net *nn, struct bpf_prog *prog,
-   bool old_prog, bool sw_fallback);
+   bool old_prog);
 
 #endif
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index c09efa1a9649..f4b9a46c844d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -94,14 +94,11 @@ nfp_net_bpf_offload_prepare(struct nfp_net *nn, struct 
bpf_prog *prog,
 }
 
 static void
-nfp_net_bpf_load_and_start(struct nfp_net *nn, bool sw_fallback,
-  void *code, dma_addr_t dma_addr,
-  unsigned int code_sz, unsigned int n_instr)
+nfp_net_bpf_load(struct nfp_net *nn, void *code, dma_addr_t dma_addr,
+unsigned int code_sz, unsigned int n_instr)
 {
int err;
 
-   nn->dp.bpf_offload_skip_sw = !sw_fallback;
-
nn_writew(nn, NFP_NET_CFG_BPF_SIZE, n_instr);
nn_writeq(nn, NFP_NET_CFG_BPF_ADDR, dma_addr);
 
@@ -110,14 +107,19 @@ nfp_net_bpf_load_and_start(struct nfp_net *nn, bool 
sw_fallback,
if (err)
nn_err(nn, "FW command error while loading BPF: %d\n", err);
 
+   dma_free_coherent(nn->dp.dev, code_sz, code, dma_addr);
+}
+
+static void nfp_net_bpf_start(struct nfp_net *nn)
+{
+   int err;
+
/* Enable passing packets through BPF function */
nn->dp.ctrl |= NFP_NET_CFG_CTRL_BPF;
nn_writel(nn, NFP_NET_CFG_CTRL, nn->dp.ctrl);
err = nfp_net_reconfig(nn, NFP_NET_CFG_UPDATE_GEN);
if (err)
nn_err(nn, "FW command error while enabling BPF: %d\n", err);
-
-   dma_free_coherent(nn->dp.dev, code_sz, code, dma_addr);
 }
 
 static int nfp_net_bpf_stop(struct nfp_net *nn)
@@ -127,13 +129,12 @@ static int nfp_net_bpf_stop(struct nfp_net *nn)
 
nn->dp.ctrl &= ~NFP_NET_CFG_CTRL_BPF;
nn_writel(nn, NFP_NET_CFG_CTRL, nn->dp.ctrl);
-   nn->dp.bpf_offload_skip_sw = 0;
 
return nfp_net_reconfig(nn, NFP_NET_CFG_UPDATE_GEN);
 }

[PATCH net-next v2 14/15] nfp: bpf: move to new BPF program offload infrastructure

2017-11-03 Thread Jakub Kicinski
Following steps are taken in the driver to offload an XDP program:

XDP_SETUP_PROG:
 * prepare:
   - allocate program state;
   - run verifier (bpf_analyzer());
   - run translation;
 * load:
   - stop old program if needed;
   - load program;
   - enable BPF if not enabled;
 * clean up:
   - free program image.

With new infrastructure the flow will look like this:

BPF_OFFLOAD_VERIFIER_PREP:
  - allocate program state;
BPF_OFFLOAD_TRANSLATE:
   - run translation;
XDP_SETUP_PROG:
   - stop old program if needed;
   - load program;
   - enable BPF if not enabled;
BPF_OFFLOAD_DESTROY:
   - free program image.

Take advantage of the new infrastructure.  Allocation of driver
metadata has to be moved from jit.c to offload.c since it's now
done at a different stage.  Since there is no separate driver
private data for verification step, move temporary nfp_meta
pointer into nfp_prog.  We will now use user space context
offsets.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c   | 35 -
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  4 +
 drivers/net/ethernet/netronome/nfp/bpf/main.h  | 15 +++-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c   | 85 ++
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c  | 43 ++-
 drivers/net/ethernet/netronome/nfp/nfp_app.h   | 37 ++
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  8 ++
 7 files changed, 121 insertions(+), 106 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index eae7a137a7a8..995e95410b11 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1427,19 +1427,18 @@ static int mem_ldx_skb(struct nfp_prog *nfp_prog, 
struct nfp_insn_meta *meta,
swreg dst = reg_both(meta->insn.dst_reg * 2);
 
switch (meta->insn.off) {
-   case offsetof(struct sk_buff, len):
-   if (size != FIELD_SIZEOF(struct sk_buff, len))
+   case offsetof(struct __sk_buff, len):
+   if (size != FIELD_SIZEOF(struct __sk_buff, len))
return -EOPNOTSUPP;
wrp_mov(nfp_prog, dst, plen_reg(nfp_prog));
break;
-   case offsetof(struct sk_buff, data):
-   if (size != sizeof(void *))
+   case offsetof(struct __sk_buff, data):
+   if (size != FIELD_SIZEOF(struct __sk_buff, data))
return -EOPNOTSUPP;
wrp_mov(nfp_prog, dst, pptr_reg(nfp_prog));
break;
-   case offsetof(struct sk_buff, cb) +
-offsetof(struct bpf_skb_data_end, data_end):
-   if (size != sizeof(void *))
+   case offsetof(struct __sk_buff, data_end):
+   if (size != FIELD_SIZEOF(struct __sk_buff, data_end))
return -EOPNOTSUPP;
emit_alu(nfp_prog, dst,
 plen_reg(nfp_prog), ALU_OP_ADD, pptr_reg(nfp_prog));
@@ -1458,14 +1457,15 @@ static int mem_ldx_xdp(struct nfp_prog *nfp_prog, 
struct nfp_insn_meta *meta,
 {
swreg dst = reg_both(meta->insn.dst_reg * 2);
 
-   if (size != sizeof(void *))
-   return -EINVAL;
-
switch (meta->insn.off) {
-   case offsetof(struct xdp_buff, data):
+   case offsetof(struct xdp_md, data):
+   if (size != FIELD_SIZEOF(struct xdp_md, data))
+   return -EOPNOTSUPP;
wrp_mov(nfp_prog, dst, pptr_reg(nfp_prog));
break;
-   case offsetof(struct xdp_buff, data_end):
+   case offsetof(struct xdp_md, data_end):
+   if (size != FIELD_SIZEOF(struct xdp_md, data_end))
+   return -EOPNOTSUPP;
emit_alu(nfp_prog, dst,
 plen_reg(nfp_prog), ALU_OP_ADD, pptr_reg(nfp_prog));
break;
@@ -2243,19 +2243,10 @@ static int nfp_bpf_ustore_calc(struct nfp_prog 
*nfp_prog, __le64 *ustore)
return 0;
 }
 
-/**
- * nfp_bpf_jit() - translate BPF code into NFP assembly
- * @nfp_prog:  nfp_prog prepared based on @filter
- * @filter:kernel BPF filter struct
- */
-int nfp_bpf_jit(struct nfp_prog *nfp_prog, struct bpf_prog *filter)
+int nfp_bpf_jit(struct nfp_prog *nfp_prog)
 {
int ret;
 
-   ret = nfp_prog_verify(nfp_prog, filter);
-   if (ret)
-   return ret;
-
ret = nfp_bpf_optimize(nfp_prog);
if (ret)
return ret;
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 7ae7528cd96b..e379b78e86ef 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -173,4 +173,8 @@ const struct nfp_app_type app_bpf = {
.setup_tc   = nfp_bpf_setup_tc,
.tc_busy= 

[PATCH net-next v2 00/15] bpf: add offload as a first class citizen

2017-11-03 Thread Jakub Kicinski
Hi!

This series is my stab at what was discussed at a recent IOvisor
bi-weekly call.  The idea is to make the device translator run at
the program load time.  This makes the offload more explicit to
the user space.  It also makes it easy for the device translator
to insert information into the original verifier log.

v2:
 - include linux/bug.h instead of asm/bug.h;
 - rebased on top of Craig's verifier fix (no changes, the last patch
   just removes more code now).  I checked the set doesn't conflict 
   with Jiri's, Josef's or Roman's patches, but missed Craig's fix :(
v1:
 - rename the ifindex member on load;
 - improve commit messages;
 - split nfp patches more.

Jakub Kicinski (15):
  net: bpf: rename ndo_xdp to ndo_bpf
  bpf: offload: add infrastructure for loading programs for a specific
netdev
  bpf: report offload info to user space
  bpftool: print program device bound info
  xdp: allow attaching programs loaded for specific device
  cls_bpf: allow attaching programs loaded for specific device
  nfp: bpf: drop support for cls_bpf with legacy actions
  nfp: bpf: remove the register renumbering leftovers
  nfp: bpf: remove unnecessary include of nfp_net.h
  nfp: bpf: refactor offload logic
  nfp: bpf: require seamless reload for program replace
  nfp: bpf: move program prepare and free into offload.c
  nfp: bpf: move translation prepare to offload.c
  nfp: bpf: move to new BPF program offload infrastructure
  bpf: remove old offload/analyzer

 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h  |   2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|   6 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   4 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   4 +-
 drivers/net/ethernet/netronome/nfp/bpf/jit.c   | 194 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  87 +++
 drivers/net/ethernet/netronome/nfp/bpf/main.h  |  59 ++---
 drivers/net/ethernet/netronome/nfp/bpf/offload.c   | 282 +
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c  |  54 +---
 drivers/net/ethernet/netronome/nfp/nfp_app.h   |  37 +++
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |   2 -
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  12 +-
 drivers/net/ethernet/qlogic/qede/qede.h|   2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c |   2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c   |   4 +-
 drivers/net/tun.c  |   4 +-
 drivers/net/virtio_net.c   |   4 +-
 include/linux/bpf.h|  47 
 include/linux/bpf_verifier.h   |  13 +-
 include/linux/netdevice.h  |  37 ++-
 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   1 +
 kernel/bpf/core.c  |  10 +-
 kernel/bpf/offload.c   | 194 ++
 kernel/bpf/syscall.c   |  52 +++-
 kernel/bpf/verifier.c  |  84 +-
 net/core/dev.c |  40 +--
 net/core/filter.c  |  42 ---
 net/core/rtnetlink.c   |   4 +-
 net/sched/cls_bpf.c|  10 +-
 tools/bpf/bpftool/prog.c   |  31 +++
 tools/include/uapi/linux/bpf.h |   7 +
 36 files changed, 686 insertions(+), 666 deletions(-)
 create mode 100644 kernel/bpf/offload.c

-- 
2.14.1



Re: [PATCH RFC,WIP 5/5] netfilter: nft_flow_offload: add ndo hooks for hardware offload

2017-11-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> +static void flow_offload_work(struct work_struct *work)
> +{
> + struct flow_hw_offload *offload, *next;
> +
> + spin_lock_bh(&flow_hw_offload_lock);
> + list_for_each_entry_safe(offload, next, &flow_hw_offload_pending_list, list) {
> + do_flow_offload(offload->flow);

This should not offload flows that already have the DYING bit set.

> + nf_conntrack_put(&offload->ct->ct_general);
> + list_del(&offload->list);
> + kfree(offload);
> + }
> + spin_unlock_bh(&flow_hw_offload_lock);
> +
> + queue_delayed_work(system_power_efficient_wq, &nf_flow_offload_dwork, HZ);
> +}

Missed this on first round, 1 second is quite large.

[..]

>  static int nft_flow_route(const struct nft_pktinfo *pkt,
> const struct nf_conn *ct,
> union flow_gateway *orig_gw,
> @@ -211,6 +290,7 @@ static void nft_flow_offload_eval(const struct nft_expr 
> *expr,
>   union flow_gateway orig_gateway, reply_gateway;
>   struct net_device *outdev = pkt->xt.state->out;
>   struct net_device *indev = pkt->xt.state->in;
> + struct flow_hw_offload *offload;
>   enum ip_conntrack_info ctinfo;
>   struct flow_offload *flow;
>   struct nf_conn *ct;
> @@ -250,6 +330,21 @@ static void nft_flow_offload_eval(const struct nft_expr 
> *expr,
>   if (ret < 0)
>   goto err2;
>  
> + if (!indev->netdev_ops->ndo_flow_add)
> + return;
> +
> + offload = kmalloc(sizeof(struct flow_hw_offload), GFP_ATOMIC);
> + if (!offload)
> + return;
> +
> + nf_conntrack_get(&ct->ct_general);
> + offload->ct = ct;
> + offload->flow = flow;
> +
> + spin_lock_bh(&flow_hw_offload_lock);
> + list_add_tail(&offload->list, &flow_hw_offload_pending_list);
> + spin_unlock_bh(&flow_hw_offload_lock);
> +
>   return;

So this aims for lazy offloading (up to 1 second delay).
Is this intentional, e.g. to avoid offloading short-lived 'RR' flows?

I would have expected this to schedule the workqueue here, and not use
delayed wq at all (i.e., also no self-rescheduling from
flow_offload_work()).


[PATCH net-next v2 04/15] bpftool: print program device bound info

2017-11-03 Thread Jakub Kicinski
If a program is bound to a device, print the name of the relevant
interface, or "unknown" if the netdev has since been removed.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/prog.c   | 31 +++
 tools/include/uapi/linux/bpf.h |  7 +++
 2 files changed, 38 insertions(+)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 250f80fd46aa..d3ab808dc882 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include <net/if.h>
 #include 
 #include 
 
@@ -229,6 +230,21 @@ static void print_prog_json(struct bpf_prog_info *info, 
int fd)
 info->tag[0], info->tag[1], info->tag[2], info->tag[3],
 info->tag[4], info->tag[5], info->tag[6], info->tag[7]);
 
+   if (info->status & BPF_PROG_STATUS_DEV_BOUND) {
+   jsonw_name(json_wtr, "dev");
+   if (info->ifindex) {
+   char name[IF_NAMESIZE];
+
+   if (!if_indextoname(info->ifindex, name))
+   jsonw_printf(json_wtr, "\"ifindex:%d\"",
+info->ifindex);
+   else
+   jsonw_printf(json_wtr, "\"%s\"", name);
+   } else {
+   jsonw_printf(json_wtr, "\"unknown\"");
+   }
+   }
+
if (info->load_time) {
char buf[32];
 
@@ -274,6 +290,21 @@ static void print_prog_plain(struct bpf_prog_info *info, 
int fd)
 
printf("tag ");
fprint_hex(stdout, info->tag, BPF_TAG_SIZE, "");
+   printf(" ");
+
+   if (info->status & BPF_PROG_STATUS_DEV_BOUND) {
+   printf("dev ");
+   if (info->ifindex) {
+   char name[IF_NAMESIZE];
+
+   if (!if_indextoname(info->ifindex, name))
+   printf("ifindex:%d ", info->ifindex);
+   else
+   printf("%s ", name);
+   } else {
+   printf("unknown ");
+   }
+   }
printf("\n");
 
if (info->load_time) {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7cebba491011..e92f62cf933a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -259,6 +259,7 @@ union bpf_attr {
__u32   kern_version;   /* checked when 
prog_type=kprobe */
__u32   prog_flags;
charprog_name[BPF_OBJ_NAME_LEN];
+   __u32   prog_target_ifindex;/* ifindex of netdev to 
prep for */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -893,6 +894,10 @@ enum sk_action {
 
 #define BPF_TAG_SIZE   8
 
+enum bpf_prog_status {
+   BPF_PROG_STATUS_DEV_BOUND   = (1 << 0),
+};
+
 struct bpf_prog_info {
__u32 type;
__u32 id;
@@ -906,6 +911,8 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u32 status;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
-- 
2.14.1



Re: [PATCH RFC,WIP 2/5] netfilter: add software flow offload infrastructure

2017-11-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> +static int __init nf_flow_offload_module_init(void)
> +{
> + struct rhashtable_params params = flow_offload_rhash_params;
> + struct nf_hook_ops flow_offload_hook = {
> + .hook   = nf_flow_offload_hook,
> + .pf = NFPROTO_NETDEV,
> + .hooknum= NF_NETDEV_INGRESS,
> + .priority   = -100,

Magic number.  Should this be documented in nft?

Alternatively we could reject NETDEV_INGRESS base chains from
userspace if prio < 0 to prevent userspace rules from messing
with this flow offload infrastructure.

I guess the rationale of using an auto-builtin hook is to avoid
forcing users to configure this with nftables rules?

> + rtnl_lock();
> + for_each_netdev(&init_net, dev) {
> + entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> + if (!entry) {
> + rtnl_unlock();
> + return -ENOMEM;

This would need error unwinding (Unregistering the already-registered
hooks).

> + err = nf_register_net_hook(&init_net, &entry->ops);
> + if (err < 0)
> + return err;

And here as well.


Re: next-20171103 build: 3 failures 22 warnings (next-20171103)

2017-11-03 Thread Arnd Bergmann
On Fri, Nov 3, 2017 at 8:27 PM, Masami Hiramatsu  wrote:
> On Fri, 3 Nov 2017 15:44:53 +0100 Arnd Bergmann  wrote:
>> On Fri, Nov 3, 2017 at 1:44 PM, Build bot for Mark Brown 
>>  wrote:
>>
>> > Warnings Summary: 22
>> >   2 ../net/sctp/probe.c:240:2: warning: 'unregister_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/sctp/probe.c:194:3: warning: 'register_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/sctp/probe.c:189:2: warning: 'register_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/ipv4/tcp_probe.c:298:2: warning: 'unregister_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/ipv4/tcp_probe.c:280:2: warning: 'register_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/dccp/probe.c:190:2: warning: 'unregister_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/dccp/probe.c:170:4: warning: 'register_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   2 ../net/dccp/probe.c:166:2: warning: 'register_jprobe' is 
>> > deprecated [-Wdeprecated-declarations]
>> >   1 ../arch/arm/probes/kprobes/test-core.c:398:2: warning: 
>> > 'unregister_jprobe' is deprecated [-Wdeprecated-declarations]
>> >   1 ../arch/arm/probes/kprobes/test-core.c:390:2: warning: 
>> > 'register_jprobe' is deprecated [-Wdeprecated-declarations]
>>
>> I need a little help from Masami Hiramatsu to understand what the plan is 
>> here:
>> Do we just need to remove those files now that jprobes are gone, or do
>> we actually want to restore the functionality using some other replacement 
>> code?
>>
>> I'm asking because the __deprecated warning seems unhelpful if there
>> isn't an easy way to address the warning.
>
> It seems that the arm/probes case is just for testing, I'll just remove it
> because its functionality is gone.

Makes sense, thanks!

> Others should be decided by network maintainers. Maybe those are not used 
> anymore,
> or should be rewritten by kprobes or ftrace.

Added a few people to cc that worked on {tcp,dccp,sctp}_probe in the past
for clarification. Do you already have plans to deal with this?

Arnd


Re: [patch net-next 3/5] net: sched: introduce block mechanism to handle netif_keep_dst calls

2017-11-03 Thread Daniel Borkmann

On 11/03/2017 06:19 PM, Jiri Pirko wrote:

From: Jiri Pirko 

A couple of classifiers call netif_keep_dst directly on q->dev. That is
not possible to do for a shared block where multiple qdiscs own the
block. So introduce an infrastructure to keep track of the block owners
in a list and use this list to implement a block variant of
netif_keep_dst.

Signed-off-by: Jiri Pirko 

[...]

+struct tcf_block_owner_item {
+   struct list_head list;
+   struct Qdisc *q;
+   enum tcf_block_binder_type binder_type;
+};
+
+static void
+tcf_block_owner_netif_keep_dst(struct tcf_block *block,
+  struct Qdisc *q,
+  enum tcf_block_binder_type binder_type)
+{
+   if (block->keep_dst &&
+   binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)


Why do we need to keep dst on TCF_BLOCK_BINDER_TYPE_CLSACT_EGRESS?
I presume this enum means sch_handle_egress()? dst is dropped
later ...


+   netif_keep_dst(qdisc_dev(q));
+}
+
+void tcf_block_netif_keep_dst(struct tcf_block *block)
+{
+   struct tcf_block_owner_item *item;
+
+   block->keep_dst = true;
+   list_for_each_entry(item, &block->owner_list, list)




[PATCH] ath9k: dfs: use swap macro in ath9k_check_chirping

2017-11-03 Thread Gustavo A. R. Silva
Make use of the swap macro and remove unnecessary variable temp.
This makes the code easier to read and maintain.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/wireless/ath/ath9k/dfs.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/dfs.c 
b/drivers/net/wireless/ath/ath9k/dfs.c
index 40a397f..6fee9a4 100644
--- a/drivers/net/wireless/ath/ath9k/dfs.c
+++ b/drivers/net/wireless/ath/ath9k/dfs.c
@@ -123,11 +123,9 @@ static bool ath9k_check_chirping(struct ath_softc *sc, u8 
*data,
fft = (struct ath9k_dfs_fft_40 *) (data + 2);
ath_dbg(common, DFS, "fixing datalen by 2\n");
}
-   if (IS_CHAN_HT40MINUS(ah->curchan)) {
-   int temp = is_ctl;
-   is_ctl = is_ext;
-   is_ext = temp;
-   }
+   if (IS_CHAN_HT40MINUS(ah->curchan))
+   swap(is_ctl, is_ext);
+
for (i = 0; i < FFT_NUM_SAMPLES; i++)
max_bin[i] = ath9k_get_max_index_ht40(fft + i, is_ctl,
  is_ext);
-- 
2.7.4
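For readers unfamiliar with it, the kernel's swap() macro exchanges two values through a typeof-based temporary, which is exactly what the removed open-coded block did. A userspace approximation (the macro below mimics, but is not, the kernel definition):

```c
#include <assert.h>

/* Userspace approximation of the kernel's swap() macro; the kernel
 * version uses typeof() the same way. */
#define swap(a, b) \
    do { __typeof__(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

/* Mirrors the ath9k change: swap the control/extension channel
 * indices when operating on an HT40- channel. */
static void maybe_swap(int ht40minus, int *is_ctl, int *is_ext)
{
    if (ht40minus)
        swap(*is_ctl, *is_ext);
}
```

Because the temporary lives inside the macro's own do/while scope, the caller no longer needs to declare one, which is the readability win the patch claims.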



[PATCH] Net: netfilter: vmalloc/vfree to kvmalloc/kvfree

2017-11-03 Thread Charlie Sale
Fix the FIXME in htable_create() by switching the memory allocation
and freeing from vmalloc()/vfree() to kvmalloc()/kvfree(). Changes are
made throughout the file to account for the different allocation in
htable_create().
Small note: this replaces an earlier patch that did not work. Enough
has changed that I thought I would submit a new patch.

Signed-off-by: Charlie Sale 
---
 net/netfilter/xt_hashlimit.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 5da8746f7b88..28ad74b3e3d0 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -286,9 +286,9 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg,
if (size < 16)
size = 16;
}
-   /* FIXME: don't use vmalloc() here or anywhere else -HW */
-   hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) +
-   sizeof(struct hlist_head) * size);
+
+   hinfo = kvmalloc(sizeof(*hinfo) + sizeof(struct hlist_head) * size,
+GFP_KERNEL);
if (hinfo == NULL)
return -ENOMEM;
*out_hinfo = hinfo;
@@ -314,7 +314,7 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg,
hinfo->rnd_initialized = false;
hinfo->name = kstrdup(name, GFP_KERNEL);
if (!hinfo->name) {
-   vfree(hinfo);
+   kvfree(hinfo);
return -ENOMEM;
}
	spin_lock_init(&hinfo->lock);
@@ -336,7 +336,7 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg,
fops, hinfo);
if (hinfo->pde == NULL) {
kfree(hinfo->name);
-   vfree(hinfo);
+   kvfree(hinfo);
return -ENOMEM;
}
hinfo->net = net;
@@ -414,7 +414,7 @@ static void htable_destroy(struct xt_hashlimit_htable *hinfo)
htable_remove_proc_entry(hinfo);
htable_selective_cleanup(hinfo, select_all);
kfree(hinfo->name);
-   vfree(hinfo);
+   kvfree(hinfo);
 }
 
 static struct xt_hashlimit_htable *htable_find_get(struct net *net,
-- 
2.13.6
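The behaviour the patch relies on, kvmalloc() attempting a contiguous (kmalloc-style) allocation and falling back to vmalloc() for large sizes, with kvfree() releasing either kind, can be modeled in userspace. toy_kvmalloc/toy_kvfree and the fixed threshold below are inventions for illustration; real kvmalloc() decides based on size, flags and page-allocator failure:

```c
#include <assert.h>
#include <stdlib.h>

#define FAKE_KMALLOC_MAX (4096 * 4)   /* arbitrary stand-in threshold */

/* Toy model of kvmalloc(): contiguous ("kmalloc") allocation for small
 * sizes, fallback allocator ("vmalloc") for larger ones.  Both paths
 * are plain malloc() here; only the bookkeeping flag differs. */
static void *toy_kvmalloc(size_t size, int *used_vmalloc)
{
    *used_vmalloc = size > FAKE_KMALLOC_MAX;
    return malloc(size);
}

/* kvfree() checks is_vmalloc_addr() internally, so callers never need
 * to remember which allocator succeeded -- the property that lets the
 * patch replace every vfree(hinfo) with kvfree(hinfo) unconditionally. */
static void toy_kvfree(void *p)
{
    free(p);
}
```

This is why the conversion is safe even though the hash table size varies at runtime: the free path is uniform regardless of which allocation path was taken.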



Re: [PATCH net-next 5/6] net: hns3: add support for nway_reset

2017-11-03 Thread Florian Fainelli
On 11/02/2017 09:18 PM, Lipeng wrote:
> From: Fuyun Liang 
> 
> This patch adds nway_reset support for ethtool cmd.
> 
> Signed-off-by: Fuyun Liang 
> Signed-off-by: Lipeng 
> ---
>  .../net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c  | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> index 7fe193b..a21470c 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> @@ -832,6 +832,23 @@ static int hns3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
>   }
>  }
>  
> +static int hns3_nway_reset(struct net_device *netdev)
> +{
> + struct phy_device *phy = netdev->phydev;
> +
> + if (!netif_running(netdev))
> + return 0;
> +
> + /* Only support nway_reset for netdev with phy attached for now */
> + if (!phy)
> + return -EOPNOTSUPP;
> +
> + if (phy->autoneg != AUTONEG_ENABLE)
> + return -EINVAL;
> +
> + return genphy_restart_aneg(phy);

Consider using phy_ethtool_nway_reset() which properly checks for
phydev->drv (you don't). phy_restart_aneg() already checks for
phydev->autoneg.
-- 
Florian
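Following Florian's suggestion, the handler would reduce to roughly the following (an untested sketch, not compiled here; phy_ethtool_nway_reset() takes the net_device and performs the phydev and driver checks itself):

```c
static int hns3_nway_reset(struct net_device *netdev)
{
	if (!netif_running(netdev))
		return 0;

	/* phy_ethtool_nway_reset() returns -EOPNOTSUPP when no PHY is
	 * attached and an error when the PHY driver is gone, and it
	 * leaves the autoneg check to phy_restart_aneg(). */
	return phy_ethtool_nway_reset(netdev);
}
```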


Re: [PATCH net-next 4/6] net: hns3: add support for set_link_ksettings

2017-11-03 Thread Florian Fainelli
On 11/02/2017 09:18 PM, Lipeng wrote:
> From: Fuyun Liang 
> 
> This patch adds set_link_ksettings support for ethtool cmd.
> 
> Signed-off-by: Fuyun Liang 
> Signed-off-by: Lipeng 
> ---
>  drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> index c7b8ebd..7fe193b 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
> @@ -653,6 +653,16 @@ static int hns3_get_link_ksettings(struct net_device *netdev,
>   return 0;
>  }
>  
> +static int hns3_set_link_ksettings(struct net_device *netdev,
> +const struct ethtool_link_ksettings *cmd)
> +{
> + /* Only support ksettings_set for netdev with phy attached for now */
> + if (netdev->phydev)
> + return phy_ethtool_ksettings_set(netdev->phydev, cmd);
> +
> + return -EOPNOTSUPP;

Consider using phy_ethtool_set_link_ksettings() which already checks for
netdev->phydev.
-- 
Florian


Re: [PATCH RFC,WIP 3/5] netfilter: nf_flow_offload: integration with conntrack

2017-11-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> This patch adds the IPS_OFFLOAD status bit, this new bit tells us that
> the conntrack entry is owned by the flow offload infrastructure. The
> timer of such conntrack entries is stopped - the conntrack garbage
> collector skips them - and they display no internal state in the case of
> TCP flows.
>
> Conntrack entries that have been offloaded to the flow table
> infrastructure cannot be deleted/flushed via ctnetlink. The flow table
> infrastructure is also responsible for releasing this conntrack entry.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
> Instead of nf_flow_release_ct(), I'd rather keep a pointer reference to
> the conntrack object from the flow_offload entry, so we can skip the
> conntrack look up.

I agree, this would make sense.

> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index 8f3bd30511de..9af4bb0c2f46 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -272,7 +272,8 @@ static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
>  
>  static inline bool nf_ct_is_expired(const struct nf_conn *ct)
>  {
> - return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
> + return (__s32)(ct->timeout - nfct_time_stamp) <= 0 &&
> +!test_bit(IPS_OFFLOAD_BIT, &ct->status);

An alternative would be to not touch nf_ct_is_expired() and instead ...
>  }
>  
> @@ -1011,12 +1014,14 @@ static void gc_worker(struct work_struct *work)
>   tmp = nf_ct_tuplehash_to_ctrack(h);
>  
>   scanned++;
> + if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
> + continue;
 
... advance/refresh ct->timeout from gc worker, i.e.

 if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
 tmp->timeout = nfct_time_stamp + (1 DAY);
 continue;
 }

Would prevent normal path to ever see offloaded entry
as 'timed out', without having to check for the flag in lookup path
(OTOH the check should not be an issue either because lookup path
 has to access ct->status anyway).
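The trade-off Florian sketches can be modeled in miniature. struct conn, is_expired and gc_scan below are illustrative stand-ins, not the conntrack API; the point is that refreshing the timeout from the gc worker keeps offloaded entries from ever looking expired, without touching the lookup path:

```c
#include <assert.h>
#include <stdbool.h>

struct conn {
    long timeout;      /* absolute expiry, like ct->timeout */
    bool offloaded;    /* models IPS_OFFLOAD_BIT in ct->status */
};

/* Unmodified nf_ct_is_expired() analogue: a pure timestamp check,
 * with no knowledge of the offload flag. */
static bool is_expired(const struct conn *ct, long now)
{
    return ct->timeout - now <= 0;
}

/* Florian's alternative: the gc worker refreshes offloaded entries
 * instead of the lookup path checking the flag.  "one_day" is whatever
 * refresh interval the real code would pick. */
static void gc_scan(struct conn *ct, long now, long one_day)
{
    if (ct->offloaded) {
        ct->timeout = now + one_day;
        return;                 /* never reaped while offloaded */
    }
    /* non-offloaded entries would be reaped here when expired */
}
```

As the thread notes, either variant is cheap, since the lookup path touches ct->status anyway; the gc-refresh version just keeps the expiry check branch-free.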



Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-03 Thread Alex Sidorenko

Indeed, we do not print slave's ->link_new_state on each entry - so it is quite 
possible that we are at stage 6.

It is even possible that this has something to do with how NM initially created 
bonds.
Customer says that the problem occurs once only after host reboot, after that
failover works fine no matter how many times he changes the state of
VirtualConnect modules.

Jarod,

could you please add printing slave->link_new_state for both slaves at each
entry to bond_miimon_inspect?

(and instead of nudging slave->new_link like I suggested, use Jay's patch).

Alex


On 11/03/2017 02:26 PM, Jay Vosburgh wrote:

Alex Sidorenko  wrote:


Jay,

while the scenario you describe makes sense, it does not match what we see in our 
tests.

The instrumentation prints info every time we enter bond_mii_monitor(), 
bond_miimon_inspect(),
bond_miimon_commit() and every time we are committing link state. And we print 
a message every time we
propose BOND_LINK_FAIL in bond_miimon_inspect()

So in your scenario,

2) bond_mii_monitor rtnl_trylock fails, it reschedules

we should see bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging message 
without matching
'commit' messages. But what we see in reality is that for each 'proposed' there 
is 'commit' message.
(And we don't expect ens3f1 to really go down when VC module is rebooted - it 
is not connected to it).

Here is debugging output (with the fix I have suggested in my first email 
*applied*),
my comments inline.

  (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print 
slave->link when entering bond_mii_monitor)

***

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed 
BOND_LINK_FAIL on slave ens3f0
 /*FALLTHRU*/
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_FAIL case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set 
new_link=BOND_LINK_DOWN on slave ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero

   As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 
has good link_state and we
   do not reach BOND_LINK_FAIL fallthru and do not propose anything
   (otherwise there would be debugging output for it)

   Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing link 
states - and suddenly
   we have state 0->1 for interface ens3f1 as well as for ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
interface ens3f0, 0 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface 
ens3f0 now: 1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
interface ens3f1, 2 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface 
ens3f1 now: 1

Does your instrumentation show each slave's ->link_new_state at
each entry to bond_miimon_inspect?  Not just when commit, et al, is
called, but the actual value of the variable each time through the
function?  Or maybe you've got an entry for the propose way back in the
log somewhere?

I'm wondering here whether you're seeing "step 6" from the
failure path I described, i.e., the slave->link_new_state on ens3f1 was
set some time previously (perhaps a long time) and has been silently
pending until some event happens on the other slave to trigger a commit
cycle.

Something had to have set the variable, and from your
instrumentation, it does not appear that it was the immediately prior
instance of bond_miimon_inspect.  In net-next's bonding driver, nothing
other than the _propose call will enter BOND_LINK_FAIL state.

-J


   Entering bond_miimon_commit()
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 

[PATCH net-next] liquidio: Fix an issue with multiple switchdev enable disables

2017-11-03 Thread Felix Manlunas
From: Vijaya Mohan Guvva 

Return success if the same dispatch function is being registered for
a given opcode and subcode, thereby allowing multiple switchdev enables
and disables.

Signed-off-by: Vijaya Mohan Guvva 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/octeon_device.c | 4 ++++
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c   | 4 ++--
 drivers/net/ethernet/cavium/liquidio/octeon_droq.h   | 3 +++
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.c b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
index e4aa339..2c615ab 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
@@ -1180,6 +1180,10 @@ void octeon_delete_dispatch_list(struct octeon_device *oct)
		spin_unlock_bh(&oct->dispatch.lock);
 
} else {
+   if (pfn == fn &&
+   octeon_get_dispatch_arg(oct, opcode, subcode) == fn_arg)
+   return 0;
+
		dev_err(&oct->pci_dev->dev,
			"Found previously registered dispatch fn for opcode/subcode: %x/%x\n",
			opcode, subcode);
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.c b/drivers/net/ethernet/cavium/liquidio/octeon_droq.c
index 9372d4c..3461d65 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.c
@@ -52,8 +52,8 @@ struct __dispatch {
  *  @return  Failure: NULL
  *
  */
-static inline void *octeon_get_dispatch_arg(struct octeon_device *octeon_dev,
-   u16 opcode, u16 subcode)
+void *octeon_get_dispatch_arg(struct octeon_device *octeon_dev,
+ u16 opcode, u16 subcode)
 {
int idx;
struct list_head *dispatch;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
index f91bc84..815a9f5 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
@@ -400,6 +400,9 @@ int octeon_register_dispatch_fn(struct octeon_device *oct,
u16 subcode,
octeon_dispatch_fn_t fn, void *fn_arg);
 
+void *octeon_get_dispatch_arg(struct octeon_device *oct,
+ u16 opcode, u16 subcode);
+
 void octeon_droq_print_stats(void);
 
 u32 octeon_droq_check_hw_for_pkts(struct octeon_droq *droq);
-- 
1.8.3.1
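The idempotent-registration rule the patch introduces can be sketched outside the driver. register_dispatch and struct slot below are hypothetical stand-ins for octeon_register_dispatch_fn() and its dispatch list:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*dispatch_fn_t)(void *arg);

struct slot {
    int opcode, subcode;
    dispatch_fn_t fn;
    void *arg;
};

static struct slot table[8];
static int nslots;

static void dummy(void *arg) { (void)arg; }

/* After the patch, registering the exact same fn/arg pair for an
 * opcode/subcode is a no-op success, so repeated switchdev
 * enable/disable cycles no longer fail; a *different* handler for an
 * occupied slot is still rejected. */
static int register_dispatch(int op, int sub, dispatch_fn_t fn, void *arg)
{
    for (int i = 0; i < nslots; i++) {
        if (table[i].opcode == op && table[i].subcode == sub) {
            if (table[i].fn == fn && table[i].arg == arg)
                return 0;      /* same registration: allow */
            return -1;         /* conflicting handler: reject */
        }
    }
    table[nslots++] = (struct slot){ op, sub, fn, arg };
    return 0;
}
```

Exposing octeon_get_dispatch_arg() (previously static) is what lets the real code compare the stored argument as well as the function pointer.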



Re: [PATCH net] tcp: fix tcp_mtu_probe() vs highest_sack

2017-11-03 Thread Oleksandr Natalenko
Hi.

Thanks for the fix.

However, tcp_fastretrans_alert() warning case still remains open even with 
this patch. Do I understand correctly that these are 2 different issues?

Currently, I use latest 4.13 stable kernel + this patch and still get:

WARNING: CPU: 1 PID: 736 at net/ipv4/tcp_input.c:2826 tcp_fastretrans_alert+0x7c8/0x990

Any idea on this?

On Tuesday 31 October 2017 at 7:08:20 CET Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Based on SNMP values provided by Roman, Yuchung made the observation
> that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
> 
> Looking at tcp_mtu_probe(), I found that when a new skb was placed
> in front of the write queue, we were not updating tcp highest sack.
> 
> If one skb is freed because all its content was copied to the new skb
> (for MTU probing), then tp->highest_sack could point to a now freed skb.
> 
> Bad things would then happen, including infinite loops.
> 
> This patch renames tcp_highest_sack_combine() and uses it
> from tcp_mtu_probe() to fix the bug.
> 
> Note that I also removed one test against tp->sacked_out,
> since we want to replace tp->highest_sack regardless of whatever
> condition, since keeping a stale pointer to freed skb is a recipe
> for disaster.
> 
> Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
> Signed-off-by: Eric Dumazet 
> Reported-by: Alexei Starovoitov 
> Reported-by: Roman Gushchin 
> Reported-by: Oleksandr Natalenko 
> ---
>  include/net/tcp.h |6 +++---
>  net/ipv4/tcp_output.c |3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 33599d17522d6a19b9d9a316cc1579cd5e71ee32..e6d0002a1b0bc5f28c331a760823c8dc92f8fe24 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1771,12 +1771,12 @@ static inline void tcp_highest_sack_reset(struct
> sock *sk) tcp_sk(sk)->highest_sack = tcp_write_queue_head(sk);
>  }
> 
> -/* Called when old skb is about to be deleted (to be combined with new skb)
> */ -static inline void tcp_highest_sack_combine(struct sock *sk,
> +/* Called when old skb is about to be deleted and replaced by new skb */
> +static inline void tcp_highest_sack_replace(struct sock *sk,
>   struct sk_buff *old,
>   struct sk_buff *new)
>  {
> - if (tcp_sk(sk)->sacked_out && (old == tcp_sk(sk)->highest_sack))
> + if (old == tcp_highest_sack(sk))
>   tcp_sk(sk)->highest_sack = new;
>  }
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index ae60dd3faed0adc71731bc686f878afd4c628d32..823003eef3a21a5cc5c27e0be9f46159afa060df 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2062,6 +2062,7 @@ static int tcp_mtu_probe(struct sock *sk)
>   nskb->ip_summed = skb->ip_summed;
> 
>   tcp_insert_write_queue_before(nskb, skb, sk);
> + tcp_highest_sack_replace(sk, skb, nskb);
> 
>   len = 0;
>   tcp_for_write_queue_from_safe(skb, next, sk) {
> @@ -2665,7 +2666,7 @@ static bool tcp_collapse_retrans(struct sock *sk,
> struct sk_buff *skb) else if (!skb_shift(skb, next_skb, next_skb_size))
>   return false;
>   }
> - tcp_highest_sack_combine(sk, next_skb, skb);
> + tcp_highest_sack_replace(sk, next_skb, skb);
> 
>   tcp_unlink_write_queue(next_skb, sk);




Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
On Fri, Nov 3, 2017 at 1:58 PM, Eric Dumazet  wrote:
> On Fri, 2017-11-03 at 13:23 -0400, Vitaly Davidovich wrote:
>> On Fri, Nov 3, 2017 at 12:05 PM, Eric Dumazet  wrote:
>> > On Fri, 2017-11-03 at 11:13 -0400, Vitaly Davidovich wrote:
>> >> Ok, an interesting finding.  The client was originally running with
>> >> SO_RCVBUF of 75K (apparently someone decided to set that for some
>> >> unknown reason).  I tried the test with a 1MB recv buffer and
>> >> everything works perfectly! The client responds with 0 window alerts,
>> >> the server just hits the persist condition and sends keep-alive
>> >> probes; the client continues answering with a 0 window up until it
>> >> wakes up and starts processing data in its receive buffer.  At that
>> >> point, the window opens up and the server sends more data.  Basically,
>> >> things look as one would expect in this situation :).
>> >>
>> >> /proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
>> >> conversation flows normally, as described above, when I change the
>> >> client's recv buf size to 1048576.  I also tried 131072, but that
>> >> doesn't work - same retrans/no ACKs situation.
>> >>
>> >> I think this eliminates (right?) any middleware from the equation.
>> >> Instead, perhaps it's some bad interaction between a low recv buf size
>> >> and either some other TCP setting or TSO mechanics (LRO specifically).
>> >> Still investigating further.
>> >
>> > Just in case, have you tried a more recent linux kernel ?
>> I haven't but will look into that.  I was mostly hoping to see if
>> anyone perhaps has seen similar symptoms/behavior and figured out what
>> the root cause is - just a stab in the dark with the well-informed
>> folks on this list :).  As of right now, based on the fact that a 1MB
>> recv buffer works, I would surmise the issue is perhaps some poor
>> interaction between a lower recv buffer size and some other tcp
>> settings.  But I'm just speculating - will continue investigating, and
>> I'll update this thread if I get to the bottom of it.
>> >
>> > I would rather not spend time on some problem that might already be
>> > fixed.
>> Completely understandable - I really appreciate the tips and pointers
>> thus far Eric, they've been helpful in their own right.
>
> I am interested to see if the issue with small sk_rcvbuf is still there.
>
> We have an upcoming change to rcvbuf autotuning to not blindly give
> tcp_rmem[2] to all sockets, but use a function based on RTT.
>
> Meaning that local flows could use small sk_rcvbuf instead of inflated
> ones.
>
> And meaning that we could increase tcp_rmem[2] to better match modern
> capabilities (more memory on hosts, larger BDP)

So Eric, while I still have your interest here (although I know it's
waning :)), any code pointers to where I might look to see if a
specific small-ish rcv buf size may interact poorly with the rest of
the stack? Is it possible some buffer was starved in the client stack
which prevented it from sending any segments to the server? Maybe the
incoming retrans were actually dropped somewhere in the ingress pkt
processing and so the stack doesn't know it needs to react to
something? Pulling at straws here but clearly the recv buf size, and a
somewhat small one at that, has some play.

I checked dmesg (just in case something would pop up there) but didn't
observe any warnings or anything interesting.

>
>
>


Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-03 Thread Jay Vosburgh
Alex Sidorenko  wrote:

>Jay,
>
>while the scenario you describe makes sense, it does not match what we see in our 
>tests.
>
>The instrumentation prints info every time we enter bond_mii_monitor(), 
>bond_miimon_inspect(),
>bond_miimon_commit() and every time we are committing link state. And we print 
>a message every time we
>propose BOND_LINK_FAIL in bond_miimon_inspect()
>
>So in your scenario,
>
>2) bond_mii_monitor rtnl_trylock fails, it reschedules
>
>we should see bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging 
>message without matching
>'commit' messages. But what we see in reality is that for each 'proposed' 
>there is 'commit' message.
>(And we don't expect ens3f1 to really go down when VC module is rebooted - it 
>is not connected to it).
>
>Here is debugging output (with the fix I have suggested in my first email 
>*applied*),
>my comments inline.
>
>  (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print 
> slave->link when entering bond_mii_monitor)
>
>***
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f1
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
>state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed 
>BOND_LINK_FAIL on slave ens3f0
> /*FALLTHRU*/
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_FAIL case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set 
>new_link=BOND_LINK_DOWN on slave ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>
>   As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 
> has good link_state and we
>   do not reach BOND_LINK_FAIL fallthru and do not propose anything
>   (otherwise there would be debugging output for it)
>
>   Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing 
> link states - and suddenly
>   we have state 0->1 for interface ens3f1 as well as for ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
>interface ens3f0, 0 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for 
>interface ens3f0 now: 1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
>interface ens3f1, 2 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for 
>interface ens3f1 now: 1

Does your instrumentation show each slave's ->link_new_state at
each entry to bond_miimon_inspect?  Not just when commit, et al, is
called, but the actual value of the variable each time through the
function?  Or maybe you've got an entry for the propose way back in the
log somewhere?

I'm wondering here whether you're seeing "step 6" from the
failure path I described, i.e., the slave->link_new_state on ens3f1 was
set some time previously (perhaps a long time) and has been silently
pending until some event happens on the other slave to trigger a commit
cycle.

Something had to have set the variable, and from your
instrumentation, it does not appear that it was the immediately prior
instance of bond_miimon_inspect.  In net-next's bonding driver, nothing
other than the _propose call will enter BOND_LINK_FAIL state.

-J

>   Entering bond_miimon_commit()
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
>ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely down for 
>interface ens3f0, disabling it
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, 
>up: false, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, 
>up: true, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave 
>found, bond failure imminent
>Oct 31 09:09:25 SYDC1LNX 



Re: [jkirsher/next-queue PATCH 3/5] ixgbe: Fix handling of macvlan Tx offload

2017-11-03 Thread Alexander Duyck
On Thu, Nov 2, 2017 at 4:34 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> This update makes it so that we report the actual number of Tx queues via
> real_num_tx_queues but are still restricted to RSS on only the first pool
> by setting num_tc equal to 1. Doing this locks us into only having the
> ability to setup XPS on the queues in that pool, and only those queues
> should be used for transmitting anything other than macvlan traffic.
>
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 69ef35d13c36..b22ec4b9d02c 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -6638,8 +6638,9 @@ int ixgbe_open(struct net_device *netdev)
> goto err_req_irq;
>
> /* Notify the stack of the actual queue counts. */
> -   if (adapter->num_rx_pools > 1)
> -   queues = adapter->num_rx_queues_per_pool;
> +   if (adapter->num_rx_pools > 1 &&
> +   adapter->num_tx_queues > IXGBE_MAX_L2A_QUEUES)
> +   queues = IXGBE_MAX_L2A_QUEUES;
> else
> queues = adapter->num_tx_queues;
>

I'm going to need to redo this portion of the patch anyway. It turns
out I was only enabling up to 4 queues (which worked for my test case
with 2 macvlan interfaces) because I didn't quite grok what this was
doing.

> @@ -8901,6 +8902,18 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
> if (adapter->hw.mac.type == ixgbe_mac_82598EB)
> adapter->hw.fc.requested_mode = 
> adapter->last_lfc_mode;
>
> +   /* To support macvlan offload we have to use num_tc to
> +* restrict the queues that can be used by the device.
> +* By doing this we can avoid reporting a false number of
> +* queues.
> +*/
> +   if (adapter->num_rx_pools > 1) {
> +   u16 qpp = adapter->num_rx_queues_per_pool;
> +
> +   netdev_set_num_tc(dev, 1);
> +   netdev_set_tc_queue(dev, 0, qpp, 0);
> +   }
> +
> adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
>
> adapter->temp_dcb_cfg.pfc_mode_enable = false;
>




Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 13:23 -0400, Vitaly Davidovich wrote:
> On Fri, Nov 3, 2017 at 12:05 PM, Eric Dumazet  wrote:
> > On Fri, 2017-11-03 at 11:13 -0400, Vitaly Davidovich wrote:
> >> Ok, an interesting finding.  The client was originally running with
> >> SO_RCVBUF of 75K (apparently someone decided to set that for some
> >> unknown reason).  I tried the test with a 1MB recv buffer and
> >> everything works perfectly! The client responds with 0 window alerts,
> >> the server just hits the persist condition and sends keep-alive
> >> probes; the client continues answering with a 0 window up until it
> >> wakes up and starts processing data in its receive buffer.  At that
> >> point, the window opens up and the server sends more data.  Basically,
> >> things look as one would expect in this situation :).
> >>
> >> /proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
> >> conversation flows normally, as described above, when I change the
> >> client's recv buf size to 1048576.  I also tried 131072, but that
> >> doesn't work - same retrans/no ACKs situation.
> >>
> >> I think this eliminates (right?) any middleware from the equation.
> >> Instead, perhaps it's some bad interaction between a low recv buf size
> >> and either some other TCP setting or TSO mechanics (LRO specifically).
> >> Still investigating further.
> >
> > Just in case, have you tried a more recent linux kernel ?
> I haven't but will look into that.  I was mostly hoping to see if
> anyone perhaps has seen similar symptoms/behavior and figured out what
> the root cause is - just a stab in the dark with the well-informed
> folks on this list :).  As of right now, based on the fact that a 1MB
> recv buffer works, I would surmise the issue is perhaps some poor
> interaction between a lower recv buffer size and some other tcp
> settings.  But I'm just speculating - will continue investigating, and
> I'll update this thread if I get to the bottom of it.
> >
> > I would rather not spend time on some problem that might already be
> > fixed.
> Completely understandable - I really appreciate the tips and pointers
> thus far Eric, they've been helpful in their own right.

I am interested to see if the issue with small sk_rcvbuf is still there.

We have an upcoming change to rcvbuf autotuning to not blindly give
tcp_rmem[2] to all sockets, but use a function based on RTT.

Meaning that local flows could use small sk_rcvbuf instead of inflated
ones.

And meaning that we could increase tcp_rmem[2] to better match modern
capabilities (more memory on hosts, larger BDP).





[PATCH net v5 1/2] ARM: dts: imx: name the interrupts for the fec ethernet driver

2017-11-03 Thread Troy Kisky
imx7s/imx7d have the newly added ptp interrupt as well.

For imx7, "int0" is the interrupt for queue 0 and ENET_MII,
"int1" is for queue 1, and
"int2" is for queue 2.

For imx6sx, "int0" handles all 3 queues and ENET_MII.

And of course, the "pps" interrupt is for the PTP_CLOCK_PPS interrupts.
This will help document what each interrupt does.

Signed-off-by: Troy Kisky 

---
v2: replaced empty names with "int0","int1", or "int2"

reordered imx7 interrupts so that "int0", for queue 0, is first.

v3: renamed "ptp" interrupt to "pps", added binding documentation
for interrupt-names.

v4: add blank, ie s/"int0","pps"/"int0", "pps"/ as suggested by Andrew Lunn

v5: moved binding documentation to next patch
---
 arch/arm/boot/dts/imx6qdl.dtsi | 1 +
 arch/arm/boot/dts/imx6sx.dtsi  | 2 ++
 arch/arm/boot/dts/imx6ul.dtsi  | 2 ++
 arch/arm/boot/dts/imx7d.dtsi   | 6 --
 arch/arm/boot/dts/imx7s.dtsi   | 6 --
 5 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/arm/boot/dts/imx6qdl.dtsi b/arch/arm/boot/dts/imx6qdl.dtsi
index 8884b4a3cafb..f97dee04b9be 100644
--- a/arch/arm/boot/dts/imx6qdl.dtsi
+++ b/arch/arm/boot/dts/imx6qdl.dtsi
@@ -1017,6 +1017,7 @@
fec: ethernet@02188000 {
compatible = "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0", "pps";
interrupts-extended =
<&gpc 0 118 IRQ_TYPE_LEVEL_HIGH>,
<&gpc 0 119 IRQ_TYPE_LEVEL_HIGH>;
diff --git a/arch/arm/boot/dts/imx6sx.dtsi b/arch/arm/boot/dts/imx6sx.dtsi
index 6c7eb54be9e2..7f3773b88c47 100644
--- a/arch/arm/boot/dts/imx6sx.dtsi
+++ b/arch/arm/boot/dts/imx6sx.dtsi
@@ -861,6 +861,7 @@
fec1: ethernet@02188000 {
compatible = "fsl,imx6sx-fec", "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0", "pps";
interrupts = ,
 ;
clocks = <&clks IMX6SX_CLK_ENET>,
@@ -970,6 +971,7 @@
fec2: ethernet@021b4000 {
compatible = "fsl,imx6sx-fec", "fsl,imx6q-fec";
reg = <0x021b4000 0x4000>;
+   interrupt-names = "int0", "pps";
interrupts = ,
 ;
clocks = <&clks IMX6SX_CLK_ENET>,
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi
index f11a241a340d..5ac00ff959b2 100644
--- a/arch/arm/boot/dts/imx6ul.dtsi
+++ b/arch/arm/boot/dts/imx6ul.dtsi
@@ -476,6 +476,7 @@
fec2: ethernet@020b4000 {
compatible = "fsl,imx6ul-fec", "fsl,imx6q-fec";
reg = <0x020b4000 0x4000>;
+   interrupt-names = "int0", "pps";
interrupts = ,
 ;
clocks = <&clks IMX6UL_CLK_ENET>,
@@ -775,6 +776,7 @@
fec1: ethernet@02188000 {
compatible = "fsl,imx6ul-fec", "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0", "pps";
interrupts = ,
 ;
clocks = <&clks IMX6UL_CLK_ENET>,
diff --git a/arch/arm/boot/dts/imx7d.dtsi b/arch/arm/boot/dts/imx7d.dtsi
index 4d308d17f040..8b9c8c0695df 100644
--- a/arch/arm/boot/dts/imx7d.dtsi
+++ b/arch/arm/boot/dts/imx7d.dtsi
@@ -114,9 +114,11 @@
fec2: ethernet@30bf0000 {
compatible = "fsl,imx7d-fec", "fsl,imx6sx-fec";
reg = <0x30bf0000 0x10000>;
-   interrupts = ,
+   interrupt-names = "int0", "int1", "int2", "pps";
+   interrupts = ,
+   ,
,
-   ;
+   ;
clocks = <&clks IMX7D_ENET_AXI_ROOT_CLK>,
<&clks IMX7D_ENET_AXI_ROOT_CLK>,
<&clks IMX7D_ENET2_TIME_ROOT_CLK>,
diff --git a/arch/arm/boot/dts/imx7s.dtsi b/arch/arm/boot/dts/imx7s.dtsi
index 82ad26e766eb..966b97fdc394 100644
--- a/arch/arm/boot/dts/imx7s.dtsi
+++ b/arch/arm/boot/dts/imx7s.dtsi
@@ -1007,9 +1007,11 @@
fec1: ethernet@30be0000 {
compatible = "fsl,imx7d-fec", "fsl,imx6sx-fec";
reg = <0x30be0000 0x10000>;
-   interrupts = ,
+   interrupt-names = "int0", "int1", "int2", "pps";
+   

[PATCH net v5 2/2] net: fec: Let fec_ptp have its own interrupt routine

2017-11-03 Thread Troy Kisky
This is better for code locality and should slightly
speed up normal interrupts.

This also allows PPS clock output to start working for i.mx7,
because i.mx7 was already at the limit of 3 interrupts and
needed another.

Signed-off-by: Troy Kisky 

---

v2: made this change independent of any devicetree change
so that old dtbs continue to work.

Continue to register ptp clock if interrupt is not found.

v3: renamed "ptp" interrupt to "pps" interrupt

v4: no change

v5: moving binding documentation to this patch
as requested by Shawn Guo
s/irq_index/irq_idx/
add function fec_enet_get_irq_cnt() to encapsulate if,
as requested by Andy Duan
---
 Documentation/devicetree/bindings/net/fsl-fec.txt | 13 
 drivers/net/ethernet/freescale/fec.h  |  3 +-
 drivers/net/ethernet/freescale/fec_main.c | 31 ++---
 drivers/net/ethernet/freescale/fec_ptp.c  | 82 +--
 4 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-fec.txt 
b/Documentation/devicetree/bindings/net/fsl-fec.txt
index 6f55bdd52f8a..f0dc94409107 100644
--- a/Documentation/devicetree/bindings/net/fsl-fec.txt
+++ b/Documentation/devicetree/bindings/net/fsl-fec.txt
@@ -34,6 +34,19 @@ Optional properties:
 - fsl,err006687-workaround-present: If present indicates that the system has
   the hardware workaround for ERR006687 applied and does not need a software
   workaround.
+ - interrupt-names: names of the interrupts listed in interrupts property in
+  the same order. The defaults if not specified are
+  __Number of interrupts__   __Default__
+   1   "int0"
+   2   "int0", "pps"
+   3   "int0", "int1", "int2"
+   4   "int0", "int1", "int2", "pps"
+  The order may be changed as long as they correspond to the interrupts
+  property. Currently, only i.mx7 uses "int1" and "int2". They correspond to
+  tx/rx queues 1 and 2. "int0" will be used for queue 0 and ENET_MII 
interrupts.
+  For imx6sx, "int0" handles all 3 queues and ENET_MII. "pps" is for the pulse
+  per second interrupt associated with 1588 precision time protocol (PTP).
+
 
 Optional subnodes:
 - mdio : specifies the mdio bus in the FEC, used as a container for phy nodes
diff --git a/drivers/net/ethernet/freescale/fec.h 
b/drivers/net/ethernet/freescale/fec.h
index ede1876a9a19..0af58991ca8f 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -582,12 +582,11 @@ struct fec_enet_private {
u64 ethtool_stats[0];
 };
 
-void fec_ptp_init(struct platform_device *pdev);
+void fec_ptp_init(struct platform_device *pdev, int irq_idx);
 void fec_ptp_stop(struct platform_device *pdev);
 void fec_ptp_start_cyclecounter(struct net_device *ndev);
 int fec_ptp_set(struct net_device *ndev, struct ifreq *ifr);
 int fec_ptp_get(struct net_device *ndev, struct ifreq *ifr);
-uint fec_ptp_check_pps_event(struct fec_enet_private *fep);
 
 //
 #endif /* FEC_H */
diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 3dc2d771a222..610573855213 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1602,10 +1602,6 @@ fec_enet_interrupt(int irq, void *dev_id)
ret = IRQ_HANDLED;
complete(&fep->mdio_done);
}
-
-   if (fep->ptp_clock)
-   if (fec_ptp_check_pps_event(fep))
-   ret = IRQ_HANDLED;
return ret;
 }
 
@@ -3312,6 +3308,19 @@ fec_enet_get_queue_num(struct platform_device *pdev, int 
*num_tx, int *num_rx)
 
 }
 
+static int fec_enet_get_irq_cnt(struct platform_device *pdev)
+{
+   int irq_cnt = platform_irq_count(pdev);
+
+   if (irq_cnt > FEC_IRQ_NUM)
+   irq_cnt = FEC_IRQ_NUM;  /* last for pps */
+   else if (irq_cnt == 2)
+   irq_cnt = 1;/* last for pps */
+   else if (irq_cnt <= 0)
+   irq_cnt = 1;/* At least 1 irq is needed */
+   return irq_cnt;
+}
+
 static int
 fec_probe(struct platform_device *pdev)
 {
@@ -3325,6 +3334,8 @@ fec_probe(struct platform_device *pdev)
struct device_node *np = pdev->dev.of_node, *phy_node;
int num_tx_qs;
int num_rx_qs;
+   char irq_name[8];
+   int irq_cnt;
 
fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
 
@@ -3465,18 +3476,20 @@ fec_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
 
+   irq_cnt = fec_enet_get_irq_cnt(pdev);
if (fep->bufdesc_ex)
-   fec_ptp_init(pdev);
+   fec_ptp_init(pdev, irq_cnt);
 
ret = fec_enet_init(ndev);
if (ret)
goto failed_init;
 
-   for (i = 0; i < FEC_IRQ_NUM; i++) {
-   

Re: [PATCH iproute2] Add "show" subcommand to "ip fou"

2017-11-03 Thread Tom Herbert
On Fri, Nov 3, 2017 at 10:19 AM, Greg Greenway  wrote:
> On Nov 1, 2017, at 2:03 PM, Stephen Hemminger  
> wrote:
>>
>> On Tue, 31 Oct 2017 13:00:47 -0700
>> Greg Greenway  wrote:
>>
>>> +if (tb[FOU_ATTR_AF]) {
>>> +family = rta_getattr_u8(tb[FOU_ATTR_AF]);
>>> +if (family == AF_INET)
>>> +family_str = "AF_INET";
>>> +else if (family == AF_INET6)
>>> +family_str = "AF_INET6";
>>> +else
>>> +family_str = "unknown";
>>> +fprintf(fp, "af %s ", family_str);
>>
>> The unwritten rule for ip commands is that the show function
>> must format the output with same command syntax as the other commands 
>> set/add/delete.
>> Since there is no "af AF_INET" option to ip fou, this breaks that convention.
>> Either ignore the address family, change the add command, or output with same
>> syntax (-6); preferably the latter.
>
> That makes sense.  Here's a corrected version.  It also avoids a 
> trailing-space in the output.
>
> From: Greg Greenway 
> Date: Tue, 31 Oct 2017 12:47:35 -0700
> Subject: [PATCH] Add "show" subcommand to "ip fou".
>
> Sample output:
>
> $ sudo ./ip/ip fou add port 111 ipproto 11
> $ sudo ./ip/ip fou add port 222 ipproto 22 -6
> $ ./ip/ip fou show
> port 222 ipproto 22 -6
> port 111 ipproto 11
>
> Signed-off-by: Greg Greenway 
> ---
>  ip/ipfou.c | 60 
>  1 file changed, 60 insertions(+)
>
> diff --git a/ip/ipfou.c b/ip/ipfou.c
> index 00dbe15..ecbaf11 100644
> --- a/ip/ipfou.c
> +++ b/ip/ipfou.c
> @@ -28,6 +28,7 @@ static void usage(void)
> fprintf(stderr, "Usage: ip fou add port PORT "
> "{ ipproto PROTO  | gue } [ -6 ]\n");
> fprintf(stderr, "   ip fou del port PORT [ -6 ]\n");
> +   fprintf(stderr, "   ip fou show\n");
> fprintf(stderr, "\n");
> fprintf(stderr, "Where: PROTO { ipproto-name | 1..255 }\n");
> fprintf(stderr, "   PORT { 1..65535 }\n");
> @@ -134,6 +135,63 @@ static int do_del(int argc, char **argv)
> return 0;
>  }
>
> +static int print_fou_mapping(const struct sockaddr_nl *who,
> +struct nlmsghdr *n, void *arg)
> +{
> +   FILE *fp = (FILE *)arg;
> +   struct genlmsghdr *ghdr;
> +   struct rtattr *tb[FOU_ATTR_MAX + 1];
> +   int len = n->nlmsg_len;
> +   unsigned family;
> +
> +   if (n->nlmsg_type != genl_family)
> +   return 0;
> +
> +   len -= NLMSG_LENGTH(GENL_HDRLEN);
> +   if (len < 0)
> +   return -1;
> +
> +   ghdr = NLMSG_DATA(n);
> +   parse_rtattr(tb, FOU_ATTR_MAX, (void *) ghdr + GENL_HDRLEN, len);
> +
> +   if (tb[FOU_ATTR_PORT])
> +   fprintf(fp, "port %u", 
> ntohs(rta_getattr_u16(tb[FOU_ATTR_PORT])));
> +   if (tb[FOU_ATTR_TYPE] && rta_getattr_u8(tb[FOU_ATTR_TYPE]) == 
> FOU_ENCAP_GUE)
> +   fprintf(fp, " gue");
> +   else if (tb[FOU_ATTR_IPPROTO])
> +   fprintf(fp, " ipproto %u", 
> rta_getattr_u8(tb[FOU_ATTR_IPPROTO]));
> +   if (tb[FOU_ATTR_AF]) {
> +   family = rta_getattr_u8(tb[FOU_ATTR_AF]);
> +   if (family == AF_INET6)
> +   fprintf(fp, " -6");
> +   }
> +   fprintf(fp, "\n");
> +
> +   return 0;
> +}
> +
> +static int do_show(int argc, char **argv)
> +{
> +   FOU_REQUEST(req, 4096, FOU_CMD_GET, NLM_F_REQUEST | NLM_F_DUMP);
> +
> +   if (argc > 0) {
> +   fprintf(stderr, "\"ip fou show\" does not take any 
> arguments.\n");
> +   return -1;
> +   }
> +
> +   if (rtnl_send(&genl_rth, &req, req.n.nlmsg_len) < 0) {
> +   perror("Cannot send show request");
> +   exit(1);
> +   }
> +
> +   if (rtnl_dump_filter(&genl_rth, print_fou_mapping, stdout) < 0) {
> +   fprintf(stderr, "Dump terminated\n");
> +   return 1;
> +   }
> +
> +   return 0;
> +}
> +
>  int do_ipfou(int argc, char **argv)
>  {
> if (argc < 1)
> @@ -149,6 +207,8 @@ int do_ipfou(int argc, char **argv)
> return do_add(argc-1, argv+1);
> if (matches(*argv, "delete") == 0)
> return do_del(argc-1, argv+1);
> +   if (matches(*argv, "show") == 0)
> +   return do_show(argc-1, argv+1);
> fprintf(stderr, "Command \"%s\" is unknown, try \"ip fou help\".\n", 
> *argv);
> exit(-1);
>  }
> --
> 2.7.4
>
Acked-by: Tom Herbert 


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
On Fri, Nov 3, 2017 at 12:05 PM, Eric Dumazet  wrote:
> On Fri, 2017-11-03 at 11:13 -0400, Vitaly Davidovich wrote:
>> Ok, an interesting finding.  The client was originally running with
>> SO_RCVBUF of 75K (apparently someone decided to set that for some
>> unknown reason).  I tried the test with a 1MB recv buffer and
>> everything works perfectly! The client responds with 0 window alerts,
>> the server just hits the persist condition and sends keep-alive
>> probes; the client continues answering with a 0 window up until it
>> wakes up and starts processing data in its receive buffer.  At that
>> point, the window opens up and the server sends more data.  Basically,
>> things look as one would expect in this situation :).
>>
>> /proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
>> conversation flows normally, as described above, when I change the
>> client's recv buf size to 1048576.  I also tried 131072, but that
>> doesn't work - same retrans/no ACKs situation.
>>
>> I think this eliminates (right?) any middleware from the equation.
>> Instead, perhaps it's some bad interaction between a low recv buf size
>> and either some other TCP setting or TSO mechanics (LRO specifically).
>> Still investigating further.
>
> Just in case, have you tried a more recent linux kernel ?
I haven't but will look into that.  I was mostly hoping to see if
anyone perhaps has seen similar symptoms/behavior and figured out what
the root cause is - just a stab in the dark with the well-informed
folks on this list :).  As of right now, based on the fact that a 1MB
recv buffer works, I would surmise the issue is perhaps some poor
interaction between a lower recv buffer size and some other tcp
settings.  But I'm just speculating - will continue investigating, and
I'll update this thread if I get to the bottom of it.
>
> I would rather not spend time on some problem that might already be
> fixed.
Completely understandable - I really appreciate the tips and pointers
thus far Eric, they've been helpful in their own right.
>
>
>


[patch net-next 1/5] net: sched: introduce support for multiple filter chain pointers registration

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

So far, it has only been possible to register a single filter chain
pointer to block->chain[0]. However, once blocks become shareable,
we need to allow registration of multiple filter chain pointers.

Signed-off-by: Jiri Pirko 
---
 include/net/pkt_cls.h |   3 +
 include/net/sch_generic.h |   5 +-
 net/sched/cls_api.c   | 247 +++---
 3 files changed, 218 insertions(+), 37 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 505d4b7..05c478e 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -28,6 +28,8 @@ struct tcf_block_ext_info {
enum tcf_block_binder_type binder_type;
tcf_chain_head_change_t *chain_head_change;
void *chain_head_change_priv;
+   bool shareable;
+   u32 block_index;
 };
 
 struct tcf_block_cb;
@@ -47,6 +49,7 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc 
*q,
 
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 {
+   WARN_ON(block->refcnt != 1);
return block->q;
 }
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index c64e62c..8cbdd82 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -264,8 +264,7 @@ typedef void tcf_chain_head_change_t(struct tcf_proto 
*tp_head, void *priv);
 
 struct tcf_chain {
struct tcf_proto __rcu *filter_chain;
-   tcf_chain_head_change_t *chain_head_change;
-   void *chain_head_change_priv;
+   struct list_head filter_chain_list;
struct list_head list;
struct tcf_block *block;
u32 index; /* chain index */
@@ -274,6 +273,8 @@ struct tcf_chain {
 
 struct tcf_block {
struct list_head chain_list;
+   u32 index; /* block index for shared blocks */
+   unsigned int refcnt;
struct net *net;
struct Qdisc *q;
struct list_head cb_list;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 206e19f..4576b2d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -180,6 +181,12 @@ static void tcf_proto_destroy(struct tcf_proto *tp)
kfree_rcu(tp, rcu);
 }
 
+struct tcf_filter_chain_list_item {
+   struct list_head list;
+   tcf_chain_head_change_t *chain_head_change;
+   void *chain_head_change_priv;
+};
+
 static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
  u32 chain_index)
 {
@@ -188,6 +195,7 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block 
*block,
chain = kzalloc(sizeof(*chain), GFP_KERNEL);
if (!chain)
return NULL;
+   INIT_LIST_HEAD(&chain->filter_chain_list);
list_add_tail(&chain->list, &block->chain_list);
chain->block = block;
chain->index = chain_index;
@@ -195,12 +203,19 @@ static struct tcf_chain *tcf_chain_create(struct 
tcf_block *block,
return chain;
 }
 
+static void tcf_chain_head_change_item(struct tcf_filter_chain_list_item *item,
+  struct tcf_proto *tp_head)
+{
+   if (item->chain_head_change)
+   item->chain_head_change(tp_head, item->chain_head_change_priv);
+}
 static void tcf_chain_head_change(struct tcf_chain *chain,
  struct tcf_proto *tp_head)
 {
-   if (chain->chain_head_change)
-   chain->chain_head_change(tp_head,
-chain->chain_head_change_priv);
+   struct tcf_filter_chain_list_item *item;
+
+   list_for_each_entry(item, &chain->filter_chain_list, list)
+   tcf_chain_head_change_item(item, tp_head);
 }
 
 static void tcf_chain_flush(struct tcf_chain *chain)
@@ -276,15 +291,84 @@ static void tcf_block_offload_unbind(struct tcf_block 
*block, struct Qdisc *q,
tcf_block_offload_cmd(block, q, ei, TC_BLOCK_UNBIND);
 }
 
-int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
- struct tcf_block_ext_info *ei)
+static int
+tcf_chain_head_change_cb_add(struct tcf_chain *chain,
+struct tcf_block_ext_info *ei)
+{
+   struct tcf_filter_chain_list_item *item;
+
+   item = kmalloc(sizeof(*item), GFP_KERNEL);
+   if (!item)
+   return -ENOMEM;
+   item->chain_head_change = ei->chain_head_change;
+   item->chain_head_change_priv = ei->chain_head_change_priv;
+   if (chain->filter_chain)
+   tcf_chain_head_change_item(item, chain->filter_chain);
+   list_add(&item->list, &chain->filter_chain_list);
+   return 0;
+}
+
+static void
+tcf_chain_head_change_cb_del(struct tcf_chain *chain,
+struct tcf_block_ext_info *ei)
+{
+   struct tcf_filter_chain_list_item *item;
+
+   list_for_each_entry(item, &chain->filter_chain_list, list) {
+   if ((!ei->chain_head_change && 

[patch net-next 3/5] net: sched: introduce block mechanism to handle netif_keep_dst calls

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

A couple of classifiers call netif_keep_dst directly on q->dev. That is
not possible for a shared block, where multiple qdiscs own the block.
So introduce an infrastructure to keep track of the block owners in a
list and use this list to implement a block variant of
netif_keep_dst.

Signed-off-by: Jiri Pirko 
---
 include/net/pkt_cls.h |  1 +
 include/net/sch_generic.h |  2 ++
 net/sched/cls_api.c   | 68 +++
 net/sched/cls_bpf.c   |  4 +--
 net/sched/cls_flow.c  |  2 +-
 net/sched/cls_route.c |  2 +-
 6 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 05c478e..683c5d5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -39,6 +39,7 @@ bool tcf_queue_work(struct work_struct *work);
 struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
bool create);
 void tcf_chain_put(struct tcf_chain *chain);
+void tcf_block_netif_keep_dst(struct tcf_block *block);
 int tcf_block_get(struct tcf_block **p_block,
  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q);
 int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 8cbdd82..ef907d4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -278,6 +278,8 @@ struct tcf_block {
struct net *net;
struct Qdisc *q;
struct list_head cb_list;
+   struct list_head owner_list;
+   bool keep_dst;
struct work_struct work;
 };
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index b3e313f..4166e3f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -371,6 +371,7 @@ static struct tcf_block *tcf_block_create(struct net *net, 
struct Qdisc *q)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&block->chain_list);
INIT_LIST_HEAD(&block->cb_list);
+   INIT_LIST_HEAD(&block->owner_list);
 
/* Create chain 0 by default, it has to be always present. */
chain = tcf_chain_create(block, 0);
@@ -425,6 +426,64 @@ static struct tcf_chain *tcf_block_chain_zero(struct 
tcf_block *block)
return list_first_entry(&block->chain_list, struct tcf_chain, list);
 }
 
+struct tcf_block_owner_item {
+   struct list_head list;
+   struct Qdisc *q;
+   enum tcf_block_binder_type binder_type;
+};
+
+static void
+tcf_block_owner_netif_keep_dst(struct tcf_block *block,
+  struct Qdisc *q,
+  enum tcf_block_binder_type binder_type)
+{
+   if (block->keep_dst &&
+   binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+   netif_keep_dst(qdisc_dev(q));
+}
+
+void tcf_block_netif_keep_dst(struct tcf_block *block)
+{
+   struct tcf_block_owner_item *item;
+
+   block->keep_dst = true;
+   list_for_each_entry(item, &block->owner_list, list)
+   tcf_block_owner_netif_keep_dst(block, item->q,
+  item->binder_type);
+}
+EXPORT_SYMBOL(tcf_block_netif_keep_dst);
+
+static int tcf_block_owner_add(struct tcf_block *block,
+  struct Qdisc *q,
+  enum tcf_block_binder_type binder_type)
+{
+   struct tcf_block_owner_item *item;
+
+   item = kmalloc(sizeof(*item), GFP_KERNEL);
+   if (!item)
+   return -ENOMEM;
+   item->q = q;
+   item->binder_type = binder_type;
+   list_add(&item->list, &block->owner_list);
+   return 0;
+}
+
+static void tcf_block_owner_del(struct tcf_block *block,
+   struct Qdisc *q,
+   enum tcf_block_binder_type binder_type)
+{
+   struct tcf_block_owner_item *item;
+
+   list_for_each_entry(item, &block->owner_list, list) {
+   if (item->q == q && item->binder_type == binder_type) {
+   list_del(&item->list);
+   kfree(item);
+   return;
+   }
+   }
+   WARN_ON(1);
+}
+
 int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
  struct tcf_block_ext_info *ei)
 {
@@ -451,6 +510,12 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct 
Qdisc *q,
}
}
 
+   err = tcf_block_owner_add(block, q, ei->binder_type);
+   if (err)
+   goto err_block_owner_add;
+
+   tcf_block_owner_netif_keep_dst(block, q, ei->binder_type);
+
err = tcf_chain_head_change_cb_add(tcf_block_chain_zero(block), ei);
if (err)
goto err_chain_head_change_cb_add;
@@ -460,6 +525,8 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct 
Qdisc *q,
return 0;
 
 err_chain_head_change_cb_add:
+   tcf_block_owner_del(block, q, ei->binder_type);
+err_block_owner_add:
if (created) 

Re: [PATCH iproute2] Add "show" subcommand to "ip fou"

2017-11-03 Thread Greg Greenway
On Nov 1, 2017, at 2:03 PM, Stephen Hemminger  
wrote:
> 
> On Tue, 31 Oct 2017 13:00:47 -0700
> Greg Greenway  wrote:
> 
>> +if (tb[FOU_ATTR_AF]) {
>> +family = rta_getattr_u8(tb[FOU_ATTR_AF]);
>> +if (family == AF_INET)
>> +family_str = "AF_INET";
>> +else if (family == AF_INET6)
>> +family_str = "AF_INET6";
>> +else
>> +family_str = "unknown";
>> +fprintf(fp, "af %s ", family_str);
> 
> The unwritten rule for ip commands is that the show function
> must format the output with same command syntax as the other commands 
> set/add/delete.
> Since there is no "af AF_INET" option to ip fou, this breaks that convention.
> Either ignore the address family, change the add command, or output with same
> syntax (-6); preferably the latter.

That makes sense.  Here's a corrected version.  It also avoids a trailing-space 
in the output.

From: Greg Greenway 
Date: Tue, 31 Oct 2017 12:47:35 -0700
Subject: [PATCH] Add "show" subcommand to "ip fou".

Sample output:

$ sudo ./ip/ip fou add port 111 ipproto 11
$ sudo ./ip/ip fou add port 222 ipproto 22 -6
$ ./ip/ip fou show
port 222 ipproto 22 -6
port 111 ipproto 11

Signed-off-by: Greg Greenway 
---
 ip/ipfou.c | 60 
 1 file changed, 60 insertions(+)

diff --git a/ip/ipfou.c b/ip/ipfou.c
index 00dbe15..ecbaf11 100644
--- a/ip/ipfou.c
+++ b/ip/ipfou.c
@@ -28,6 +28,7 @@ static void usage(void)
fprintf(stderr, "Usage: ip fou add port PORT "
"{ ipproto PROTO  | gue } [ -6 ]\n");
fprintf(stderr, "   ip fou del port PORT [ -6 ]\n");
+   fprintf(stderr, "   ip fou show\n");
fprintf(stderr, "\n");
fprintf(stderr, "Where: PROTO { ipproto-name | 1..255 }\n");
fprintf(stderr, "   PORT { 1..65535 }\n");
@@ -134,6 +135,63 @@ static int do_del(int argc, char **argv)
return 0;
 }
 
+static int print_fou_mapping(const struct sockaddr_nl *who,
+struct nlmsghdr *n, void *arg)
+{
+   FILE *fp = (FILE *)arg;
+   struct genlmsghdr *ghdr;
+   struct rtattr *tb[FOU_ATTR_MAX + 1];
+   int len = n->nlmsg_len;
+   unsigned family;
+
+   if (n->nlmsg_type != genl_family)
+   return 0;
+
+   len -= NLMSG_LENGTH(GENL_HDRLEN);
+   if (len < 0)
+   return -1;
+
+   ghdr = NLMSG_DATA(n);
+   parse_rtattr(tb, FOU_ATTR_MAX, (void *) ghdr + GENL_HDRLEN, len);
+
+   if (tb[FOU_ATTR_PORT])
+   fprintf(fp, "port %u", 
ntohs(rta_getattr_u16(tb[FOU_ATTR_PORT])));
+   if (tb[FOU_ATTR_TYPE] && rta_getattr_u8(tb[FOU_ATTR_TYPE]) == 
FOU_ENCAP_GUE)
+   fprintf(fp, " gue");
+   else if (tb[FOU_ATTR_IPPROTO])
+   fprintf(fp, " ipproto %u", 
rta_getattr_u8(tb[FOU_ATTR_IPPROTO]));
+   if (tb[FOU_ATTR_AF]) {
+   family = rta_getattr_u8(tb[FOU_ATTR_AF]);
+   if (family == AF_INET6)
+   fprintf(fp, " -6");
+   }
+   fprintf(fp, "\n");
+
+   return 0;
+}
+
+static int do_show(int argc, char **argv)
+{
+   FOU_REQUEST(req, 4096, FOU_CMD_GET, NLM_F_REQUEST | NLM_F_DUMP);
+
+   if (argc > 0) {
+   fprintf(stderr, "\"ip fou show\" does not take any 
arguments.\n");
+   return -1;
+   }
+
+   if (rtnl_send(&genl_rth, &req, req.n.nlmsg_len) < 0) {
+   perror("Cannot send show request");
+   exit(1);
+   }
+
+   if (rtnl_dump_filter(&genl_rth, print_fou_mapping, stdout) < 0) {
+   fprintf(stderr, "Dump terminated\n");
+   return 1;
+   }
+
+   return 0;
+}
+
 int do_ipfou(int argc, char **argv)
 {
if (argc < 1)
@@ -149,6 +207,8 @@ int do_ipfou(int argc, char **argv)
return do_add(argc-1, argv+1);
if (matches(*argv, "delete") == 0)
return do_del(argc-1, argv+1);
+   if (matches(*argv, "show") == 0)
+   return do_show(argc-1, argv+1);
fprintf(stderr, "Command \"%s\" is unknown, try \"ip fou help\".\n", 
*argv);
exit(-1);
 }
-- 
2.7.4



[patch net-next 5/5] net: sched: allow ingress and clsact qdiscs to share filter blocks

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Benefit from the previously introduced shared filter block
infrastructure and allow ingress and clsact qdisc instances to share
filter blocks. The block index is passed from userspace as a qdisc option.

Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/pkt_sched.h | 11 ++
 net/sched/sch_ingress.c| 89 +-
 2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 0e88cc2..9197d25 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -923,4 +923,15 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* Ingress/clsact */
+
+enum {
+   TCA_CLSACT_UNSPEC,
+   TCA_CLSACT_INGRESS_BLOCK,
+   TCA_CLSACT_EGRESS_BLOCK,
+   __TCA_CLSACT_MAX
+};
+
+#define TCA_CLSACT_MAX (__TCA_CLSACT_MAX - 1)
+
 #endif
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 5ecc38f..ee89efc 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -60,6 +60,29 @@ static void clsact_chain_head_change(struct tcf_proto 
*tp_head, void *priv)
struct mini_Qdisc_pair *miniqp = priv;
 
mini_qdisc_pair_swap(miniqp, tp_head);
+};
+
+static const struct nla_policy ingress_policy[TCA_CLSACT_MAX + 1] = {
+   [TCA_CLSACT_INGRESS_BLOCK]  = { .type = NLA_U32 },
+};
+
+static int ingress_parse_opt(struct nlattr *opt, u32 *p_ingress_block_index)
+{
+   struct nlattr *tb[TCA_CLSACT_MAX + 1];
+   int err;
+
+   *p_ingress_block_index = 0;
+
+   if (!opt)
+   return 0;
+   err = nla_parse_nested(tb, TCA_CLSACT_MAX, opt, ingress_policy, NULL);
+   if (err)
+   return err;
+
+   if (tb[TCA_CLSACT_INGRESS_BLOCK])
+   *p_ingress_block_index =
+   nla_get_u32(tb[TCA_CLSACT_INGRESS_BLOCK]);
+   return 0;
 }
 
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
@@ -70,6 +93,11 @@ static int ingress_init(struct Qdisc *sch, struct nlattr 
*opt)
 
mini_qdisc_pair_init(&q->miniqp, sch, &q->miniq_ingress);
 
+   err = ingress_parse_opt(opt, &q->block_info.block_index);
+   if (err)
+   return err;
+
+   q->block_info.shareable = true;
q->block_info.binder_type = TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
q->block_info.chain_head_change = clsact_chain_head_change;
q->block_info.chain_head_change_priv = >miniqp;
@@ -94,11 +122,14 @@ static void ingress_destroy(struct Qdisc *sch)
 
 static int ingress_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
+   struct ingress_sched_data *q = qdisc_priv(sch);
struct nlattr *nest;
 
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
goto nla_put_failure;
+   if (nla_put_u32(skb, TCA_CLSACT_INGRESS_BLOCK, q->block->index))
+   goto nla_put_failure;
 
return nla_nest_end(skb, nest);
 
@@ -166,6 +197,35 @@ static struct tcf_block *clsact_tcf_block(struct Qdisc 
*sch, unsigned long cl)
}
 }
 
+static const struct nla_policy clsact_policy[TCA_CLSACT_MAX + 1] = {
+   [TCA_CLSACT_INGRESS_BLOCK]  = { .type = NLA_U32 },
+   [TCA_CLSACT_EGRESS_BLOCK]   = { .type = NLA_U32 },
+};
+
+static int clsact_parse_opt(struct nlattr *opt, u32 *p_ingress_block_index,
+   u32 *p_egress_block_index)
+{
+   struct nlattr *tb[TCA_CLSACT_MAX + 1];
+   int err;
+
+   *p_ingress_block_index = 0;
+   *p_egress_block_index = 0;
+
+   if (!opt)
+   return 0;
+   err = nla_parse_nested(tb, TCA_CLSACT_MAX, opt, clsact_policy, NULL);
+   if (err)
+   return err;
+
+   if (tb[TCA_CLSACT_INGRESS_BLOCK])
+   *p_ingress_block_index =
+   nla_get_u32(tb[TCA_CLSACT_INGRESS_BLOCK]);
+   if (tb[TCA_CLSACT_EGRESS_BLOCK])
+   *p_egress_block_index =
+   nla_get_u32(tb[TCA_CLSACT_EGRESS_BLOCK]);
+   return 0;
+}
+
 static int clsact_init(struct Qdisc *sch, struct nlattr *opt)
 {
struct clsact_sched_data *q = qdisc_priv(sch);
@@ -174,6 +234,12 @@ static int clsact_init(struct Qdisc *sch, struct nlattr 
*opt)
 
mini_qdisc_pair_init(&q->miniqp_ingress, sch, &q->miniq_ingress);
 
+   err = clsact_parse_opt(opt, &q->ingress_block_info.block_index,
+  &q->egress_block_info.block_index);
+   if (err)
+   return err;
+
+   q->ingress_block_info.shareable = true;
q->ingress_block_info.binder_type = 
TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
q->ingress_block_info.chain_head_change = clsact_chain_head_change;
q->ingress_block_info.chain_head_change_priv = >miniqp_ingress;
@@ -184,6 +250,7 @@ static int clsact_init(struct Qdisc *sch, struct nlattr 
*opt)
 
	mini_qdisc_pair_init(&q->miniqp_egress, sch, &q->miniq_egress);
 
+   

[patch net-next 2/5] net: sched: avoid usage of tp->q in tcf_classify

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Use block index in the messages instead.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_api.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4576b2d..b3e313f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -670,8 +670,9 @@ int tcf_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
 #ifdef CONFIG_NET_CLS_ACT
 reset:
if (unlikely(limit++ >= max_reclassify_loop)) {
-   net_notice_ratelimited("%s: reclassify loop, rule prio %u, 
protocol %02x\n",
-  tp->q->ops->id, tp->prio & 0xffff,
+   net_notice_ratelimited("%u: reclassify loop, rule prio %u, 
protocol %02x\n",
+  tp->chain->block->index,
+  tp->prio & 0xffff,
   ntohs(tp->protocol));
return TC_ACT_SHOT;
}
-- 
2.9.5



[patch net-next 4/5] net: sched: remove classid and q fields from tcf_proto

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Both are no longer used, so remove them.

Signed-off-by: Jiri Pirko 
---
 include/net/sch_generic.h | 2 --
 net/sched/cls_api.c   | 7 ++-
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ef907d4..f551163 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -244,8 +244,6 @@ struct tcf_proto {
 
/* All the rest */
u32 prio;
-   u32 classid;
-   struct Qdisc*q;
void*data;
const struct tcf_proto_ops  *ops;
struct tcf_chain*chain;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4166e3f..d50cdac 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -123,8 +123,7 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)
 }
 
 static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
- u32 prio, u32 parent, struct Qdisc *q,
- struct tcf_chain *chain)
+ u32 prio, struct tcf_chain *chain)
 {
struct tcf_proto *tp;
int err;
@@ -158,8 +157,6 @@ static struct tcf_proto *tcf_proto_create(const char *kind, 
u32 protocol,
tp->classify = tp->ops->classify;
tp->protocol = protocol;
tp->prio = prio;
-   tp->classid = parent;
-   tp->q = q;
tp->chain = chain;
 
err = tp->ops->init(tp);
@@ -1066,7 +1063,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n,
prio = tcf_auto_prio(tcf_chain_tp_prev(_info));
 
tp = tcf_proto_create(nla_data(tca[TCA_KIND]),
- protocol, prio, parent, q, chain);
+ protocol, prio, chain);
if (IS_ERR(tp)) {
err = PTR_ERR(tp);
goto errout;
-- 
2.9.5



[patch net-next 0/5] net: sched: allow qdiscs to share filter block instances

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Currently the filters added to qdiscs are independent. So for example if you
have 2 netdevices and you create an ingress qdisc on both, and you want to add
identical filter rules to both, you need to add them twice. This patchset
makes this easier and mainly saves resources by allowing all filters within a
qdisc to be shared - I call it a "filter block". This also helps to save
resources when we offload to hardware, for example to an expensive TCAM.

So back to the example. First, we create 2 qdiscs. Both will share
block number 22. "22" is just an identifier. If we don't pass any
block number, a new one will be generated by the kernel:

$ tc qdisc add dev ens7 ingress block 22

$ tc qdisc add dev ens8 ingress block 22


Now if we list the qdiscs, we will see the block index in the output:

$ tc qdisc
qdisc ingress ffff: dev ens7 parent ffff:fff1 block 22
qdisc ingress ffff: dev ens8 parent ffff:fff1 block 22

Now we can add a filter to any of the qdiscs sharing the same block:

$ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 
192.168.0.0/16 action drop


We will see the same output if we list filters for ens7 and ens8, including 
stats:

$ tc -s filter show dev ens7 ingress
filter protocol ip pref 25 flower chain 0
filter protocol ip pref 25 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.0.0/16
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 39 sec used 2 sec
Action statistics:
Sent 3108 bytes 37 pkt (dropped 37, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

$ tc -s filter show dev ens8 ingress
filter protocol ip pref 25 flower chain 0
filter protocol ip pref 25 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.0.0/16
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 40 sec used 3 sec
Action statistics:
Sent 3108 bytes 37 pkt (dropped 37, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Jiri Pirko (5):
  net: sched: introduce support for multiple filter chain pointers
registration
  net: sched: avoid usage of tp->q in tcf_classify
  net: sched: introduce block mechanism to handle netif_keep_dst calls
  net: sched: remove classid and q fields from tcf_proto
  net: sched: allow ingress and clsact qdiscs to share filter blocks

 include/net/pkt_cls.h  |   4 +
 include/net/sch_generic.h  |   9 +-
 include/uapi/linux/pkt_sched.h |  11 ++
 net/sched/cls_api.c| 327 +++--
 net/sched/cls_bpf.c|   4 +-
 net/sched/cls_flow.c   |   2 +-
 net/sched/cls_route.c  |   2 +-
 net/sched/sch_ingress.c|  89 ++-
 8 files changed, 397 insertions(+), 51 deletions(-)

-- 
2.9.5



Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 0/5] macvlan offload fixes

2017-11-03 Thread Shannon Nelson

On 11/2/2017 4:33 PM, Alexander Duyck wrote:

I'm looking at performing a refactor of the macvlan offload code. However
before I started I wanted to at least get things into a running state. The
patches in this set are needed to address a number of issues that were
preventing things from working as they were supposed to.

With these changes in place I seem to be able to receive traffic as I am
supposed to in the case of ixgbe and fm10k with the offload enabled, and I
am now transmitting to the correct Tx ring in the case of ixgbe.

The last two patches in the set are what I consider to be minor clean-ups
to address the fact that we don't want packets to somehow stray and end up
being transmitted on a queue that is supposed to be in use by a macvlan
instead of the lowerdev itself.


Other than the little misspelling I flagged,
Acked-by: Shannon Nelson 



---

Alexander Duyck (5):
   ixgbe: Fix interaction between SR-IOV and macvlan offload
   fm10k: Fix VLAN configuration for macvlan offload
   ixgbe: Fix handling of macvlan Tx offload
   dev: Clean-up __skb_tx_hash to match up with traffic class based configs
   dev: Cap number of queues even with accel_priv


  drivers/net/ethernet/intel/fm10k/fm10k_netdev.c |4 ++--
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   22 ++
  net/core/dev.c  |   21 ++---
  3 files changed, 30 insertions(+), 17 deletions(-)

--
___
Intel-wired-lan mailing list
intel-wired-...@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan



Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/5] dev: Cap number of queues even with accel_priv

2017-11-03 Thread Jesse Brandeburg
On Thu, 2 Nov 2017 16:34:58 -0700
Alexander Duyck  wrote:

> From: Alexander Duyck 
> 
> With the recent fix to ixgbe we can cap the number of queues always
> regardless of if accel_priv is being used or not since the actual number of
> queues are being reported via real_num_tx_queues.

Makes sense.
Reviewed-by: Jesse Brandeburg 


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN configuration for macvlan offload

2017-11-03 Thread Jesse Brandeburg
On Thu, 2 Nov 2017 16:33:45 -0700
Alexander Duyck  wrote:

> From: Alexander Duyck 
> 
> The fm10k driver didn't work correctly when macvlan offload was enabled.
> Specifically what would occur is that we would see no unicast packets being
> received. This was traced down to us not correctly configuring the default
> VLAN ID for the port and defaulting to 0.
> 
> To correct this we either use the default ID provided by the switch or
> simply use 1. With that we are able to pass and receive traffic without any
> issues.

Reviewed-by: Jesse Brandeburg 


Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-03 Thread Daniel Borkmann

On 11/03/2017 03:31 PM, Josef Bacik wrote:

On Fri, Nov 03, 2017 at 12:12:13AM +0100, Daniel Borkmann wrote:

Hi Josef,

one more issue I just noticed, see comment below:

On 11/02/2017 03:37 PM, Josef Bacik wrote:
[...]

diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? 
*/
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? 
*/
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */

[...]

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d906775e12c1..f8f7927a9152 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4189,6 +4189,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
prog->dst_needed = 1;
if (insn->imm == BPF_FUNC_get_prandom_u32)
bpf_user_rnd_init_once();
+   if (insn->imm == BPF_FUNC_override_return)
+   prog->kprobe_override = 1;
if (insn->imm == BPF_FUNC_tail_call) {
/* If we tail call into other programs, we
 * cannot make any assumptions since they can
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9660ee65fbef..0d7fce52391d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8169,6 +8169,13 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
return -EINVAL;
}

+   /* Kprobe override only works for kprobes, not uprobes. */
+   if (prog->kprobe_override &&
+   !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) {
+   bpf_prog_put(prog);
+   return -EINVAL;
+   }


Can we somehow avoid the prog->kprobe_override flag here completely
and also same in the perf_event_attach_bpf_prog() handler?

Reason is that it's not reliable for bailing out this way: Think of
the main program you're attaching doesn't use bpf_override_return()
helper, but it tail-calls into other BPF progs that make use of it
instead. So above check would be useless and will fail and we continue
to attach the prog for probes where it's not intended to be used.

We've had similar issues in the past e.g. c2002f983767 ("bpf: fix
checking xdp_adjust_head on tail calls") is just one of those. Thus,
can we avoid the flag altogether and handle such error case differently?


So if I'm reading this right there's no way to know what we'll tail call at any
given point, so I need to go back to my previous iteration of this patch and
always save the state of the kprobe in the per-cpu variable to make sure we
don't use bpf_override_return in the wrong case?


Yeah.


The tail call functions won't be in the BPF_PROG_ARRAY right?  It'll be just
some other arbitrary function?  If that's the case then we really need something
like this


With BPF_PROG_ARRAY you mean BPF_MAP_TYPE_PROG_ARRAY or the prog array
for the tracing/multiprog attach point? The program you're calling into
is inside the BPF_MAP_TYPE_PROG_ARRAY map, but can change at any time
and can have nesting as well.


https://patchwork.kernel.org/patch/10034815/

and I need to bring that back right?  Thanks,


I'm afraid so. The thing with skb cb_access which was brought up there is
that once you have a tail call in the prog you cannot make any assumptions
anymore, therefore the cb_access flag is set to 1 so we save/restore for
those cases precautionary since it could be accessed or not later on. In
your case I think this wouldn't work since legitimate bpf kprobes progs could
use tail calls today, so setting prog->kprobe_override there would prevent
attaching for non-kprobes due to subsequent flags & TRACE_EVENT_FL_KPROBE
check.


[PATCH] mISDN: l1oip_core: replace _manual_ swap with swap macro

2017-11-03 Thread Gustavo A. R. Silva
Make use of the swap macro and remove unnecessary variables skb and cnt.
This makes the code easier to read and maintain.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/isdn/mISDN/l1oip_core.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c
index b5d590e..e365478 100644
--- a/drivers/isdn/mISDN/l1oip_core.c
+++ b/drivers/isdn/mISDN/l1oip_core.c
@@ -440,14 +440,8 @@ l1oip_socket_recv(struct l1oip *hc, u8 remotecodec, u8 
channel, u16 timebase,
 
 #ifdef REORDER_DEBUG
if (hc->chan[channel].disorder_flag) {
-   struct sk_buff *skb;
-   int cnt;
-   skb = hc->chan[channel].disorder_skb;
-   hc->chan[channel].disorder_skb = nskb;
-   nskb = skb;
-   cnt = hc->chan[channel].disorder_cnt;
-   hc->chan[channel].disorder_cnt = rx_counter;
-   rx_counter = cnt;
+   swap(hc->chan[channel].disorder_skb, nskb);
+   swap(hc->chan[channel].disorder_cnt, rx_counter);
}
hc->chan[channel].disorder_flag ^= 1;
if (nskb)
-- 
2.7.4



Re: [PATCH net 0/2] NULL pointer dereference in {ipvlan|macvlan}_port_destroy

2017-11-03 Thread Girish Moodalbail

On 11/2/17 10:05 PM, David Miller wrote:

From: Girish Moodalbail 
Date: Tue, 31 Oct 2017 09:39:45 -0700


When call to register_netdevice() (called from ipvlan_link_new())
fails, inside that function we call ipvlan_uninit() (through
ndo_uninit()) to destroy the ipvlan port. Upon returning
unsuccessfully from register_netdevice() we go ahead and call
ipvlan_port_destroy() again which causes NULL pointer dereference
panic.


The problem is that ipvlan doesn't follow the proper convention that
->ndo_uninit() must only release resources allocated by ->ndo_init().

What needs to happen is that the port allocation occur in
->ndo_init().


I agree, will send out V2. I initially started off making them (ndo_init and 
ndo_uninit) symmetric by moving the port destruction out of ndo_uninit(), but I 
hit some WARN() errors. Will figure it out.


thanks,
~Girish



Your fix, while solving some cases, does not fully cover all of the
possibilities due to this bug.

Please fix this correctly by moving the port allocation and related
setup from link creation to ->ndo_init().

Thank you.





[PATCH 1/2] can: peak_usb: remove some 'struct timeval' users

2017-11-03 Thread Arnd Bergmann
We want to remove 'struct timeval' and related interfaces since this is
generally not safe for use beyond 2038.

For peak_usb, we can simplify the internal interface by using ktime_t
directly. This should not change any behavior, but it avoids a few
conversions.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/can/usb/peak_usb/pcan_usb.c  |  9 +++--
 drivers/net/can/usb/peak_usb/pcan_usb_core.c | 15 +++
 drivers/net/can/usb/peak_usb/pcan_usb_core.h |  3 +--
 drivers/net/can/usb/peak_usb/pcan_usb_pro.c  |  9 +++--
 4 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/drivers/net/can/usb/peak_usb/pcan_usb.c 
b/drivers/net/can/usb/peak_usb/pcan_usb.c
index 25a9b79cc42d..f530a80f5051 100644
--- a/drivers/net/can/usb/peak_usb/pcan_usb.c
+++ b/drivers/net/can/usb/peak_usb/pcan_usb.c
@@ -408,7 +408,6 @@ static int pcan_usb_decode_error(struct 
pcan_usb_msg_context *mc, u8 n,
 {
struct sk_buff *skb;
struct can_frame *cf;
-   struct timeval tv;
enum can_state new_state;
 
/* ignore this error until 1st ts received */
@@ -525,8 +524,8 @@ static int pcan_usb_decode_error(struct 
pcan_usb_msg_context *mc, u8 n,
if (status_len & PCAN_USB_STATUSLEN_TIMESTAMP) {
struct skb_shared_hwtstamps *hwts = skb_hwtstamps(skb);
 
-   peak_usb_get_ts_tv(&mc->pdev->time_ref, mc->ts16, &tv);
-   hwts->hwtstamp = timeval_to_ktime(tv);
+   peak_usb_get_ts_time(&mc->pdev->time_ref, mc->ts16,
+&hwts->hwtstamp);
}
 
mc->netdev->stats.rx_packets++;
@@ -610,7 +609,6 @@ static int pcan_usb_decode_data(struct pcan_usb_msg_context 
*mc, u8 status_len)
u8 rec_len = status_len & PCAN_USB_STATUSLEN_DLC;
struct sk_buff *skb;
struct can_frame *cf;
-   struct timeval tv;
struct skb_shared_hwtstamps *hwts;
 
	skb = alloc_can_skb(mc->netdev, &cf);
@@ -658,9 +656,8 @@ static int pcan_usb_decode_data(struct pcan_usb_msg_context 
*mc, u8 status_len)
}
 
/* convert timestamp into kernel time */
-   peak_usb_get_ts_tv(&mc->pdev->time_ref, mc->ts16, &tv);
hwts = skb_hwtstamps(skb);
-   hwts->hwtstamp = timeval_to_ktime(tv);
+   peak_usb_get_ts_time(&mc->pdev->time_ref, mc->ts16, &hwts->hwtstamp);
 
/* update statistics */
mc->netdev->stats.rx_packets++;
diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.c 
b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
index 1ca76e03e965..695a75a9b4bb 100644
--- a/drivers/net/can/usb/peak_usb/pcan_usb_core.c
+++ b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
@@ -148,11 +148,11 @@ void peak_usb_set_ts_now(struct peak_time_ref *time_ref, 
u32 ts_now)
 /*
  * compute timeval according to current ts and time_ref data
  */
-void peak_usb_get_ts_tv(struct peak_time_ref *time_ref, u32 ts,
-   struct timeval *tv)
+void peak_usb_get_ts_time(struct peak_time_ref *time_ref, u32 ts, ktime_t 
*time)
 {
/* protect from getting timeval before setting now */
if (time_ref->tv_host.tv_sec > 0) {
+   struct timeval tv;
u64 delta_us;
 
delta_us = ts - time_ref->ts_dev_2;
@@ -164,10 +164,11 @@ void peak_usb_get_ts_tv(struct peak_time_ref *time_ref, 
u32 ts,
delta_us *= time_ref->adapter->us_per_ts_scale;
delta_us >>= time_ref->adapter->us_per_ts_shift;
 
-   *tv = time_ref->tv_host_0;
-   peak_usb_add_us(tv, (u32)delta_us);
+   tv = time_ref->tv_host_0;
+   peak_usb_add_us(&tv, (u32)delta_us);
+   *time = timeval_to_ktime(tv);
} else {
-   *tv = ktime_to_timeval(ktime_get());
+   *time = ktime_get();
}
 }
 
@@ -178,10 +179,8 @@ int peak_usb_netif_rx(struct sk_buff *skb,
  struct peak_time_ref *time_ref, u32 ts_low, u32 ts_high)
 {
struct skb_shared_hwtstamps *hwts = skb_hwtstamps(skb);
-   struct timeval tv;
 
-   peak_usb_get_ts_tv(time_ref, ts_low, &tv);
-   hwts->hwtstamp = timeval_to_ktime(tv);
+   peak_usb_get_ts_time(time_ref, ts_low, >hwtstamp);
 
return netif_rx(skb);
 }
diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.h 
b/drivers/net/can/usb/peak_usb/pcan_usb_core.h
index c01316cac354..b9a221ea7e5c 100644
--- a/drivers/net/can/usb/peak_usb/pcan_usb_core.h
+++ b/drivers/net/can/usb/peak_usb/pcan_usb_core.h
@@ -151,8 +151,7 @@ void peak_usb_init_time_ref(struct peak_time_ref *time_ref,
const struct peak_usb_adapter *adapter);
 void peak_usb_update_ts_now(struct peak_time_ref *time_ref, u32 ts_now);
 void peak_usb_set_ts_now(struct peak_time_ref *time_ref, u32 ts_now);
-void peak_usb_get_ts_tv(struct peak_time_ref *time_ref, u32 ts,
-   struct timeval *tv);
+void peak_usb_get_ts_time(struct peak_time_ref *time_ref, u32 ts, ktime_t *tv);
 int peak_usb_netif_rx(struct sk_buff 

[PATCH 2/2] can: peak_usb: use ktime_t consistently

2017-11-03 Thread Arnd Bergmann
This changes the calculation of the timestamps to use ktime_t
instead of struct timeval as the base. This gets rid of one
of the few remaining users of the deprecated ktime_to_timeval()
and timeval_to_ktime() helpers.

The code should also get more efficient, as we have now removed
all of the divisions.

I have left the cut-off for resetting the counters as 4.200
seconds, in order to leave the behavior unchanged otherwise.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/can/usb/peak_usb/pcan_usb_core.c | 46 +---
 drivers/net/can/usb/peak_usb/pcan_usb_core.h |  2 +-
 2 files changed, 15 insertions(+), 33 deletions(-)

diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.c 
b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
index 695a75a9b4bb..8f699ee6a528 100644
--- a/drivers/net/can/usb/peak_usb/pcan_usb_core.c
+++ b/drivers/net/can/usb/peak_usb/pcan_usb_core.c
@@ -80,21 +80,6 @@ void peak_usb_init_time_ref(struct peak_time_ref *time_ref,
}
 }
 
-static void peak_usb_add_us(struct timeval *tv, u32 delta_us)
-{
-   /* number of s. to add to final time */
-   u32 delta_s = delta_us / 1000000;
-
-   delta_us -= delta_s * 1000000;
-
-   tv->tv_usec += delta_us;
-   if (tv->tv_usec >= 1000000) {
-   tv->tv_usec -= 1000000;
-   delta_s++;
-   }
-   tv->tv_sec += delta_s;
-}
-
 /*
  * sometimes, another now may be  more recent than current one...
  */
@@ -103,7 +88,7 @@ void peak_usb_update_ts_now(struct peak_time_ref *time_ref, 
u32 ts_now)
time_ref->ts_dev_2 = ts_now;
 
/* should wait at least two passes before computing */
-   if (time_ref->tv_host.tv_sec > 0) {
+   if (ktime_to_ns(time_ref->tv_host) > 0) {
u32 delta_ts = time_ref->ts_dev_2 - time_ref->ts_dev_1;
 
if (time_ref->ts_dev_2 < time_ref->ts_dev_1)
@@ -118,26 +103,26 @@ void peak_usb_update_ts_now(struct peak_time_ref 
*time_ref, u32 ts_now)
  */
 void peak_usb_set_ts_now(struct peak_time_ref *time_ref, u32 ts_now)
 {
-   if (time_ref->tv_host_0.tv_sec == 0) {
+   if (ktime_to_ns(time_ref->tv_host_0) == 0) {
/* use monotonic clock to correctly compute further deltas */
-   time_ref->tv_host_0 = ktime_to_timeval(ktime_get());
-   time_ref->tv_host.tv_sec = 0;
+   time_ref->tv_host_0 = ktime_get();
+   time_ref->tv_host = ktime_set(0, 0);
} else {
/*
-* delta_us should not be >= 2^32 => delta_s should be < 4294
+* delta_us should not be >= 2^32 => delta should be < 4294s
 * handle 32-bits wrapping here: if count of s. reaches 4200,
 * reset counters and change time base
 */
-   if (time_ref->tv_host.tv_sec != 0) {
-   u32 delta_s = time_ref->tv_host.tv_sec
-   - time_ref->tv_host_0.tv_sec;
-   if (delta_s > 4200) {
+   if (ktime_to_ns(time_ref->tv_host)) {
+   ktime_t delta = ktime_sub(time_ref->tv_host,
+ time_ref->tv_host_0);
+   if (ktime_to_ns(delta) > (4200ull * NSEC_PER_SEC)) {
time_ref->tv_host_0 = time_ref->tv_host;
time_ref->ts_total = 0;
}
}
 
-   time_ref->tv_host = ktime_to_timeval(ktime_get());
+   time_ref->tv_host = ktime_get();
time_ref->tick_count++;
}
 
@@ -146,13 +131,12 @@ void peak_usb_set_ts_now(struct peak_time_ref *time_ref, 
u32 ts_now)
 }
 
 /*
- * compute timeval according to current ts and time_ref data
+ * compute time according to current ts and time_ref data
  */
 void peak_usb_get_ts_time(struct peak_time_ref *time_ref, u32 ts, ktime_t 
*time)
 {
-   /* protect from getting timeval before setting now */
-   if (time_ref->tv_host.tv_sec > 0) {
-   struct timeval tv;
+   /* protect from getting time before setting now */
+   if (ktime_to_ns(time_ref->tv_host)) {
u64 delta_us;
 
delta_us = ts - time_ref->ts_dev_2;
@@ -164,9 +148,7 @@ void peak_usb_get_ts_time(struct peak_time_ref *time_ref, 
u32 ts, ktime_t *time)
delta_us *= time_ref->adapter->us_per_ts_scale;
delta_us >>= time_ref->adapter->us_per_ts_shift;
 
-   tv = time_ref->tv_host_0;
-   peak_usb_add_us(&tv, (u32)delta_us);
-   *time = timeval_to_ktime(tv);
+   *time = ktime_add_us(time_ref->tv_host_0, delta_us);
} else {
*time = ktime_get();
}
diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_core.h 
b/drivers/net/can/usb/peak_usb/pcan_usb_core.h
index b9a221ea7e5c..29f03dccca10 100644
--- 

Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 3/5] ixgbe: Fix handling of macvlan Tx offload

2017-11-03 Thread Shannon Nelson

On 11/2/2017 4:34 PM, Alexander Duyck wrote:

From: Alexander Duyck 

This update makes it so that we report the actual number of Tx queues via
real_num_tx_queues but are still restricted to RSS on only the first pool
by setting num_tc equal to 1. Doing this locks us into only having the
ability to setup XPS on the queues in that pool, and only those queues
should be used for transmitting anything other than macvlan traffic.

Signed-off-by: Alexander Duyck 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   17 +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 69ef35d13c36..b22ec4b9d02c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6638,8 +6638,9 @@ int ixgbe_open(struct net_device *netdev)
goto err_req_irq;
  
  	/* Notify the stack of the actual queue counts. */

-   if (adapter->num_rx_pools > 1)
-   queues = adapter->num_rx_queues_per_pool;
+   if (adapter->num_rx_pools > 1 &&
+   adapter->num_tx_queues > IXGBE_MAX_L2A_QUEUES)
+   queues = IXGBE_MAX_L2A_QUEUES;
else
queues = adapter->num_tx_queues;
  
@@ -8901,6 +8902,18 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)

if (adapter->hw.mac.type == ixgbe_mac_82598EB)
adapter->hw.fc.requested_mode = adapter->last_lfc_mode;
  
+		/* To support macvlan offload we have to use num_tc to

+* restrict the queues that can be used by the device.
+* By doing this we can avoid reporing a false number of


s/reporing/reporting/


+* queues.
+*/
+   if (adapter->num_rx_pools > 1) {
+   u16 qpp = adapter->num_rx_queues_per_pool;
+
+   netdev_set_num_tc(dev, 1);
+   netdev_set_tc_queue(dev, 0, qpp, 0);
+   }
+
adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
  
  		adapter->temp_dcb_cfg.pfc_mode_enable = false;





Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-11-03 Thread Pravin Shelar
On Thu, Nov 2, 2017 at 7:43 PM, Andy Zhou  wrote:
> On Thu, Nov 2, 2017 at 5:07 AM, Pravin Shelar  wrote:
>> On Thu, Nov 2, 2017 at 3:07 AM, Andy Zhou  wrote:
>>> On Fri, Oct 20, 2017 at 8:32 PM, Pravin Shelar  wrote:
 On Thu, Oct 19, 2017 at 5:58 PM, Andy Zhou  wrote:
>
> On Thu, Oct 19, 2017 at 02:47 Pravin Shelar  wrote:
>>
>> On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou  wrote:
>> > OVS kernel datapath so far does not support Openflow meter action.
>> > This is the first stab at adding kernel datapath meter support.
>> > This implementation supports only drop band type.
>> >
>> > Signed-off-by: Andy Zhou 
>> > ---
>> >  net/openvswitch/Makefile   |   1 +
>> >  net/openvswitch/datapath.c |  14 +-
>> >  net/openvswitch/datapath.h |   3 +
>> >  net/openvswitch/meter.c| 604
>> > +
>> >  net/openvswitch/meter.h|  54 
>> >  5 files changed, 674 insertions(+), 2 deletions(-)
>> >  create mode 100644 net/openvswitch/meter.c
>> >  create mode 100644 net/openvswitch/meter.h
>> >
>> This patch mostly looks good. I have one comment below.
>>
>> > +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info
>> > *info)
>> > +{
>> > +   struct nlattr **a = info->attrs;
>> > +   struct dp_meter *meter, *old_meter;
>> > +   struct sk_buff *reply;
>> > +   struct ovs_header *ovs_reply_header;
>> > +   struct ovs_header *ovs_header = info->userhdr;
>> > +   struct datapath *dp;
>> > +   int err;
>> > +   u32 meter_id;
>> > +   bool failed;
>> > +
>> > +   meter = dp_meter_create(a);
>> > +   if (IS_ERR_OR_NULL(meter))
>> > +   return PTR_ERR(meter);
>> > +
>> > +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
>> > + &ovs_reply_header);
>> > +   if (IS_ERR(reply)) {
>> > +   err = PTR_ERR(reply);
>> > +   goto exit_free_meter;
>> > +   }
>> > +
>> > +   ovs_lock();
>> > +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
>> > +   if (!dp) {
>> > +   err = -ENODEV;
>> > +   goto exit_unlock;
>> > +   }
>> > +
>> > +   if (!a[OVS_METER_ATTR_ID]) {
>> > +   err = -ENODEV;
>> > +   goto exit_unlock;
>> > +   }
>> > +
>> > +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
>> > +
>> > +   /* Cannot fail after this. */
>> > +   old_meter = lookup_meter(dp, meter_id);
>> I do not see RCU read lock taken here. This is not correctness issue
>> but it could cause RCU checker to spit out warning message. You could
>> do same trick that is done in get_dp() to avoid this issue.
>
> O.K.
>>
>>
>>
>> Can you also test the code with rcu sparse check config option enabled?
>
>
> Do you mean to sparse compile with CONFIG_PROVE_LOCKING and
> CONFIG_DEBUG_OBJECTS_RCU_HEAD?

 You could use all following options simultaneously:
 CONFIG_PREEMPT
 CONFIG_DEBUG_PREEMPT
 CONFIG_DEBUG_SPINLOCK
 CONFIG_DEBUG_ATOMIC_SLEEP
 CONFIG_PROVE_RCU
 CONFIG_DEBUG_OBJECTS_RCU_HEAD
>>>
>>> Thanks, I turned on those flags but did not get any error message. Do you
>>> mind sharing the RCU checker message?
>>
>> There would be assert failure and stack trace. so it would be pretty
>> obvious in kernel log messages.
>> Let me know if you do not see any stack trace while running meter
>> create, delete and execute.
>
> No I did not see them.
ok, Can you send out patch against latest net-next?


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 11:13 -0400, Vitaly Davidovich wrote:
> Ok, an interesting finding.  The client was originally running with
> SO_RCVBUF of 75K (apparently someone decided to set that for some
> unknown reason).  I tried the test with a 1MB recv buffer and
> everything works perfectly! The client responds with 0 window alerts,
> the server just hits the persist condition and sends keep-alive
> probes; the client continues answering with a 0 window up until it
> wakes up and starts processing data in its receive buffer.  At that
> point, the window opens up and the server sends more data.  Basically,
> things look as one would expect in this situation :).
> 
> /proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
> conversation flows normally, as described above, when I change the
> client's recv buf size to 1048576.  I also tried 131072, but that
> doesn't work - same retrans/no ACKs situation.
> 
> I think this eliminates (right?) any middleware from the equation.
> Instead, perhaps it's some bad interaction between a low recv buf size
> and either some other TCP setting or TSO mechanics (LRO specifically).
> Still investigating further.

Just in case, have you tried a more recent linux kernel ?

I would rather not spend time on some problem that might already be
fixed.





RE: [PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-03 Thread Salil Mehta
Hi Andrew,

> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Friday, November 03, 2017 3:52 PM
> To: lipeng (Y)
> Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; Linuxarm; Salil Mehta
> Subject: Re: [PATCH net-next 0/6] net: hns3: support set_link_ksettings
> and for nway_reset ethtool command
> 
> On Fri, Nov 03, 2017 at 12:18:24PM +0800, Lipeng wrote:
> > This patch-set adds support for set_link_ksettings && for nway_resets
> > ethtool command and fixes some related ethtool bugs.
> > 1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
> > 2, patch[5/6] adds support ethtool_ops.for nway_reset.
> > 3, patch[1/6,2/6,3/6,6/6] fix some bugs for getting port information
> by
> >ethtool command(ethtool ethx).
> 
> Hi Lipeng
> 
> Do you want the fixes applied to net, and back ported to stable?
> 
> If so, you need to submit them separately, and against the correct
> tree.
Yes, you are correct. This should have been submitted against the net repo.
But now that this patch-set has been accepted by Dave for net-next, do you
think we can still submit it again against Dave's net repo?

Thanks
Salil

> 
>   Andrew


Re: [PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-03 Thread Andrew Lunn
On Fri, Nov 03, 2017 at 12:18:24PM +0800, Lipeng wrote:
> This patch-set adds support for set_link_ksettings && for nway_resets
> ethtool command and fixes some related ethtool bugs.
> 1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
> 2, patch[5/6] adds support ethtool_ops.for nway_reset.
> 3, patch[1/6,2/6,3/6,6/6] fix some bugs for getting port information by
>ethtool command(ethtool ethx).

Hi Lipeng

Do you want the fixes applied to net, and back ported to stable?

If so, you need to submit them separately, and against the correct
tree.

Andrew


[PATCH net-next v6 3/3] act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-11-03 Thread Manish Kurup
Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action so that reads take
only the RCU read lock, and updates publish a new parameter structure instead.
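For readers unfamiliar with the pattern: the rewrite keeps all mutable fields in one RCU-managed struct and replaces it wholesale on update. Below is a hedged userspace sketch of that publish/read pattern, using C11 atomics in place of the kernel's rcu_assign_pointer()/rcu_dereference(); all names are invented for illustration, and the grace-period/kfree_rcu() step is deliberately elided (the old block is simply leaked here).

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Analogue of struct tcf_vlan_params: every mutable field lives in one
 * heap-allocated block that is replaced wholesale on update. */
struct vlan_params {
	int action;
	unsigned short push_vid;
};

/* Analogue of the __rcu vlan_p pointer; NULL until first set. */
static _Atomic(struct vlan_params *) vlan_p;

/* Reader side: the kernel would use rcu_read_lock()/rcu_dereference();
 * in this single-process sketch an acquire load stands in for both. */
static unsigned short vlan_get_vid(void)
{
	struct vlan_params *p =
		atomic_load_explicit(&vlan_p, memory_order_acquire);
	return p->push_vid;
}

/* Updater side: build a complete new block, then publish it with one
 * release store. The old block would be freed via kfree_rcu() in the
 * kernel; we leak it here since userspace has no grace-period tracking. */
static void vlan_set_params(int action, unsigned short vid)
{
	struct vlan_params *p = malloc(sizeof(*p));

	p->action = action;
	p->push_vid = vid;
	atomic_store_explicit(&vlan_p, p, memory_order_release);
}
```

The key property mirrored from the patch: readers never take a lock, and they always see either the complete old block or the complete new one, never a half-updated mix.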

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 
---
 include/net/tc_act/tc_vlan.h | 46 +--
 net/sched/act_vlan.c | 75 ++--
 2 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
index c2090df..22ae260 100644
--- a/include/net/tc_act/tc_vlan.h
+++ b/include/net/tc_act/tc_vlan.h
@@ -13,12 +13,17 @@
 #include 
 #include 
 
+struct tcf_vlan_params {
+   int   tcfv_action;
+   u16   tcfv_push_vid;
+   __be16tcfv_push_proto;
+   u8tcfv_push_prio;
+   struct rcu_head   rcu;
+};
+
 struct tcf_vlan {
struct tc_actioncommon;
-   int tcfv_action;
-   u16 tcfv_push_vid;
-   __be16  tcfv_push_proto;
-   u8  tcfv_push_prio;
+   struct tcf_vlan_params __rcu *vlan_p;
 };
 #define to_vlan(a) ((struct tcf_vlan *)a)
 
@@ -33,22 +38,45 @@ static inline bool is_tcf_vlan(const struct tc_action *a)
 
 static inline u32 tcf_vlan_action(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_action;
+   u32 tcfv_action;
+
+   rcu_read_lock();
+   tcfv_action = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_action;
+   rcu_read_unlock();
+
+   return tcfv_action;
 }
 
 static inline u16 tcf_vlan_push_vid(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_vid;
+   u16 tcfv_push_vid;
+
+   rcu_read_lock();
+   tcfv_push_vid = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_vid;
+   rcu_read_unlock();
+
+   return tcfv_push_vid;
 }
 
 static inline __be16 tcf_vlan_push_proto(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_proto;
+   __be16 tcfv_push_proto;
+
+   rcu_read_lock();
+   tcfv_push_proto = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_proto;
+   rcu_read_unlock();
+
+   return tcfv_push_proto;
 }
 
 static inline u8 tcf_vlan_push_prio(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_prio;
-}
+   u8 tcfv_push_prio;
 
+   rcu_read_lock();
+   tcfv_push_prio = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_prio;
+   rcu_read_unlock();
+
+   return tcfv_push_prio;
+}
 #endif /* __NET_TC_VLAN_H */
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index b093bad..97f717a 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -26,6 +26,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct 
tc_action *a,
struct tcf_result *res)
 {
struct tcf_vlan *v = to_vlan(a);
+   struct tcf_vlan_params *p;
int action;
int err;
u16 tci;
@@ -33,24 +34,27 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
 	tcf_lastuse_update(&v->tcf_tm);
bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
 
-	spin_lock(&v->tcf_lock);
-   action = v->tcf_action;
-
/* Ensure 'data' points at mac_header prior calling vlan manipulating
 * functions.
 */
if (skb_at_tc_ingress(skb))
skb_push_rcsum(skb, skb->mac_len);
 
-   switch (v->tcfv_action) {
+   rcu_read_lock();
+
+   action = READ_ONCE(v->tcf_action);
+
+   p = rcu_dereference(v->vlan_p);
+
+   switch (p->tcfv_action) {
case TCA_VLAN_ACT_POP:
err = skb_vlan_pop(skb);
if (err)
goto drop;
break;
case TCA_VLAN_ACT_PUSH:
-   err = skb_vlan_push(skb, v->tcfv_push_proto, v->tcfv_push_vid |
-   (v->tcfv_push_prio << VLAN_PRIO_SHIFT));
+   err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
+   (p->tcfv_push_prio << VLAN_PRIO_SHIFT));
if (err)
goto drop;
break;
@@ -69,14 +73,14 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
goto drop;
}
/* replace the vid */
-   tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid;
+   tci = (tci & ~VLAN_VID_MASK) | p->tcfv_push_vid;
/* replace prio bits, if tcfv_push_prio specified */
-   if (v->tcfv_push_prio) {
+   if (p->tcfv_push_prio) {
tci &= ~VLAN_PRIO_MASK;
-   tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT;
+   tci |= p->tcfv_push_prio << VLAN_PRIO_SHIFT;
}
/* put updated tci as 

[PATCH net-next v6 2/3] nfp flower action: Modified to use VLAN helper functions

2017-11-03 Thread Manish Kurup
Modified the Netronome nfp flower action to use the VLAN helper functions
instead of accessing the structure members directly.

Signed-off-by: Manish Kurup 
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index de64ced..c1c595f 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -58,7 +58,6 @@ nfp_fl_push_vlan(struct nfp_fl_push_vlan *push_vlan,
 const struct tc_action *action)
 {
size_t act_size = sizeof(struct nfp_fl_push_vlan);
-   struct tcf_vlan *vlan = to_vlan(action);
u16 tmp_push_vlan_tci;
 
push_vlan->head.jump_id = NFP_FL_ACTION_OPCODE_PUSH_VLAN;
@@ -67,8 +66,8 @@ nfp_fl_push_vlan(struct nfp_fl_push_vlan *push_vlan,
push_vlan->vlan_tpid = tcf_vlan_push_proto(action);
 
tmp_push_vlan_tci =
-   FIELD_PREP(NFP_FL_PUSH_VLAN_PRIO, vlan->tcfv_push_prio) |
-   FIELD_PREP(NFP_FL_PUSH_VLAN_VID, vlan->tcfv_push_vid) |
+   FIELD_PREP(NFP_FL_PUSH_VLAN_PRIO, tcf_vlan_push_prio(action)) |
+   FIELD_PREP(NFP_FL_PUSH_VLAN_VID, tcf_vlan_push_vid(action)) |
NFP_FL_PUSH_VLAN_CFI;
push_vlan->vlan_tci = cpu_to_be16(tmp_push_vlan_tci);
 }
-- 
2.7.4



[PATCH net-next v6 0/3] Incorporated all required changes

2017-11-03 Thread Manish Kurup
Hi everyone,

Modified the Netronome driver's flower action to use the VLAN helper
functions instead of dereferencing the structure directly. This is
required for the VLAN action patch.

Could you please review?

Here are the changes:
v2: Fixed all helper functions to use RCU (rtnl_dereference) - Eric, Jamal
v2: Fixed indentation, extra line nits - Jamal, Jiri
v2: Moved rcu_head to the end of the struct - Jiri
v2: Re-formatted locals to reverse-christmas-tree - Jiri
v2: Removed mismatched spin_lock() - Cong
v2: Removed spin_lock_bh() in tcf_vlan_init, rtnl_dereference() should
suffice - Cong, Jiri
v4: Modified the nfp flower action code to use the VLAN helper functions
instead of referencing the structure directly. Isolated this into a
separate patch - Pieter Jansen
v5: Got rid of the unlikely() for the allocation case - Simon Horman
v6: Added cleanup functions for RCU alloc - Dave Miller

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 

Manish Kurup (3):
  act_vlan: Change stats update to use per-core stats
  nfp flower action: Modified to use VLAN helper functions
  act_vlan: VLAN action rewrite to use RCU lock/unlock and update

 drivers/net/ethernet/netronome/nfp/flower/action.c |  5 +-
 include/net/tc_act/tc_vlan.h   | 46 +---
 net/sched/act_vlan.c   | 81 +++---
 3 files changed, 94 insertions(+), 38 deletions(-)

-- 
2.7.4



[PATCH net-next v6 1/3] act_vlan: Change stats update to use per-core stats

2017-11-03 Thread Manish Kurup
The VLAN action maintains one set of stats across all cores and uses a
spinlock to synchronize updates to it from all of them. Changed this to
use per-CPU stats instead, which avoids the lock on the fast path and
improves performance.
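As an illustrative userspace analogue (all names invented; the kernel uses alloc_percpu(), this_cpu_ptr() and bstats_cpu_update()), per-CPU stats give each CPU its own counter slot so the packet path never touches a shared lock, and the slots are only folded together when stats are read:

```c
#define NR_CPUS 4	/* fixed size for the sketch */

/* One counter slot per CPU; in the kernel these live in per-CPU memory
 * and are located via this_cpu_ptr(). */
static unsigned long long pkt_count[NR_CPUS];

/* Fast path: no lock, each CPU increments only its own slot. */
static void stats_inc(int cpu, unsigned long long bytes)
{
	(void)bytes;	/* a real bstats would accumulate bytes too */
	pkt_count[cpu]++;
}

/* Slow path (e.g. 'tc -s actions'): fold the per-CPU slots together.
 * This tolerates slightly stale values, which is the usual trade-off. */
static unsigned long long stats_total(void)
{
	unsigned long long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pkt_count[cpu];
	return sum;
}
```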

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 
---
 net/sched/act_vlan.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 16eb067..b093bad 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -30,9 +30,10 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
int err;
u16 tci;
 
-	spin_lock(&v->tcf_lock);
 	tcf_lastuse_update(&v->tcf_tm);
-	bstats_update(&v->tcf_bstats, skb);
+   bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
+
+	spin_lock(&v->tcf_lock);
action = v->tcf_action;
 
/* Ensure 'data' points at mac_header prior calling vlan manipulating
@@ -85,7 +86,8 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
 
 drop:
action = TC_ACT_SHOT;
-   v->tcf_qstats.drops++;
+   qstats_drop_inc(this_cpu_ptr(v->common.cpu_qstats));
+
 unlock:
if (skb_at_tc_ingress(skb))
skb_pull_rcsum(skb, skb->mac_len);
@@ -172,7 +174,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
-				     &act_vlan_ops, bind, false);
+				     &act_vlan_ops, bind, true);
if (ret)
return ret;
 
-- 
2.7.4



[PATCH net] l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6

2017-11-03 Thread Guillaume Nault
Using l2tp_tunnel_find() in l2tp_ip_recv() is wrong for two reasons:

  * It doesn't take a reference on the returned tunnel, which makes the
call racy wrt. concurrent tunnel deletion.

  * The lookup is only based on the tunnel identifier, so it can return
a tunnel that doesn't match the packet's addresses or protocol.

For example, a packet sent to an L2TPv3 over IPv6 tunnel can be
delivered to an L2TPv2 over UDPv4 tunnel. This is worse than a simple
cross-talk: when delivering the packet to an L2TP over UDP tunnel, the
corresponding socket is UDP, where ->sk_backlog_rcv() is NULL. Calling
sk_receive_skb() will then crash the kernel by trying to execute this
callback.

And l2tp_tunnel_find() isn't even needed here. __l2tp_ip_bind_lookup()
properly checks the socket binding and connection settings. It was used
as a fallback mechanism for finding tunnels that didn't have their data
path registered yet. But it's not limited to this case and can be used
to replace l2tp_tunnel_find() in the general case.

Fix l2tp_ip6 in the same way.

Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_ip.c  | 24 +---
 net/l2tp/l2tp_ip6.c | 24 +---
 2 files changed, 18 insertions(+), 30 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4d322c1b7233..e4280b6568b4 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -123,6 +123,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
unsigned char *ptr, *optr;
struct l2tp_session *session;
struct l2tp_tunnel *tunnel = NULL;
+   struct iphdr *iph;
int length;
 
if (!pskb_may_pull(skb, 4))
@@ -178,24 +179,17 @@ static int l2tp_ip_recv(struct sk_buff *skb)
goto discard;
 
 	tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-   tunnel = l2tp_tunnel_find(net, tunnel_id);
-   if (tunnel) {
-   sk = tunnel->sock;
-   sock_hold(sk);
-   } else {
-   struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
-
-		read_lock_bh(&l2tp_ip_lock);
-   sk = __l2tp_ip_bind_lookup(net, iph->daddr, iph->saddr,
-  inet_iif(skb), tunnel_id);
-   if (!sk) {
-			read_unlock_bh(&l2tp_ip_lock);
-   goto discard;
-   }
+   iph = (struct iphdr *)skb_network_header(skb);
 
-   sock_hold(sk);
+	read_lock_bh(&l2tp_ip_lock);
+   sk = __l2tp_ip_bind_lookup(net, iph->daddr, iph->saddr, inet_iif(skb),
+  tunnel_id);
+   if (!sk) {
 		read_unlock_bh(&l2tp_ip_lock);
+   goto discard;
}
+   sock_hold(sk);
+	read_unlock_bh(&l2tp_ip_lock);
 
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_put;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 88b397c30d86..8bcaa975b432 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -136,6 +136,7 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
unsigned char *ptr, *optr;
struct l2tp_session *session;
struct l2tp_tunnel *tunnel = NULL;
+   struct ipv6hdr *iph;
int length;
 
if (!pskb_may_pull(skb, 4))
@@ -192,24 +193,17 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
goto discard;
 
 	tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-   tunnel = l2tp_tunnel_find(net, tunnel_id);
-   if (tunnel) {
-   sk = tunnel->sock;
-   sock_hold(sk);
-   } else {
-   struct ipv6hdr *iph = ipv6_hdr(skb);
-
-		read_lock_bh(&l2tp_ip6_lock);
-		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, &iph->saddr,
-   inet6_iif(skb), tunnel_id);
-   if (!sk) {
-			read_unlock_bh(&l2tp_ip6_lock);
-   goto discard;
-   }
+   iph = ipv6_hdr(skb);
 
-   sock_hold(sk);
+	read_lock_bh(&l2tp_ip6_lock);
+	sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, &iph->saddr,
+   inet6_iif(skb), tunnel_id);
+   if (!sk) {
 		read_unlock_bh(&l2tp_ip6_lock);
+   goto discard;
}
+   sock_hold(sk);
+	read_unlock_bh(&l2tp_ip6_lock);
 
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_put;
-- 
2.15.0



Re: [PATCH 6/7] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-03 Thread Andrew Lunn
> >>+static char *mix_port;
> >>+module_param(mix_port, charp, 0444);
> >>+MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX 
> >>interfaces.");
> >
> >Can you derive this from Device Tree /platform data configuration?
> >
> >>+
> >>+static char *pki_port;
> >>+module_param(pki_port, charp, 0444);
> >>+MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
> >
> >Likewise
> 
> The SoC is flexible in how it is configured.  Technically the device tree
> should only be used to specify information about the physical configuration
> of the system that cannot be probed for, and this is about policy rather
> than physical wiring.  That said, we do take the default configuration from
> the device tree, but give the option here to override via the module command
> line.

Module parameters are pretty much frowned upon. 

You should really try to remove them all, if possible.

> >>+/* Registers are accessed via xkphys */
> >>+#define SSO_BASE   0x16700ull
> >>+#define SSO_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) 
> >>+  \
> >>+SSO_BASE)
> >>+#define GRP_OFFSET(grp)((grp) << 16)
> >>+#define GRP_ADDR(n, g) (SSO_ADDR(n) + GRP_OFFSET(g))
> >>+#define SSO_GRP_AQ_CNT(n, g)   (GRP_ADDR(n, g)+ 
> >>0x2700)
> >>+
> >>+#define MIO_PTP_BASE   0x10700ull
> >>+#define MIO_PTP_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) +  \
> >>+MIO_PTP_BASE)
> >>+#define MIO_PTP_CLOCK_CFG(node)(MIO_PTP_ADDR(node) 
> >>+ 0xf00)
> >>+#define MIO_PTP_CLOCK_HI(node) (MIO_PTP_ADDR(node) 
> >>+ 0xf10)
> >>+#define MIO_PTP_CLOCK_COMP(node)   (MIO_PTP_ADDR(node) + 0xf18)
> >
> >I am sure this will work great on anything but MIPS64 ;)
> 
> Sarcasm duly noted.
> 
> That said, by definition it is exactly an OCTEON-III/MIPS64, and can never
> be anything else.  It is known a priori that the hardware and this driver
> will never be used anywhere else.

Please make sure your Kconfig really enforces this. Generally, we
suggest allowing the driver to be compiled when COMPILE_TEST is set.
That gives you better compiler test coverage. But it seems like this
driver won't compile under such conditions.
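As a sketch of what Andrew is asking for (the symbol names here are illustrative, not the driver's actual Kconfig), the usual idiom is to restrict the hardware dependency while still permitting compile testing:

```kconfig
config OCTEON3_ETHERNET
	tristate "Cavium Octeon III Ethernet support"
	depends on CAVIUM_OCTEON_SOC || COMPILE_TEST
	depends on 64BIT
	help
	  Ethernet driver for the BGX/MIX/PKI interfaces found on
	  Cavium Octeon III SoCs.
```

With `|| COMPILE_TEST`, build bots on other architectures exercise the driver; that only works if the code avoids MIPS64-only constructs (like direct xkphys dereferences) outside of arch-guarded helpers.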

> >>+static int num_packet_buffers = 768;
> >>+module_param(num_packet_buffers, int, 0444);
> >>+MODULE_PARM_DESC(num_packet_buffers,
> >>+"Number of packet buffers to allocate per port.");
> >
> >Consider implementing ethtool -g/G for this.
> 
> That may be work for a follow-on patch.

Then please remove the module parameter now.

> >>+static int rx_queues = 1;
> >>+module_param(rx_queues, int, 0444);
> >>+MODULE_PARM_DESC(rx_queues, "Number of RX threads per port.");
> >
> >Same thing, can you consider using an ethtool knob for that?
> 
> Also may be work for a follow-on patch.

Ditto

> >>+/**
> >>+ * Reads a 64 bit value from the processor local scratchpad memory.
> >>+ *
> >>+ * @param offset byte offset into scratch pad to read
> >>+ *
> >>+ * @return value read
> >>+ */
> >>+static inline u64 scratch_read64(u64 offset)
> >>+{
> >>+   return *(u64 *)((long)SCRATCH_BASE + offset);
> >>+}
> >
> >No barriers needed whatsoever?
> 
> Nope.

Then it would be good to add a comment about why no barrier is
needed. Otherwise people are going to ask why there is no barrier,
submit patches adding barriers, etc.

   Andrew

