Re: [PATCH net-next] nxp: fix trivial comment typo

2018-11-16 Thread Vladimir Zapolskiy
Hello Andrea,

On 11/14/2018 08:47 PM, Andrea Claudi wrote:
> s/rxfliterctrl/rxfilterctrl
> 
> Signed-off-by: Andrea Claudi 

thank you for the patch, but let me ask you to change the subject line by
adding the expected prefixes 'net: lpc_eth: fix trivial comment typo'.
Also it would be nice to see a simple non-empty commit description.

--
Best wishes,
Vladimir


[PATCH] socket: do a generic_file_splice_read when proto_ops has no splice_read

2018-11-16 Thread kaslevs
From: Slavomir Kaslev 

splice(2) fails with -EINVAL when reading from a socket that has no splice_read
set in its proto_ops (such as vsock sockets). Make this fall back to
generic_file_splice_read instead.

Signed-off-by: Slavomir Kaslev 
---
 net/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/socket.c b/net/socket.c
index 593826e11a53..334fcc617ef2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -853,7 +853,7 @@ static ssize_t sock_splice_read(struct file *file, loff_t 
*ppos,
struct socket *sock = file->private_data;
 
if (unlikely(!sock->ops->splice_read))
-   return -EINVAL;
+   return generic_file_splice_read(file, ppos, pipe, len, flags);
 
return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }
-- 
2.19.1
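
As an illustrative sketch (untested, not part of the patch), here is a minimal
user-space caller of the affected path; the helper name and buffer handling
are hypothetical. Splicing from a socket whose proto_ops lacks .splice_read
(e.g. a vsock socket) used to fail with -EINVAL; with this patch it falls back
to the generic path:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy up to len bytes from a connected socket to out_fd via a pipe. */
static ssize_t sock_to_file(int sock_fd, int out_fd, size_t len)
{
	int p[2];
	ssize_t n;

	if (pipe(p) < 0)
		return -1;
	/* socket -> pipe: this is the sock_splice_read() side */
	n = splice(sock_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
	if (n > 0)
		/* pipe -> file: unaffected by this patch */
		n = splice(p[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE);
	close(p[0]);
	close(p[1]);
	return n;
}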



Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread Davide Caratti
On Thu, 2018-11-15 at 05:53 -0800, Eric Dumazet wrote:
> 
> On 11/15/2018 03:43 AM, Davide Caratti wrote:
> > On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
> > > On 09/13/2018 10:29 AM, Davide Caratti wrote:
> > > > use RCU instead of spinlocks, to protect concurrent read/write on
> > > > act_police configuration. This reduces the effects of contention in the
> > > > data path, in case multiple readers are present.
> > > > 
> > > > Signed-off-by: Davide Caratti 
> > > > ---
> > > >  net/sched/act_police.c | 156 -
> > > >  1 file changed, 92 insertions(+), 64 deletions(-)
> > > > 
> > > 
> > > I must be missing something obvious with this patch.
> > 
> > hello Eric,
> > 
> > On the contrary, I missed something obvious when I wrote that patch: there
> > is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c; thank you for
> > noticing it.
> > 
> > These variables still need to be protected with a spinlock. I will do a
> > patch and evaluate whether 'act_police' is still faster than a version where
> > 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data path")
> > is reverted, and share results in the next few hours.
> > 
> > Ok?
> > 
> 
> SGTM, thanks.

hello,
I just finished the comparison of act_police, in the following cases:

a) revert the RCU-ification (i.e. commit 2d550dbad83c ("net/sched:
act_police: don't use spinlock in the data path"), and leave per-cpu
counters used by the rate estimator

b) keep RCU-ified configuration parameters, and protect read/update of
tcfp_toks, tcfp_ptoks and tcfp_t_c with a spinlock (code at the bottom of
this message).

## Test setup:

$DEV is a 'dummy' with clsact qdisc; the following two commands,

# test police with 'rate'
$TC filter add dev $DEV egress matchall \
 action police rate 2gbit burst 100k conform-exceed pass/pass index 100

# test police with 'avrate'
$TC filter add dev prova egress estimator 1s 8s matchall \
action police avrate 2gbit conform-exceed pass/pass index 100

are tested with the following loop:

for c in 1 2 4 8 16; do
./pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64  -t $c -n 500 -i $DEV
done


## Test results:

using  rate  | reverted   | patched
$c   | act_police (a) | act_police (b)
-++---
 1   |   3364442  |  3345580  
 2   |   2703030  |  2721919  
 4   |   1130146  |  1253555
 8   |664238  |   658777
16   |154026  |   155259


using avrate | reverted   | patched
$c   | act_police (a) | act_police (b)
-++---
 1   |   3621796  |  3658472 
 2   |   3075589  |  3548135  
 4   |   2313314  |  3343717
 8   |768458  |  3260480
16   |16  |  3254128


So, 'avrate' still gets a significant improvement, because the 'conform/exceed'
decision doesn't need the spinlock in this case. The estimation is probably
less accurate, because it uses per-CPU variables: if this is not acceptable,
then we also need to revert 93be42f9173b ("net/sched: act_police: use per-cpu
counters").


## patch code:

-- >8 --
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 052855d..42db852 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -27,10 +27,7 @@ struct tcf_police_params {
u32 tcfp_ewma_rate;
s64 tcfp_burst;
u32 tcfp_mtu;
-   s64 tcfp_toks;
-   s64 tcfp_ptoks;
s64 tcfp_mtu_ptoks;
-   s64 tcfp_t_c;
struct psched_ratecfg   rate;
boolrate_present;
struct psched_ratecfg   peak;
@@ -40,6 +37,9 @@ struct tcf_police_params {
 
 struct tcf_police {
struct tc_actioncommon;
+   s64 tcfp_toks;
+   s64 tcfp_ptoks;
+   s64 tcfp_t_c;
struct tcf_police_params __rcu *params;
 };
 
@@ -186,12 +186,6 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
}
 
new->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
-   new->tcfp_toks = new->tcfp_burst;
-   if (new->peak_present) {
-   new->tcfp_mtu_ptoks = (s64)psched_l2t_ns(&new->peak,
-new->tcfp_mtu);
-   new->tcfp_ptoks = new->tcfp_mtu_ptoks;
-   }
 
if (tb[TCA_POLICE_AVRATE])
new->tcfp_ewma_rate = nla_get_u32(tb[TCA_POLICE_AVRATE]);
@@ -207,7 +201,14 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
}
 
spin_lock_bh(&police->tcf_lock);
-   new->tcfp_t_c = ktime_get_ns();
+   police->tcfp_t_c = ktime_get_ns();
+   police->tcfp_toks = new->tcfp_burst;
+   if (new->peak_present) {
+   

selftests: net: udpgro.sh hangs on DUT devices running Linux -next

2018-11-16 Thread Naresh Kamboju
Kernel selftests: net: udpgro.sh hangs / waits forever on x86_64 and
arm32 devices running Linux -next. The test passes on arm64 devices.

Do you see this problem?

Short error log:
-
ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
not exist (do you need to insmod?)

Short error log with debug,

+ ip netns exec ns-peer-pwnOKK ip6tables -t nat -I PREROUTING -d
2001:db8::1 -j DNAT --to-destination 2001:db8::3
ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
not exist (do you need to insmod?)
Perhaps ip6tables or your kernel needs to be upgraded.
+ pid=3880
+ ip netns exec ns-peer-pwnOKK ./udpgso_bench_rx -G -b 2001:db8::1 -n 0
+ sleep 0.1
+ ip netns exec ns-peer-pwnOKK ./udpgso_bench_rx -b 2001:db8::3 -r -n 10 -l 1452
+ ./udpgso_bench_tx -l 4 -6 -D 2001:db8::1 -M 1 -s 14520 -S 0
+ kill -INT 3880
++ jobs -p
+ wait 3880 3881
# Here it is waiting forever.

Long test log,
selftests: net: udpgro.sh
[ 1009.614607] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1009.737943] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
ipv4
[ 1010.053106] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1010.162944] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 no GRO  ok
 no GRO chk cmsg ./udpgso_bench_rx: wrong
packet number! got 7, expected 10

[ 1010.562502] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1010.689950] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
failed
[ 1011.066820] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1011.188955] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 GRO ok
[ 1011.550350] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1011.673963] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 GRO chk cmsg ok
[ 1012.117313] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1012.242831] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 GRO with custom segment size ok
[ 1012.621189] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1012.738940] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 1012.896143] kauditd_printk_skb: 22 callbacks suppressed
[ 1012.896143] audit: type=1325 audit(1542288720.207:2489): table=nat
family=2 entries=0
[ 1012.909742] audit: type=1300 audit(1542288720.207:2489):
arch=c03e syscall=313 success=yes exit=0 a0=2 a1=418424 a2=0 a3=2
items=0 ppid=7763 pid=21917 auid=4294967295 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295
comm=\"modprobe\" exe=\"/bin/kmod\" subj=kernel key=(null)
[ 1012.936822] audit: type=1327 audit(1542288720.207:2489):
proctitle=2F7362696E2F6D6F6470726F6265002D71002D2D0069707461626C655F6E6174
[ 1012.948653] audit: type=1325 audit(1542288720.175:2490): table=nat
family=2 entries=0
[ 1012.956524] audit: type=1300 audit(1542288720.175:2490):
arch=c03e syscall=55 success=yes exit=0 a0=4 a1=0 a2=40
a3=7ffeb4e82700 items=0 ppid=21878 pid=21916 auid=4294967295 uid=0
gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0
ses=4294967295 comm=\"iptables\" exe=\"/usr/sbin/xtables-multi\"
subj=kernel key=(null)
[ 1012.985334] audit: type=1327 audit(1542288720.175:2490):
proctitle=69707461626C6573002D74006E6174002D4900505245524F5554494E47002D64003139322E3136382E312E31002D6A00444E4154002D2D746F2D64657374696E6174696F6E003139322E3136382E312E33
[ 1013.005799] audit: type=1325 audit(1542288720.221:2491): table=nat
family=2 entries=5
[ 1013.013652] audit: type=1300 audit(1542288720.221:2491):
arch=c03e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=b8a870
items=0 ppid=21878 pid=21916 auid=4294967295 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295
comm=\"iptables\" exe=\"/usr/sbin/xtables-multi\" subj=kernel
key=(null)
[ 1013.042023] audit: type=1327 audit(1542288720.221:2491):
proctitle=69707461626C6573002D74006E6174002D4900505245524F5554494E47002D64003139322E3136382E312E31002D6A00444E4154002D2D746F2D64657374696E6174696F6E003139322E3136382E312E33
 GRO with custom segment size cmsg   ok
 bad GRO lookup  ok
[ 1013.331549] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1013.455957] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
ipv6
 no GRO  ./udpgso_bench_rx: wrong
packet number! got 7, expected 10

[ 1013.836277] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1013.952959] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
failed
[ 1014.338978] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1014.462967] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 no GRO chk cmsg ok
[ 1014.839919] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
[ 1014.965962] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
 GRO 

Re: [RFC v1 2/3] vxlan: add support for underlay in non-default VRF

2018-11-16 Thread Alexis Bauvin
On 16 Nov 2018, at 08:37, David Ahern wrote:
> On 11/15/18 2:05 AM, Alexis Bauvin wrote:
>> On 14 Nov 2018, at 20:58, David Ahern wrote:
>>> 
>>> you are making this more specific than it needs to be 
>>> 
>>> On 11/14/18 1:31 AM, Alexis Bauvin wrote:
 diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
 index 27bd586b94b0..7477b5510a04 100644
 --- a/drivers/net/vxlan.c
 +++ b/drivers/net/vxlan.c
 @@ -208,11 +208,23 @@ static inline struct vxlan_rdst 
 *first_remote_rtnl(struct vxlan_fdb *fdb)
return list_first_entry(&fdb->remotes, struct vxlan_rdst, list);
 }
 
 +static int vxlan_get_l3mdev(struct net *net, int ifindex)
 +{
 +  struct net_device *dev;
 +
 +  dev = __dev_get_by_index(net, ifindex);
 +  while (dev && !netif_is_l3_master(dev))
 +  dev = netdev_master_upper_dev_get(dev);
 +
 +  return dev ? dev->ifindex : 0;
 +}
>>> 
>>> l3mdev_master_ifindex_by_index should work instead of defining this for
>>> vxlan.
>>> 
>>> But I do not believe you need this function.
>> 
>> l3mdev_master_ifindex_by_index does not recursively climb up the master
>> chain. This means that if the l3mdev is not a direct master of the device,
>> it will not be found.
>> 
>> E.g., calling l3mdev_master_ifindex_by_index with the index of eth0 will
>> return 0:
>> 
>> +--+ +-+ +--+
>> |  | | | |  |
>> | eth0 +-+ br0 +-+ vrf-blue |
>> |  | | | |  |
>> +--+ +-+ +--+
>> 
> 
> eth0 is not the L3/router interface in this picture; br0 is. There
> should not be a need for invoking l3mdev_master_ifindex_by_index on eth0.
> 
> What device stacking are you expecting to handle with vxlan devices?
> vxlan on eth0 with vxlan devices in a VRF? vxlan devices into a bridge
> with the bridge (or SVI) enslaved to a VRF?

The case I am trying to cover here is the user creating a VXLAN device with eth0
as its lower device (ip link add vxlan0 type vxlan ... dev eth0), thus ignoring
the fact that it should be br0 (the actual L3 interface). In this case, the only
information available from the module's point of view is eth0. I may be wrong,
but eth0 is indirectly "part" of vrf-blue (even if it is only L2), as packets
flowing in from it would land in vrf-blue if they were routed at L3.

As for the device stacking, I am only interested in the VXLAN underlay: the
VXLAN device itself could be in a specific VRF or not; that should not
influence its underlay.

+--+ +-+
|  | | |
| vrf-blue | | vrf-red |
|  | | |
++-+ +++
 ||
 ||
++-+ +++
|  | | |
| br-blue  | | br-red  |
|  | | |
++-+ +---+-+---+
 |   | |
 | +-+ +-+
 | | |
++-++--++   +++
|  |  lower device  |   |   | |
|   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
|  ||   |   | |
+--++---+   +-+


While I don't see any use case for having a bridged uplink when using VXLAN,
someone may, and they would see different behavior depending on the lower
device. In the above example, vxlan-red's lower device should be br-blue, but
a user would expect the underlay VRF (vrf-blue) to still be taken into account
if eth0 were used as the lower device.

A different approach would be to check whether the lower device is a bridge
and, if not, fetch its potential master bridge. Then, with this L3/router
interface, we fetch the l3mdev (if any) with l3mdev_master_ifindex_by_index,
as sketched below.
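
Roughly (an untested sketch; the function name vxlan_get_underlay_l3mdev is
hypothetical), that alternative could look like:

static int vxlan_get_underlay_l3mdev(struct net *net, int ifindex)
{
	struct net_device *dev = __dev_get_by_index(net, ifindex);

	/* If the lower device is not itself the L3/router interface,
	 * climb one level to its bridge master (eth0 -> br0 above). */
	if (dev && !netif_is_bridge_master(dev) && !netif_is_l3_master(dev))
		dev = netdev_master_upper_dev_get(dev);

	/* Then let the l3mdev layer resolve the VRF, if any. */
	return dev ? l3mdev_master_ifindex_by_index(net, dev->ifindex) : 0;
}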

> 
>> This is because the underlying l3mdev_master_dev_rcu function fetches the 
>> master
>> (br0 in this case), checks whether it is an l3mdev (which it is not), and
>> returns its index if so.
>> 
>> So if using l3mdev_master_dev_rcu, using eth0 as a lower device will still 
>> bind
>> to no specific device, thus in the default VRF.
>> 
>> Maybe I should have patched l3mdev_master_dev_rcu to do a recursive 
>> resolution
>> (as vxlan_get_l3mdev does), but I don’t know the impact of such a change.
> 
> no, that is definitely the wrong approach.

Ok! What is the best approach in your opinion?


[PATCH v3 1/4] bpf: allow zero-initializing hash map seed

2018-11-16 Thread Lorenz Bauer
Add a new flag BPF_F_ZERO_SEED, which forces a hash map
to initialize its seed to zero. This is useful when doing
performance analysis of both individual BPF programs and
the kernel's hash table implementation.

Signed-off-by: Lorenz Bauer 
---
 include/uapi/linux/bpf.h |  3 +++
 kernel/bpf/hashtab.c | 13 +++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 47d606d744cc..8c01b89a4cb4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -269,6 +269,9 @@ enum bpf_attach_type {
 /* Flag for stack_map, store build_id+offset instead of pointer */
 #define BPF_F_STACK_BUILD_ID   (1U << 5)
 
+/* Zero-initialize hash function seed. This should only be used for testing. */
+#define BPF_F_ZERO_SEED   (1U << 6)
+
 enum bpf_stack_build_id_status {
/* user space need an empty entry to identify end of a trace */
BPF_STACK_BUILD_ID_EMPTY = 0,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2c1790288138..4b7c76765d9d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -23,7 +23,7 @@
 
 #define HTAB_CREATE_FLAG_MASK  \
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |\
-BPF_F_RDONLY | BPF_F_WRONLY)
+BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_ZERO_SEED)
 
 struct bucket {
struct hlist_nulls_head head;
@@ -244,6 +244,7 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 */
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
+   bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
int numa_node = bpf_map_attr_numa_node(attr);
 
BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
@@ -257,6 +258,10 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 */
return -EPERM;
 
+   if (zero_seed && !capable(CAP_SYS_ADMIN))
+   /* Guard against local DoS, and discourage production use. */
+   return -EPERM;
+
if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK)
/* reserved bits should not be used */
return -EINVAL;
@@ -373,7 +378,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
if (!htab->buckets)
goto free_htab;
 
-   htab->hashrnd = get_random_int();
+   if (htab->map.map_flags & BPF_F_ZERO_SEED)
+   htab->hashrnd = 0;
+   else
+   htab->hashrnd = get_random_int();
+
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
raw_spin_lock_init(&htab->buckets[i].lock);
-- 
2.17.1
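
As an illustration only (not part of this series), user space could request a
deterministic seed roughly as follows, using the bpf_create_map helper already
used by the selftests; with this patch the call fails with -EPERM unless the
caller has CAP_SYS_ADMIN. The map dimensions are arbitrary:

	long long key, value;
	int fd;

	/* hash map with a zeroed seed, for reproducible iteration order */
	fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
			    1024, BPF_F_ZERO_SEED);
	if (fd < 0)
		perror("bpf_create_map");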



[PATCH v3 0/4] bpf: allow zero-initialising hash map seed

2018-11-16 Thread Lorenz Bauer
Allow forcing the seed of a hash table to zero, for deterministic
execution during benchmarking and testing.

Changes from v2:
* Change ordering of BPF_F_ZERO_SEED in linux/bpf.h

Comments addressed from v1:
* Add comment to discourage production use to linux/bpf.h
* Require CAP_SYS_ADMIN

Lorenz Bauer (4):
  bpf: allow zero-initializing hash map seed
  bpf: move BPF_F_QUERY_EFFECTIVE after map flags
  tools: sync linux/bpf.h
  tools: add selftest for BPF_F_ZERO_SEED

 include/uapi/linux/bpf.h|  9 ++--
 kernel/bpf/hashtab.c| 13 -
 tools/include/uapi/linux/bpf.h  | 13 +++--
 tools/testing/selftests/bpf/test_maps.c | 68 +
 4 files changed, 84 insertions(+), 19 deletions(-)

-- 
2.17.1



[PATCH v3 3/4] tools: sync linux/bpf.h

2018-11-16 Thread Lorenz Bauer
Synchronize changes to linux/bpf.h from
* "bpf: allow zero-initializing hash map seed"
* "bpf: move BPF_F_QUERY_EFFECTIVE after map flags"

Signed-off-by: Lorenz Bauer 
---
 tools/include/uapi/linux/bpf.h | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 852dc17ab47a..05d95290b848 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -257,9 +257,6 @@ enum bpf_attach_type {
 /* Specify numa node during map creation */
#define BPF_F_NUMA_NODE   (1U << 2)
 
-/* flags for BPF_PROG_QUERY */
-#define BPF_F_QUERY_EFFECTIVE  (1U << 0)
-
 #define BPF_OBJ_NAME_LEN 16U
 
 /* Flags for accessing BPF object */
@@ -269,6 +266,12 @@ enum bpf_attach_type {
 /* Flag for stack_map, store build_id+offset instead of pointer */
 #define BPF_F_STACK_BUILD_ID   (1U << 5)
 
+/* Zero-initialize hash function seed. This should only be used for testing. */
+#define BPF_F_ZERO_SEED   (1U << 6)
+
+/* flags for BPF_PROG_QUERY */
+#define BPF_F_QUERY_EFFECTIVE  (1U << 0)
+
 enum bpf_stack_build_id_status {
/* user space need an empty entry to identify end of a trace */
BPF_STACK_BUILD_ID_EMPTY = 0,
@@ -2201,6 +2204,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, *struct bpf_sock*
+ * return is from reuse->socks[] using hash of the packet.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
  * Description
@@ -2233,6 +2238,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, *struct bpf_sock*
+ * return is from reuse->socks[] using hash of the packet.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.17.1



[PATCH v3 4/4] tools: add selftest for BPF_F_ZERO_SEED

2018-11-16 Thread Lorenz Bauer
Check that iterating two separate hash maps produces the same
order of keys if BPF_F_ZERO_SEED is used.

Signed-off-by: Lorenz Bauer 
---
 tools/testing/selftests/bpf/test_maps.c | 68 +
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 4db2116e52be..9f0a5b16a246 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -258,23 +258,35 @@ static void test_hashmap_percpu(int task, void *data)
close(fd);
 }
 
+static int helper_fill_hashmap(int max_entries)
+{
+   int i, fd, ret;
+   long long key, value;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
+   max_entries, map_flags);
+   CHECK(fd < 0,
+ "failed to create hashmap",
+ "err: %s, flags: 0x%x\n", strerror(errno), map_flags);
+
+   for (i = 0; i < max_entries; i++) {
+   key = i; value = key;
+   ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+   CHECK(ret != 0,
+ "can't update hashmap",
+ "err: %s\n", strerror(ret));
+   }
+
+   return fd;
+}
+
 static void test_hashmap_walk(int task, void *data)
 {
int fd, i, max_entries = 1000;
long long key, value, next_key;
bool next_key_valid = true;
 
-   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
-   max_entries, map_flags);
-   if (fd < 0) {
-   printf("Failed to create hashmap '%s'!\n", strerror(errno));
-   exit(1);
-   }
-
-   for (i = 0; i < max_entries; i++) {
-   key = i; value = key;
-   assert(bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST) == 0);
-   }
+   fd = helper_fill_hashmap(max_entries);
 
for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
 &next_key) == 0; i++) {
@@ -306,6 +318,39 @@ static void test_hashmap_walk(int task, void *data)
close(fd);
 }
 
+static void test_hashmap_zero_seed(void)
+{
+   int i, first, second, old_flags;
+   long long key, next_first, next_second;
+
+   old_flags = map_flags;
+   map_flags |= BPF_F_ZERO_SEED;
+
+   first = helper_fill_hashmap(3);
+   second = helper_fill_hashmap(3);
+
+   for (i = 0; ; i++) {
+   void *key_ptr = !i ? NULL : &key;
+
+   if (bpf_map_get_next_key(first, key_ptr, &next_first) != 0)
+   break;
+
+   CHECK(bpf_map_get_next_key(second, key_ptr, &next_second) != 0,
+ "next_key for second map must succeed",
+ "key_ptr: %p", key_ptr);
+   CHECK(next_first != next_second,
+ "keys must match",
+ "i: %d first: %lld second: %lld\n", i,
+ next_first, next_second);
+
+   key = next_first;
+   }
+
+   map_flags = old_flags;
+   close(first);
+   close(second);
+}
+
 static void test_arraymap(int task, void *data)
 {
int key, next_key, fd;
@@ -1534,6 +1579,7 @@ static void run_all_tests(void)
test_hashmap(0, NULL);
test_hashmap_percpu(0, NULL);
test_hashmap_walk(0, NULL);
+   test_hashmap_zero_seed();
 
test_arraymap(0, NULL);
test_arraymap_percpu(0, NULL);
-- 
2.17.1



Re: [PATCH net-next 6/8] net: eth: altera: tse: add support for ptp and timestamping

2018-11-16 Thread Dalon Westergreen
On Thu, 2018-11-15 at 18:14 -0800, Richard Cochran wrote:
> On Thu, Nov 15, 2018 at 06:55:29AM -0800, Dalon Westergreen wrote:
> > Sure, I would like to keep the debugfs entries for disabling freq
> > correction, and reading the current scaled_ppm value. I intend to use
> > these to tune an external VCXO. If there is a better way to do this,
> > please let me know.
> 
> Yes, there is.  The external VCXO should be a proper PHC.  Then, with
> a minor change to the linuxptp stack (already in the pipe), you can
> just use that.
> 
> You should not disable frequency correction in the driver.  Leave that
> decision to the user space PTP stack.

Good to know, thanks.

> 
> > I would prefer to keep altera just to be consistent with the altera_tse
> > stuff, and I intend to reuse this code for a 10GbE driver, so perhaps
> > altera_tod to reference the FPGA IP name?
> 
> So the IP core is called "tod"?  Really?

Yes, I am afraid so: "Time of Day".

--dalon

> 
> Thanks,
> Richard



Compliment of the day to you Dear Friend.

2018-11-16 Thread Mrs Amina.Kadi



Compliment of the day to you Dear Friend.

Dear Friend.
 
  I am Mrs. Amina Kadi. am sending this brief letter to solicit your
partnership to transfer $5.5 million US Dollars. I shall send you
more information and procedures when I receive positive response from
you.

Mrs. Amina Kadi


Re: [PATCH 00/10] add flow_rule infrastructure

2018-11-16 Thread Or Gerlitz
On Fri, Nov 16, 2018 at 3:43 AM Pablo Neira Ayuso  wrote:
> This patchset introduces a kernel intermediate representation (IR) to
> express ACL hardware offloads, this is heavily based on the existing
> flow dissector infrastructure and the TC actions. This IR can be used by
> different frontend ACL interfaces such as ethtool_rxnfc and tc to

any reason to keep aRFS out?

> represent ACL hardware offloads. Main goal is to simplify the
> development of ACL hardware offloads for the existing frontend
> interfaces, the idea is that driver developers do not need to add one
> specific parser for each ACL frontend, instead each frontend can just
> generate this flow_rule IR and pass it to drivers to populate the
> hardware IR.

yeah, but we are adding one more chain (IR), today we have

kernel frontend U/API X --> kernel parser Y --> driver --> HW API
kernel frontend U/API Z --> kernel parser W --> driver --> HW API

and we move to

kernel frontend U/API X --> kernel parser Y --> IR --> driver --> HW API
kernel frontend U/API Z --> kernel parser W --> IR --> driver --> HW API

So instead of

Y --> HW
W --> HW

we have IR --> HW, while losing the TC semantics and spirit in the drivers.

We could have normalized all the U/APIs to use the flow dissectors and
the TC actions, and then have the drivers deal with TC only.

IMHO flow dissectors and TC actions are the right approach to deal with ACL
HW offloading. They properly reflect how ACLs work in modern HW pipelines.

>
> .   ethtool_rxnfc   tc
>|   (ioctl)(netlink)
>|  | | translate native
>   Frontend |  | |  interface representation
>|  | |  to flow_rule IR
>|  | |
> . \/\/
> . flow_rule IR
>||
>Drivers || parsing of flow_rule IR
>||  to populate hardware IR
>|   \/
> .  hardware IR (driver)
>
> For design and implementation details, please have a look at:
>
> https://lwn.net/Articles/766695/

I will look further next week, but as this is not marked as RFC (and not with
net-next in the title), can we consider this still a discussion and not a
final review?

> As an example, with this patchset, it should be possible to simplify the
> existing net/qede driver which already has two parsers to populate the
> hardware IR, one for ethtool_rxnfc interface and another for tc.

I think it would be fair to ask for one such driver to be ported, to see the
impact/benefit.

Or.


[PATCH v3 2/4] bpf: move BPF_F_QUERY_EFFECTIVE after map flags

2018-11-16 Thread Lorenz Bauer
BPF_F_QUERY_EFFECTIVE is in the middle of the flags valid
for BPF_MAP_CREATE. Move it to its own section to reduce confusion.

Signed-off-by: Lorenz Bauer 
---
 include/uapi/linux/bpf.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8c01b89a4cb4..05d95290b848 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -257,9 +257,6 @@ enum bpf_attach_type {
 /* Specify numa node during map creation */
#define BPF_F_NUMA_NODE   (1U << 2)
 
-/* flags for BPF_PROG_QUERY */
-#define BPF_F_QUERY_EFFECTIVE  (1U << 0)
-
 #define BPF_OBJ_NAME_LEN 16U
 
 /* Flags for accessing BPF object */
@@ -272,6 +269,9 @@ enum bpf_attach_type {
 /* Zero-initialize hash function seed. This should only be used for testing. */
+#define BPF_F_ZERO_SEED   (1U << 6)
 
+/* flags for BPF_PROG_QUERY */
+#define BPF_F_QUERY_EFFECTIVE  (1U << 0)
+
 enum bpf_stack_build_id_status {
/* user space need an empty entry to identify end of a trace */
BPF_STACK_BUILD_ID_EMPTY = 0,
-- 
2.17.1



[PATCH net V3 4/5] net/smc: atomic SMCD cursor handling

2018-11-16 Thread Ursula Braun
Running uperf tests with SMCD on LPARs results in corrupted cursors.
SMCD cursors should be treated atomically to fix cursor corruption.

Signed-off-by: Ursula Braun 
---
 net/smc/smc_cdc.c | 24 +--
 net/smc/smc_cdc.h | 58 +--
 2 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index ed5dcf03fe0b..18c047c155fd 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -177,23 +177,24 @@ void smc_cdc_tx_dismiss_slots(struct smc_connection *conn)
 int smcd_cdc_msg_send(struct smc_connection *conn)
 {
struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
+   union smc_host_cursor curs;
struct smcd_cdc_msg cdc;
int rc, diff;
 
memset(&cdc, 0, sizeof(cdc));
cdc.common.type = SMC_CDC_MSG_TYPE;
-   cdc.prod_wrap = conn->local_tx_ctrl.prod.wrap;
-   cdc.prod_count = conn->local_tx_ctrl.prod.count;
-
-   cdc.cons_wrap = conn->local_tx_ctrl.cons.wrap;
-   cdc.cons_count = conn->local_tx_ctrl.cons.count;
-   cdc.prod_flags = conn->local_tx_ctrl.prod_flags;
-   cdc.conn_state_flags = conn->local_tx_ctrl.conn_state_flags;
+   curs.acurs.counter = atomic64_read(&conn->local_tx_ctrl.prod.acurs);
+   cdc.prod.wrap = curs.wrap;
+   cdc.prod.count = curs.count;
+   curs.acurs.counter = atomic64_read(&conn->local_tx_ctrl.cons.acurs);
+   cdc.cons.wrap = curs.wrap;
+   cdc.cons.count = curs.count;
+   cdc.cons.prod_flags = conn->local_tx_ctrl.prod_flags;
+   cdc.cons.conn_state_flags = conn->local_tx_ctrl.conn_state_flags;
rc = smcd_tx_ism_write(conn, &cdc, sizeof(cdc), 0, 1);
if (rc)
return rc;
-   smc_curs_copy(&conn->rx_curs_confirmed, &conn->local_tx_ctrl.cons,
- conn);
+   smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
/* Calculate transmitted data and increment free send buffer space */
diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
 &conn->tx_curs_sent);
@@ -331,13 +332,16 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct 
smc_cdc_msg *cdc)
 static void smcd_cdc_rx_tsklet(unsigned long data)
 {
struct smc_connection *conn = (struct smc_connection *)data;
+   struct smcd_cdc_msg *data_cdc;
struct smcd_cdc_msg cdc;
struct smc_sock *smc;
 
if (!conn)
return;
 
-   memcpy(, conn->rmb_desc->cpu_addr, sizeof(cdc));
+   data_cdc = (struct smcd_cdc_msg *)conn->rmb_desc->cpu_addr;
+   smcd_curs_copy(&cdc.prod, &data_cdc->prod, conn);
+   smcd_curs_copy(&cdc.cons, &data_cdc->cons, conn);
smc = container_of(conn, struct smc_sock, conn);
smc_cdc_msg_recv(smc, (struct smc_cdc_msg *)&cdc);
 }
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 934df4473a7c..884218910353 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -50,19 +50,29 @@ struct smc_cdc_msg {
u8  reserved[18];
 } __packed;/* format defined in RFC7609 */
 
+/* SMC-D cursor format */
+union smcd_cdc_cursor {
+   struct {
+   u16 wrap;
+   u32 count;
+   struct smc_cdc_producer_flags   prod_flags;
+   struct smc_cdc_conn_state_flags conn_state_flags;
+   } __packed;
+#ifdef KERNEL_HAS_ATOMIC64
+   atomic64_t  acurs;  /* for atomic processing */
+#else
+   u64 acurs;  /* for atomic processing */
+#endif
+} __aligned(8);
+
 /* CDC message for SMC-D */
 struct smcd_cdc_msg {
struct smc_wr_rx_hdr common;/* Type = 0xFE */
u8 res1[7];
-   u16 prod_wrap;
-   u32 prod_count;
-   u8 res2[2];
-   u16 cons_wrap;
-   u32 cons_count;
-   struct smc_cdc_producer_flags   prod_flags;
-   struct smc_cdc_conn_state_flags conn_state_flags;
+   union smcd_cdc_cursor   prod;
+   union smcd_cdc_cursor   cons;
u8 res3[8];
-} __packed;
+} __aligned(8);
 
 static inline bool smc_cdc_rxed_any_close(struct smc_connection *conn)
 {
@@ -135,6 +145,21 @@ static inline void smc_curs_copy_net(union smc_cdc_cursor 
*tgt,
 #endif
 }
 
+static inline void smcd_curs_copy(union smcd_cdc_cursor *tgt,
+ union smcd_cdc_cursor *src,
+ struct smc_connection *conn)
+{
+#ifndef KERNEL_HAS_ATOMIC64
+   unsigned long flags;
+
+   spin_lock_irqsave(&conn->acurs_lock, flags);
+   tgt->acurs = src->acurs;
+   spin_unlock_irqrestore(&conn->acurs_lock, flags);
+#else
+   atomic64_set(&tgt->acurs, atomic64_read(&src->acurs));
+#endif
+}
+
 /* calculate cursor difference between old and new, where old <= new */
 static inline int smc_curs_diff(unsigned int size,
union smc_host_cursor *old,
@@ -222,12 +247,17 @@ static inline void smcr_cdc_msg_to_host(struct 
smc_host_cdc_msg *local,
 static inline void 

[PATCH net V3 3/5] net/smc: add SMC-D shutdown signal

2018-11-16 Thread Ursula Braun
From: Hans Wippel 

When an SMC-D link group is freed, a shutdown signal should be sent to
the peer to indicate that the link group is invalid. This patch adds the
shutdown signal to the SMC code.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
---
 net/smc/smc_core.c | 10 --
 net/smc/smc_core.h |  3 ++-
 net/smc/smc_ism.c  | 43 ---
 net/smc/smc_ism.h  |  1 +
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 3c023de58afd..1c9fa7f0261a 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -184,6 +184,8 @@ static void smc_lgr_free_work(struct work_struct *work)
 
if (!lgr->is_smcd && lnk->state != SMC_LNK_INACTIVE)
smc_llc_link_inactive(lnk);
+   if (lgr->is_smcd)
+   smc_ism_signal_shutdown(lgr);
smc_lgr_free(lgr);
}
 }
@@ -485,7 +487,7 @@ void smc_port_terminate(struct smc_ib_device *smcibdev, u8 
ibport)
 }
 
 /* Called when SMC-D device is terminated or peer is lost */
-void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid)
+void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid, unsigned short 
vlan)
 {
struct smc_link_group *lgr, *l;
LIST_HEAD(lgr_free_list);
@@ -495,7 +497,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid)
list_for_each_entry_safe(lgr, l, &smc_lgr_list.list, list) {
if (lgr->is_smcd && lgr->smcd == dev &&
(!peer_gid || lgr->peer_gid == peer_gid) &&
-   !list_empty(&lgr->list)) {
+   (vlan == VLAN_VID_MASK || lgr->vlan_id == vlan)) {
__smc_lgr_terminate(lgr);
list_move(&lgr->list, &lgr_free_list);
}
@@ -506,6 +508,8 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid)
list_for_each_entry_safe(lgr, l, &lgr_free_list, list) {
list_del_init(&lgr->list);
cancel_delayed_work_sync(&lgr->free_work);
+   if (!peer_gid && vlan == VLAN_VID_MASK) /* dev terminated? */
+   smc_ism_signal_shutdown(lgr);
smc_lgr_free(lgr);
}
 }
@@ -1026,6 +1030,8 @@ void smc_core_exit(void)
smc_llc_link_inactive(lnk);
}
cancel_delayed_work_sync(&lgr->free_work);
+   if (lgr->is_smcd)
+   smc_ism_signal_shutdown(lgr);
smc_lgr_free(lgr); /* free link group */
}
 }
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 5bc6cbaf0ed5..cf98f4d6093e 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -247,7 +247,8 @@ void smc_lgr_free(struct smc_link_group *lgr);
 void smc_lgr_forget(struct smc_link_group *lgr);
 void smc_lgr_terminate(struct smc_link_group *lgr);
 void smc_port_terminate(struct smc_ib_device *smcibdev, u8 ibport);
-void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid);
+void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid,
+   unsigned short vlan);
 int smc_buf_create(struct smc_sock *smc, bool is_smcd);
 int smc_uncompress_bufsize(u8 compressed);
 int smc_rmb_rtoken_handling(struct smc_connection *conn,
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index e36f21ce7252..2fff79db1a59 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -187,22 +187,28 @@ struct smc_ism_event_work {
 #define ISM_EVENT_REQUEST  0x0001
 #define ISM_EVENT_RESPONSE 0x0002
 #define ISM_EVENT_REQUEST_IR   0x0001
+#define ISM_EVENT_CODE_SHUTDOWN0x80
 #define ISM_EVENT_CODE_TESTLINK0x83
 
+union smcd_sw_event_info {
+   u64 info;
+   struct {
+   u8  uid[SMC_LGR_ID_SIZE];
+   unsigned short  vlan_id;
+   u16 code;
+   };
+};
+
 static void smcd_handle_sw_event(struct smc_ism_event_work *wrk)
 {
-   union {
-   u64 info;
-   struct {
-   u32 uid;
-   unsigned short  vlanid;
-   u16 code;
-   };
-   } ev_info;
+   union smcd_sw_event_info ev_info;
 
+   ev_info.info = wrk->event.info;
switch (wrk->event.code) {
+   case ISM_EVENT_CODE_SHUTDOWN:   /* Peer shut down DMBs */
+   smc_smcd_terminate(wrk->smcd, wrk->event.tok, ev_info.vlan_id);
+   break;
case ISM_EVENT_CODE_TESTLINK:   /* Activity timer */
-   ev_info.info = wrk->event.info;
if (ev_info.code == ISM_EVENT_REQUEST) {
ev_info.code = ISM_EVENT_RESPONSE;
wrk->smcd->ops->signal_event(wrk->smcd,
@@ -215,6 +221,21 @@ static void smcd_handle_sw_event(struct smc_ism_event_work 
*wrk)
}
 }
 
+int smc_ism_signal_shutdown(struct smc_link_group *lgr)
+{
+ 

[PATCH net V3 2/5] net/smc: use queue pair number when matching link group

2018-11-16 Thread Ursula Braun
From: Karsten Graul 

When searching for an existing link group, the queue pair number also has
to be taken into consideration. When the SMC server sends a new number
in a CLC packet (keeping all other values equal), a new link group is to
be created on the SMC client side.

Signed-off-by: Karsten Graul 
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c   |  9 +
 net/smc/smc_core.c | 10 ++
 net/smc/smc_core.h |  2 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 84f67f601838..5fbaf1901571 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -549,7 +549,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
 
mutex_lock(&smc_create_lgr_pending);
local_contact = smc_conn_create(smc, false, aclc->hdr.flag, ibdev,
-   ibport, &aclc->lcl, NULL, 0);
+   ibport, ntoh24(aclc->qpn), &aclc->lcl,
+   NULL, 0);
if (local_contact < 0) {
if (local_contact == -ENOMEM)
reason_code = SMC_CLC_DECL_MEM;/* insufficient memory*/
@@ -620,7 +621,7 @@ static int smc_connect_ism(struct smc_sock *smc,
int rc = 0;
 
mutex_lock(&smc_create_lgr_pending);
-   local_contact = smc_conn_create(smc, true, aclc->hdr.flag, NULL, 0,
+   local_contact = smc_conn_create(smc, true, aclc->hdr.flag, NULL, 0, 0,
NULL, ismdev, aclc->gid);
if (local_contact < 0)
return smc_connect_abort(smc, SMC_CLC_DECL_MEM, 0);
@@ -1085,7 +1086,7 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc,
int *local_contact)
 {
/* allocate connection / link group */
-   *local_contact = smc_conn_create(new_smc, false, 0, ibdev, ibport,
+   *local_contact = smc_conn_create(new_smc, false, 0, ibdev, ibport, 0,
 &pclc->lcl, NULL, 0);
if (*local_contact < 0) {
if (*local_contact == -ENOMEM)
@@ -1109,7 +1110,7 @@ static int smc_listen_ism_init(struct smc_sock *new_smc,
struct smc_clc_msg_smcd *pclc_smcd;
 
pclc_smcd = smc_get_clc_msg_smcd(pclc);
-   *local_contact = smc_conn_create(new_smc, true, 0, NULL, 0, NULL,
+   *local_contact = smc_conn_create(new_smc, true, 0, NULL, 0, 0, NULL,
 ismdev, pclc_smcd->gid);
if (*local_contact < 0) {
if (*local_contact == -ENOMEM)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 18daebcef181..3c023de58afd 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -559,7 +559,7 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned 
short *vlan_id)
 
 static bool smcr_lgr_match(struct smc_link_group *lgr,
   struct smc_clc_msg_local *lcl,
-  enum smc_lgr_role role)
+  enum smc_lgr_role role, u32 clcqpn)
 {
return !memcmp(lgr->peer_systemid, lcl->id_for_peer,
   SMC_SYSTEMID_LEN) &&
@@ -567,7 +567,9 @@ static bool smcr_lgr_match(struct smc_link_group *lgr,
SMC_GID_SIZE) &&
!memcmp(lgr->lnk[SMC_SINGLE_LINK].peer_mac, lcl->mac,
sizeof(lcl->mac)) &&
-   lgr->role == role;
+   lgr->role == role &&
+   (lgr->role == SMC_SERV ||
+lgr->lnk[SMC_SINGLE_LINK].peer_qpn == clcqpn);
 }
 
 static bool smcd_lgr_match(struct smc_link_group *lgr,
@@ -578,7 +580,7 @@ static bool smcd_lgr_match(struct smc_link_group *lgr,
 
 /* create a new SMC connection (and a new link group if necessary) */
 int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
-   struct smc_ib_device *smcibdev, u8 ibport,
+   struct smc_ib_device *smcibdev, u8 ibport, u32 clcqpn,
struct smc_clc_msg_local *lcl, struct smcd_dev *smcd,
u64 peer_gid)
 {
@@ -603,7 +605,7 @@ int smc_conn_create(struct smc_sock *smc, bool is_smcd, int 
srv_first_contact,
list_for_each_entry(lgr, _lgr_list.list, list) {
write_lock_bh(&lgr->conns_lock);
if ((is_smcd ? smcd_lgr_match(lgr, smcd, peer_gid) :
-smcr_lgr_match(lgr, lcl, role)) &&
+smcr_lgr_match(lgr, lcl, role, clcqpn)) &&
!lgr->sync_err &&
lgr->vlan_id == vlan_id &&
(role == SMC_CLNT ||
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index c156674733c9..5bc6cbaf0ed5 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -262,7 +262,7 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned 
short *vlan_id);
 
 void smc_conn_free(struct smc_connection *conn);
 int smc_conn_create(struct smc_sock *smc, bool is_smcd, int srv_first_contact,
-   struct 

[PATCH net V3 5/5] net/smc: use after free fix in smc_wr_tx_put_slot()

2018-11-16 Thread Ursula Braun
From: Ursula Braun 

In smc_wr_tx_put_slot() the field pend->idx is used after being
cleared. That means idx 0 is always cleared in the wr_tx_mask.
This results in broken bookkeeping of the available WR send
payload buffers.

Signed-off-by: Ursula Braun 
---
 net/smc/smc_wr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 3c458d279855..c2694750a6a8 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -215,12 +215,14 @@ int smc_wr_tx_put_slot(struct smc_link *link,
 
pend = container_of(wr_pend_priv, struct smc_wr_tx_pend, priv);
if (pend->idx < link->wr_tx_cnt) {
+   u32 idx = pend->idx;
+
/* clear the full struct smc_wr_tx_pend including .priv */
memset(&link->wr_tx_pends[pend->idx], 0,
   sizeof(link->wr_tx_pends[pend->idx]));
memset(&link->wr_tx_bufs[pend->idx], 0,
   sizeof(link->wr_tx_bufs[pend->idx]));
-   test_and_clear_bit(pend->idx, link->wr_tx_mask);
+   test_and_clear_bit(idx, link->wr_tx_mask);
return 1;
}
 
-- 
2.16.4



[PATCH net V3 0/5] net/smc: fixes 2018-11-12

2018-11-16 Thread Ursula Braun
Dave,

here is V3 of some net/smc fixes in different areas for the net tree.

v1->v2:
   do not define 8-byte alignment for union smcd_cdc_cursor in
   patch 4/5 "net/smc: atomic SMCD cursor handling"
v2->v3:
   stay with 8-byte alignment for union smcd_cdc_cursor in
   patch 4/5 "net/smc: atomic SMCD cursor handling", but get rid of
   __packed for struct smcd_cdc_msg

Thanks, Ursula

Hans Wippel (2):
  net/smc: abort CLC connection in smc_release
  net/smc: add SMC-D shutdown signal

Karsten Graul (1):
  net/smc: use queue pair number when matching link group

Ursula Braun (2):
  net/smc: atomic SMCD cursor handling
  net/smc: use after free fix in smc_wr_tx_put_slot()

 net/smc/af_smc.c   | 11 +++
 net/smc/smc_cdc.c  | 24 --
 net/smc/smc_cdc.h  | 58 +-
 net/smc/smc_core.c | 20 +--
 net/smc/smc_core.h |  5 +++--
 net/smc/smc_ism.c  | 43 +---
 net/smc/smc_ism.h  |  1 +
 net/smc/smc_wr.c   |  4 +++-
 8 files changed, 118 insertions(+), 48 deletions(-)

-- 
2.16.4



[PATCH net V3 1/5] net/smc: abort CLC connection in smc_release

2018-11-16 Thread Ursula Braun
From: Hans Wippel 

In case of a non-blocking SMC socket, the initial CLC handshake is
performed over a blocking TCP connection in a worker. If the SMC socket
is released, smc_release has to wait for the blocking CLC socket
operations (e.g., kernel_connect) inside the worker.

This patch aborts a CLC connection when the respective non-blocking SMC
socket is released to avoid waiting on socket operations or timeouts.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 80e2119f1c70..84f67f601838 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -127,6 +127,8 @@ static int smc_release(struct socket *sock)
smc = smc_sk(sk);
 
/* cleanup for a dangling non-blocking connect */
+   if (smc->connect_info && sk->sk_state == SMC_INIT)
+   tcp_abort(smc->clcsock->sk, ECONNABORTED);
flush_work(>connect_work);
kfree(smc->connect_info);
smc->connect_info = NULL;
-- 
2.16.4



[PATCH 1/3] bpf: respect size hint to BPF_PROG_TEST_RUN if present

2018-11-16 Thread Lorenz Bauer
Use data_size_out as a size hint when copying test output to user space.
A program using BPF_PERF_OUTPUT can compare its own buffer length with
data_size_out after the syscall to detect whether truncation has taken
place. Callers which so far did not set data_size_out are not affected.

Signed-off-by: Lorenz Bauer 
---
 net/bpf/test_run.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c89c22c49015..30c57b7f4ba4 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -74,8 +74,15 @@ static int bpf_test_finish(const union bpf_attr *kattr,
 {
void __user *data_out = u64_to_user_ptr(kattr->test.data_out);
int err = -EFAULT;
+   u32 copy_size = size;
 
-   if (data_out && copy_to_user(data_out, data, size))
+   /* Clamp copy if the user has provided a size hint, but copy the full
+* buffer if not to retain old behaviour.
+*/
+   if (kattr->test.data_size_out && copy_size > kattr->test.data_size_out)
+   copy_size = kattr->test.data_size_out;
+
+   if (data_out && copy_to_user(data_out, data, copy_size))
goto out;
if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size)))
goto out;
-- 
2.17.1
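
To illustrate the intended calling convention (a sketch only; prog_fd, buf,
pkt, retval and duration are hypothetical locals): the caller passes its
buffer capacity in data_size_out and compares the value reported back after
the syscall to detect truncation, since the kernel still reports the full
output length:

	__u32 size_out = sizeof(buf);	/* size hint: capacity of buf */

	err = bpf_prog_test_run(prog_fd, 1, &pkt, sizeof(pkt),
				buf, &size_out, &retval, &duration);
	if (!err && size_out > sizeof(buf))
		printf("output truncated: program produced %u bytes\n",
		       size_out);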



[PATCH 3/3] selftests: add a test for bpf_prog_test_run output size

2018-11-16 Thread Lorenz Bauer
Make sure that bpf_prog_test_run returns the correct length
in the size_out argument and that the kernel respects the
output size hint.

Signed-off-by: Lorenz Bauer 
---
 tools/testing/selftests/bpf/test_progs.c | 34 
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 560d7527b86b..6ab98e10e86f 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -124,6 +124,39 @@ static void test_pkt_access(void)
bpf_object__close(obj);
 }
 
+static void test_output_size_hint(void)
+{
+   const char *file = "./test_pkt_access.o";
+   struct bpf_object *obj;
+   __u32 retval, size, duration;
+   int err, prog_fd;
+   char buf[10];
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &prog_fd);
+   if (err) {
+   error_cnt++;
+   return;
+   }
+
+   memset(buf, 0, sizeof(buf));
+
+   size = 5;
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   buf, &size, &retval, &duration);
+   CHECK(err || retval, "run",
+ "err %d errno %d retval %d\n",
+ err, errno, retval);
+
+   CHECK(size != sizeof(pkt_v4), "out_size",
+ "incorrect output size, want %lu have %u\n",
+ sizeof(pkt_v4), size);
+
+   CHECK(buf[5] != 0, "overflow",
+ "prog_test_run ignored size hint\n");
+
+   bpf_object__close(obj);
+}
+
 static void test_xdp(void)
 {
struct vip key4 = {.protocol = 6, .family = AF_INET};
@@ -1847,6 +1880,7 @@ int main(void)
jit_enabled = is_jit_enabled();
 
test_pkt_access();
+   test_output_size_hint();
test_xdp();
test_xdp_adjust_tail();
test_l4lb_all();
-- 
2.17.1



[PATCH 2/3] libbpf: require size hint in bpf_prog_test_run

2018-11-16 Thread Lorenz Bauer
Require size_out to be non-NULL if data_out is given. This prevents
accidental overwriting of process memory after the output buffer.

Adjust callers of bpf_prog_test_run to this behaviour.

Signed-off-by: Lorenz Bauer 
---
 tools/lib/bpf/bpf.c  |  4 +++-
 tools/testing/selftests/bpf/test_progs.c | 10 ++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc4ef50..127a9aa6170e 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -403,10 +403,12 @@ int bpf_prog_test_run(int prog_fd, int repeat, void 
*data, __u32 size,
attr.test.data_in = ptr_to_u64(data);
attr.test.data_out = ptr_to_u64(data_out);
attr.test.data_size_in = size;
+   if (data_out)
+   attr.test.data_size_out = *size_out;
attr.test.repeat = repeat;
 
ret = sys_bpf(BPF_PROG_TEST_RUN, &attr, sizeof(attr));
-   if (size_out)
+   if (data_out)
*size_out = attr.test.data_size_out;
if (retval)
*retval = attr.test.retval;
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 2d3c04f45530..560d7527b86b 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -150,6 +150,7 @@ static void test_xdp(void)
bpf_map_update_elem(map_fd, &key4, &value4, 0);
bpf_map_update_elem(map_fd, &key6, &value6, 0);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
 
@@ -158,6 +159,7 @@ static void test_xdp(void)
  "err %d errno %d retval %d size %d\n",
  err, errno, retval, size);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v6, sizeof(pkt_v6),
buf, &size, &retval, &duration);
CHECK(err || retval != XDP_TX || size != 114 ||
@@ -182,6 +184,7 @@ static void test_xdp_adjust_tail(void)
return;
}
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
 
@@ -189,6 +192,7 @@ static void test_xdp_adjust_tail(void)
  "ipv4", "err %d errno %d retval %d size %d\n",
  err, errno, retval, size);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v6, sizeof(pkt_v6),
buf, &size, &retval, &duration);
CHECK(err || retval != XDP_TX || size != 54,
@@ -252,6 +256,7 @@ static void test_l4lb(const char *file)
goto out;
bpf_map_update_elem(map_fd, &real_num, &real_def, 0);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, NUM_ITER, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
CHECK(err || retval != 7/*TC_ACT_REDIRECT*/ || size != 54 ||
@@ -259,6 +264,7 @@ static void test_l4lb(const char *file)
  "err %d errno %d retval %d size %d magic %x\n",
  err, errno, retval, size, *magic);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, NUM_ITER, &pkt_v6, sizeof(pkt_v6),
buf, &size, &retval, &duration);
CHECK(err || retval != 7/*TC_ACT_REDIRECT*/ || size != 74 ||
@@ -341,6 +347,7 @@ static void test_xdp_noinline(void)
goto out;
bpf_map_update_elem(map_fd, &real_num, &real_def, 0);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, NUM_ITER, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
CHECK(err || retval != 1 || size != 54 ||
@@ -348,6 +355,7 @@ static void test_xdp_noinline(void)
  "err %d errno %d retval %d size %d magic %x\n",
  err, errno, retval, size, *magic);
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, NUM_ITER, &pkt_v6, sizeof(pkt_v6),
buf, &size, &retval, &duration);
CHECK(err || retval != 1 || size != 74 ||
@@ -1795,6 +1803,7 @@ static void test_queue_stack_map(int type)
pkt_v4.iph.saddr = vals[MAP_SIZE - 1 - i] * 5;
}
 
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
if (err || retval || size != sizeof(pkt_v4) ||
@@ -1808,6 +1817,7 @@ static void test_queue_stack_map(int type)
  err, errno, retval, size, iph->daddr);
 
/* Queue is empty, program should return TC_ACT_SHOT */
+   size = sizeof(buf);
err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
buf, &size, &retval, &duration);
CHECK(err || retval != 2 /* TC_ACT_SHOT */|| size != sizeof(pkt_v4),
-- 
2.17.1



[PATCH 0/3] Fix unsafe BPF_PROG_TEST_RUN interface

2018-11-16 Thread Lorenz Bauer
Right now, there is no safe way to use BPF_PROG_TEST_RUN with data_out.
This is because bpf_test_finish copies the output buffer to user space
without checking its size. This can lead to the kernel overwriting
data in user space after the buffer if xdp_adjust_head and friends are
in play.

Fix this by using bpf_attr.test.data_size_out as a size hint. The old
behaviour is retained if size_hint is zero.

Interestingly, do_test_single() in test_verifier.c already assumes
that this is the intended use of data_size_out, and sets it to the
output buffer size.

Lorenz Bauer (3):
  bpf: respect size hint to BPF_PROG_TEST_RUN if present
  libbpf: require size hint in bpf_prog_test_run
  selftests: add a test for bpf_prog_test_run output size

 net/bpf/test_run.c   |  9 -
 tools/lib/bpf/bpf.c  |  4 ++-
 tools/testing/selftests/bpf/test_progs.c | 44 
 3 files changed, 55 insertions(+), 2 deletions(-)

-- 
2.17.1



Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread Eric Dumazet



On 11/16/2018 03:28 AM, Davide Caratti wrote:
> On Thu, 2018-11-15 at 05:53 -0800, Eric Dumazet wrote:
>>
>> On 11/15/2018 03:43 AM, Davide Caratti wrote:
>>> On Wed, 2018-11-14 at 22:46 -0800, Eric Dumazet wrote:
 On 09/13/2018 10:29 AM, Davide Caratti wrote:
> use RCU instead of spinlocks, to protect concurrent read/write on
> act_police configuration. This reduces the effects of contention in the
> data path, in case multiple readers are present.
>
> Signed-off-by: Davide Caratti 
> ---
>  net/sched/act_police.c | 156 -
>  1 file changed, 92 insertions(+), 64 deletions(-)
>

 I must be missing something obvious with this patch.
>>>
>>> hello Eric,
>>>
>>> On the contrary, I missed something obvious when I wrote that patch: there
>>> is a race condition on tcfp_toks, tcfp_ptoks and tcfp_t_c; thank you for
>>> noticing it.
>>>
>>> These variables still need to be protected with a spinlock. I will do a
>>> patch and evaluate whether 'act_police' is still faster than a version where
>>> 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data path")
>>> is reverted, and share results in the next few hours.
>>>
>>> Ok?
>>>
>>
>> SGTM, thanks.
> 
> hello,
> I just finished the comparison of act_police, in the following cases:
> 
> a) revert the RCU-ification (i.e. commit 2d550dbad83c ("net/sched:
> act_police: don't use spinlock in the data path"), and leave per-cpu
> counters used by the rate estimator
> 
> b) keep RCU-ified configuration parameters, and protect read/update of
> tcfp_toks, tcfp_ptoks and tcfp_t_c with a spinlock (code at the bottom of
> this message).
> 
> ## Test setup:
> 
> $DEV is a 'dummy' with clsact qdisc; the following two commands,
> 
> # test police with 'rate'
> $TC filter add dev $DEV egress matchall \
>  action police rate 2gbit burst 100k conform-exceed pass/pass index 100
> 
> # test police with 'avrate'
> $TC filter add dev prova egress estimator 1s 8s matchall \
> action police avrate 2gbit conform-exceed pass/pass index 100
> 
> are tested with the following loop:
> 
> for c in 1 2 4 8 16; do
> ./pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64  -t $c -n 500 -i 
> $DEV
> done
> 
> 
> ## Test results:
> 
> using  rate  | reverted   | patched
> $c   | act_police (a) | act_police (b)
> -++---
>  1   |   3364442  |  3345580  
>  2   |   2703030  |  2721919  
>  4   |   1130146  |  1253555
>  8   |664238  |   658777
> 16   |154026  |   155259
> 
> 
> using avrate | reverted   | patched
> $c   | act_police (a) | act_police (b)
> -++---
>  1   |   3621796  |  3658472 
>  2   |   3075589  |  3548135  
>  4   |   2313314  |  3343717
>  8   |768458  |  3260480
> 16   |16  |  3254128
> 
> 
> So, 'avrate' still gets a significant improvement, because the 'conform/exceed'
> decision doesn't need the spinlock in this case. The estimation is probably
> less accurate, because it uses per-CPU variables: if this is not acceptable,
> then we also need to revert 93be42f9173b ("net/sched: act_police: use per-cpu
> counters").
> 
> 
> ## patch code:
> 
> -- >8 --
> diff --git a/net/sched/act_police.c b/net/sched/act_police.c
> index 052855d..42db852 100644
> --- a/net/sched/act_police.c
> +++ b/net/sched/act_police.c
> @@ -27,10 +27,7 @@ struct tcf_police_params {
>   u32 tcfp_ewma_rate;
>   s64 tcfp_burst;
>   u32 tcfp_mtu;
> - s64 tcfp_toks;
> - s64 tcfp_ptoks;
>   s64 tcfp_mtu_ptoks;
> - s64 tcfp_t_c;
>   struct psched_ratecfg   rate;
>   boolrate_present;
>   struct psched_ratecfg   peak;
> @@ -40,6 +37,9 @@ struct tcf_police_params {
>  
>  struct tcf_police {
>   struct tc_actioncommon;


> + s64 tcfp_toks;
> + s64 tcfp_ptoks;
> + s64 tcfp_t_c;

I suggest to use a single cache line with a dedicated spinlock and these three 
s64

spinlock_t  tcfp_lock cacheline_aligned_in_smp;
s64 ...
s64 ...
s64 ...


>   struct tcf_police_params __rcu *params;

Make sure to use a different cache line for *params 

struct tcf_police_params __rcu *params cacheline_aligned_in_smp;

>  };
>  
> @@ -186,12 +186,6 @@ static int tcf_police_init(struct net *net, struct 
> nlattr *nla,
>   }
>  
>   new->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
> - new->tcfp_toks = new->tcfp_burst;
> - if (new->peak_present) {
> - new->tcfp_mtu_ptoks = 

Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread Eric Dumazet



On 11/16/2018 06:34 AM, Eric Dumazet wrote:

> 
>> +s64 tcfp_toks;
>> +s64 tcfp_ptoks;
>> +s64 tcfp_t_c;
> 
> I suggest to use a single cache line with a dedicated spinlock and these 
> three s64
> 
>   spinlock_t  tcfp_lock cacheline_aligned_in_smp;
>   s64 ...
>   s64 ...
>   s64 ...
> 
> 
>>  struct tcf_police_params __rcu *params;
> 
> Make sure to use a different cache line for *params 
> 
>   struct tcf_police_params __rcu *params cacheline_aligned_in_smp;


Or move it before the cacheline used by the lock and three s64,
since 'common' should be read-mostly. No need for a separate cache line.
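
Putting the two suggestions together, a rough, untested sketch of the
resulting layout (using the kernel's ____cacheline_aligned_in_smp annotation):

struct tcf_police {
	struct tc_action	common;		/* read-mostly */
	struct tcf_police_params __rcu *params;	/* read-mostly */

	/* writer-side state, kept on its own cache line */
	spinlock_t		tcfp_lock ____cacheline_aligned_in_smp;
	s64			tcfp_toks;
	s64			tcfp_ptoks;
	s64			tcfp_t_c;
};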





RE: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread David Laight
From: Eric Dumazet
> Sent: 16 November 2018 14:35
...
> I suggest to use a single cache line with a dedicated spinlock and these 
> three s64
> 
>   spinlock_t  tcfp_lock cacheline_aligned_in_smp;
>   s64 ...
>   s64 ...
>   s64 ...

Doesn't this do something really stupid when cache lines are big?
If the spinlock is 8 bytes, you never want more than 32-byte alignment.
If cache lines are 256 bytes, you don't even need that.

Also, ISTR that kmalloc() only guarantees 8-byte alignment on x86_64,
so aligning structure members to larger offsets is rather pointless.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH net-next 6/8] net: eth: altera: tse: add support for ptp and timestamping

2018-11-16 Thread Dalon Westergreen
On Thu, 2018-11-15 at 18:14 -0800, Richard Cochran wrote:
> On Thu, Nov 15, 2018 at 06:55:29AM -0800, Dalon Westergreen wrote:
> > I would prefer to keep altera just to be consistent with the altera_tse
> > stuff,
> > and I intend to reuse this code for a 10GbE driver, so perhaps altera_tod to
> > reference the fpga ip name?
> 
> So the IP core is called "tod"?  Really?

For naming, how about intel_fpga_tod ?

--dalon

> 
> Thanks,
> Richard



Re: [PATCH net] sctp: not allow to set asoc prsctp_enable by sockopt

2018-11-16 Thread Neil Horman
On Thu, Nov 15, 2018 at 09:41:01PM -0200, Marcelo Ricardo Leitner wrote:
> [ re-sending, without html this time ]
> 
> On Thu, Nov 15, 2018, 15:26 Neil Horman  
> > On Thu, Nov 15, 2018 at 08:25:36PM -0200, Marcelo Ricardo Leitner wrote:
> > > On Thu, Nov 15, 2018 at 04:43:10PM -0500, Neil Horman wrote:
> > > > On Thu, Nov 15, 2018 at 03:22:21PM -0200, Marcelo Ricardo Leitner
> > wrote:
> > > > > On Thu, Nov 15, 2018 at 07:14:28PM +0800, Xin Long wrote:
> > > > > > As rfc7496#section4.5 says about SCTP_PR_SUPPORTED:
> > > > > >
> > > > > >This socket option allows the enabling or disabling of the
> > > > > >negotiation of PR-SCTP support for future associations.  For
> > existing
> > > > > >associations, it allows one to query whether or not PR-SCTP
> > support
> > > > > >was negotiated on a particular association.
> > > > > >
> > > > > > It means only sctp sock's prsctp_enable can be set.
> > > > > >
> > > > > > Note that for the limitation of SCTP_{CURRENT|ALL}_ASSOC, we will
> > > > > > add it when introducing SCTP_{FUTURE|CURRENT|ALL}_ASSOC for linux
> > > > > > sctp in another patchset.
> > > > > >
> > > > > > Fixes: 28aa4c26fce2 ("sctp: add SCTP_PR_SUPPORTED on sctp sockopt")
> > > > > > Reported-by: Ying Xu 
> > > > > > Signed-off-by: Xin Long 
> > > > > > ---
> > > > > >  net/sctp/socket.c | 13 +++--
> > > > > >  1 file changed, 3 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > > > > index 739f3e5..e9b8232 100644
> > > > > > --- a/net/sctp/socket.c
> > > > > > +++ b/net/sctp/socket.c
> > > > > > @@ -3940,7 +3940,6 @@ static int
> > sctp_setsockopt_pr_supported(struct sock *sk,
> > > > > > unsigned int optlen)
> > > > > >  {
> > > > > > struct sctp_assoc_value params;
> > > > > > -   struct sctp_association *asoc;
> > > > > > int retval = -EINVAL;
> > > > > >
> > > > > > if (optlen != sizeof(params))
> > > > > > @@ -3951,16 +3950,10 @@ static int
> > sctp_setsockopt_pr_supported(struct sock *sk,
> > > > > > goto out;
> > > > > > }
> > > > > >
> > > > > > -   asoc = sctp_id2assoc(sk, params.assoc_id);
> > > > > > -   if (asoc) {
> > > > > > -   asoc->prsctp_enable = !!params.assoc_value;
> > > > > > -   } else if (!params.assoc_id) {
> > > > > > -   struct sctp_sock *sp = sctp_sk(sk);
> > > > > > -
> > > > > > -   sp->ep->prsctp_enable = !!params.assoc_value;
> > > > > > -   } else {
> > > > > > +   if (sctp_style(sk, UDP) && sctp_id2assoc(sk,
> > params.assoc_id))
> > > > >
> > > > > This would allow using a non-existent assoc id on UDP-style sockets
> > to
> > > > > set it at the socket, which is not expected. It should be more like:
> > > > >
> > > > > + if (sctp_style(sk, UDP) && params.assoc_id)
> > > > How do you see that to be the case? sctp_id2assoc will return NULL if
> > an
> > > > association isn't found, so the use of sctp_id2assoc should work just
> > fine.
> > >
> > > Right, it will return NULL, and because of that it won't bail out as
> > > it should and will adjust the socket config instead.
> > >
> >
> > Oh, duh, you're absolutely right, NULL will evaluate to false there, and
> > skip
> > the conditional goto out;
> >
> > that said, it would make more sense to me to just change the sense of the
> > second
> > condition to !sctp_id2assoc(sk, params.assoc_id), so that we goto out if no
> > association is found.  it still seems a
> 
> 
> That would break setting it on the socket without an assoc so far.
> 
Ok, yes, I see what Xin is getting at now.  The RFC indicates that the
setsockopt method for this socket option is meant to set the prsctp enabled
value on _future_ associations, implying that we should not operate at all on
already existing associations (i.e. we should ignore the assoc_id in the passed
in structure and only operate on the socket).  That said, here's the entire text
of the RFC section:

4.5.  Socket Option for Getting and Setting the PR-SCTP Support
  (SCTP_PR_SUPPORTED)

   This socket option allows the enabling or disabling of the
   negotiation of PR-SCTP support for future associations.  For existing
   associations, it allows one to query whether or not PR-SCTP support
   was negotiated on a particular association.

   Whether or not PR-SCTP is enabled by default is implementation
   specific.

   This socket option uses IPPROTO_SCTP as its level and
   SCTP_PR_SUPPORTED as its name.  It can be used with getsockopt() and
   setsockopt().  The socket option value uses the following structure
   defined in [RFC6458]:

   struct sctp_assoc_value {
 sctp_assoc_t assoc_id;
 uint32_t assoc_value;
   };

   assoc_id:  This parameter is ignored for one-to-one style sockets.
  For one-to-many style sockets, this parameter indicates upon which
  association the user is performing an action.  The special
  

Re: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread Eric Dumazet



On 11/16/2018 06:41 AM, David Laight wrote:
> From: Eric Dumazet
>> Sent: 16 November 2018 14:35
> ...
>> I suggest using a single cache line with a dedicated spinlock and these three s64
>>
>>  spinlock_t  tcfp_lock ____cacheline_aligned_in_smp;
>>  s64 ...
>>  s64 ...
>>  s64 ...
> 
> Doesn't this do something really stupid when cache lines are big.
> If the spinlock is 8 bytes you never want more than 32 byte alignment.
> If cache lines are 256 bytes you don't even need that.

We do want that, even if cache lines are 256 bytes, thank you.

> 
> Also ISTR that the kmalloc() only guarantees 8 byte alignment on x86_64.
> So aligning structure members to larger offsets is rather pointless.


No it is not, we use these hints all the time.

Just double check and report a bug to mm teams if you disagree.

Please do not send feedback if you are not sure.


Re: [PATCH v3 0/4] bpf: allow zero-initialising hash map seed

2018-11-16 Thread Song Liu



> On Nov 16, 2018, at 3:41 AM, Lorenz Bauer  wrote:
> 
> Allow forcing the seed of a hash table to zero, for deterministic
> execution during benchmarking and testing.
> 
> Changes from v2:
> * Change ordering of BPF_F_ZERO_SEED in linux/bpf.h
> 
> Comments addressed from v1:
> * Add comment to discourage production use to linux/bpf.h
> * Require CAP_SYS_ADMIN
> 
> Lorenz Bauer (4):
>  bpf: allow zero-initializing hash map seed
>  bpf: move BPF_F_QUERY_EFFECTIVE after map flags
>  tools: sync linux/bpf.h
>  tools: add selftest for BPF_F_ZERO_SEED
> 
> include/uapi/linux/bpf.h|  9 ++--
> kernel/bpf/hashtab.c| 13 -
> tools/include/uapi/linux/bpf.h  | 13 +++--
> tools/testing/selftests/bpf/test_maps.c | 68 +
> 4 files changed, 84 insertions(+), 19 deletions(-)
> 
> -- 
> 2.17.1
> 

For the series:

Acked-by: Song Liu 



Linux kernel hangs if using RV1108 with KSZ8863 switch with two ports connected

2018-11-16 Thread Otavio Salvador
Hi,

I have a custom design, based on Rockchip RV1108 and running kernel
4.19, that uses a KSZ8863 switch.

The dts part is as follows:

&gmac {
	pinctrl-names = "default";
	pinctrl-0 = <&rmii_pins>;
snps,reset-gpio = < RK_PC1 GPIO_ACTIVE_LOW>;
snps,reset-active-low;
clock_in_out = "output";
status = "okay";
};

RV1108 GMAC is connected to KSZ8863 port 3 and after kernel boots, I
can put an Ethernet cable from my router to the uplink port of
KSZ8863, which makes the RV1108 board and a Linux PC connected to the
other KSZ8863 port both get IP addresses.

So in this usecase the setup is working fine.

However, if the RV1108 board boots with both Ethernet cables to the
KSZ8863 switch connected, then the kernel silently hangs.

Any suggestions as to what I should do in order to keep the kernel
from hanging with the two Ethernet cables connected?

The system boots fine without any Ethernet cable connected or with
only one Ethernet cable connected.
Here is the log of the system booting with no Ethernet cable connected:
http://dark-code.bulix.org/9kfff9-506410

It is only when both cables are connected that the kernel silently hangs.

Also, with the vendor 3.10 kernel such hang does not happen.

Thanks

-- 
Otavio Salvador O.S. Systems
http://www.ossystems.com.brhttp://code.ossystems.com.br
Mobile: +55 (53) 9 9981-7854  Mobile: +1 (347) 903-9750


Re: [PATCH net-next 7/8] net: eth: altera: tse: add msgdma prefetcher

2018-11-16 Thread Thor Thayer

Hi Dalon,

Just a few comments/questions.

On 11/14/18 6:50 PM, Dalon Westergreen wrote:

From: Dalon Westergreen 

Add support for the mSGDMA prefetcher.  The prefetcher adds support
for a linked list of descriptors in system memory.  The prefetcher
feeds these to the mSGDMA dispatcher.

The prefetcher is configured to poll for the next descriptor in the
list to be owned by hardware, then pass the descriptor to the
dispatcher.  It will then poll the next descriptor until it is
owned by hardware.

The dispatcher responses are written back to the appropriate
descriptor, and the owned by hardware bit is cleared.

The driver sets up a linked list twice the tx and rx ring sizes,
with the last descriptor pointing back to the first.  This ensures
that the ring of descriptors will always have inactive descriptors
preventing the prefetcher from looping over and reusing descriptors
inappropriately.  The prefetcher will continuously loop over these
descriptors.  The driver modifies descriptors as required to update
the skb address and length as well as the owned by hardware bit.

In addition to the above, the mSGDMA prefetcher can be used to
handle rx and tx timestamps coming from the ethernet ip.  These
can be included in the prefetcher response in the descriptor.

Signed-off-by: Dalon Westergreen 
---
  drivers/net/ethernet/altera/Makefile  |   2 +-
  .../altera/altera_msgdma_prefetcher.c | 433 ++
  .../altera/altera_msgdma_prefetcher.h |  30 ++
  .../altera/altera_msgdmahw_prefetcher.h   |  87 
  drivers/net/ethernet/altera/altera_tse.h  |  14 +
  drivers/net/ethernet/altera/altera_tse_main.c |  51 +++
  6 files changed, 616 insertions(+), 1 deletion(-)
  create mode 100644 drivers/net/ethernet/altera/altera_msgdma_prefetcher.c
  create mode 100644 drivers/net/ethernet/altera/altera_msgdma_prefetcher.h
  create mode 100644 drivers/net/ethernet/altera/altera_msgdmahw_prefetcher.h

diff --git a/drivers/net/ethernet/altera/Makefile 
b/drivers/net/ethernet/altera/Makefile
index ad80be42fa26..73b32876f126 100644
--- a/drivers/net/ethernet/altera/Makefile
+++ b/drivers/net/ethernet/altera/Makefile
@@ -5,4 +5,4 @@
  obj-$(CONFIG_ALTERA_TSE) += altera_tse.o
  altera_tse-objs := altera_tse_main.o altera_tse_ethtool.o \
   altera_msgdma.o altera_sgdma.o altera_utils.o \
-  altera_ptp.o
+  altera_ptp.o altera_msgdma_prefetcher.o
diff --git a/drivers/net/ethernet/altera/altera_msgdma_prefetcher.c 
b/drivers/net/ethernet/altera/altera_msgdma_prefetcher.c
new file mode 100644
index ..55b475e9e15b
--- /dev/null
+++ b/drivers/net/ethernet/altera/altera_msgdma_prefetcher.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0
+/* MSGDMA Prefetcher driver for Altera ethernet devices
+ *
+ * Copyright (C) 2018 Intel Corporation. All rights reserved.
+ * Author(s):
+ *   Dalon Westergreen 
+ */
+
+#include 
+#include 
+#include 
+#include "altera_utils.h"
+#include "altera_tse.h"
+#include "altera_msgdma.h"
+#include "altera_msgdmahw.h"
+#include "altera_msgdma_prefetcher.h"
+#include "altera_msgdmahw_prefetcher.h"


These could be alphabetized - tse and utils at the end.

+
+int msgdma_pref_initialize(struct altera_tse_private *priv)
+{
+   int i;
+   struct msgdma_pref_extended_desc *rx_descs;
+   struct msgdma_pref_extended_desc *tx_descs;
+   dma_addr_t rx_descsphys;
+   dma_addr_t tx_descsphys;
+   u32 rx_ring_size;
+   u32 tx_ring_size;
+
+   priv->pref_rxdescphys = (dma_addr_t)0;
+   priv->pref_txdescphys = (dma_addr_t)0;
+
+   /* we need to allocate more pref descriptors than ringsize, for now
+* just double ringsize
+*/
+   rx_ring_size = priv->rx_ring_size * 2;
+   tx_ring_size = priv->tx_ring_size * 2;
+
+   /* The prefetcher requires the descriptors to be aligned to the
+* descriptor read/write master's data width which worst case is
+* 512 bits.  Currently we DO NOT CHECK THIS and only support 32-bit
+* prefetcher masters.
+*/
+
+   /* allocate memory for rx descriptors */
+   priv->pref_rxdesc =
+   dma_zalloc_coherent(priv->device,
+   sizeof(struct msgdma_pref_extended_desc)
+   * rx_ring_size,
+   &priv->pref_rxdescphys, GFP_KERNEL);
+
+   if (!priv->pref_rxdesc)
+   goto err_rx;
+
+   /* allocate memory for tx descriptors */
+   priv->pref_txdesc =
+   dma_zalloc_coherent(priv->device,
+   sizeof(struct msgdma_pref_extended_desc)
+   * tx_ring_size,
+   &priv->pref_txdescphys, GFP_KERNEL);
+
+   if (!priv->pref_txdesc)
+   goto err_tx;
+
+   /* setup base descriptor ring for tx & rx */
+   rx_descs = (struct msgdma_pref_extended_desc 
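
A minimal sketch of the circular descriptor linking the commit message
describes (the next-pointer field name is an assumption; the posted driver
may name it differently):

	/* chain ring_size prefetcher descriptors into a ring, with the
	 * last entry pointing back at the first; descs/descs_phys come
	 * from dma_zalloc_coherent()
	 */
	static void pref_link_ring(struct msgdma_pref_extended_desc *descs,
				   dma_addr_t descs_phys, u32 ring_size)
	{
		u32 i;

		for (i = 0; i < ring_size; i++) {
			u32 next = (i + 1) % ring_size;

			/* 'next_desc_ptr' is a hypothetical field name */
			descs[i].next_desc_ptr =
				lower_32_bits(descs_phys + next * sizeof(*descs));
		}
	}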

[PATCH net-next] net: align pcpu_sw_netstats and pcpu_lstats structs

2018-11-16 Thread Eric Dumazet
Do not risk spanning these small structures on two cache lines,
it is absolutely not worth it.

For 32bit arches, the hint might not be enough, but we do not
really care anymore.

Signed-off-by: Eric Dumazet 
---
 include/linux/netdevice.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 
917ae7b6263e4686ac7af4d16445f4e996001ea6..086e64d885971ff04f186d488975b3305a0fbb1d
 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2389,13 +2389,13 @@ struct pcpu_sw_netstats {
u64 tx_packets;
u64 tx_bytes;
struct u64_stats_sync   syncp;
-};
+} __aligned(4 * sizeof(u64));
 
 struct pcpu_lstats {
u64 packets;
u64 bytes;
struct u64_stats_sync syncp;
-};
+} __aligned(2 * sizeof(u64));
 
 #define __netdev_alloc_pcpu_stats(type, gfp)   \
 ({ \
-- 
2.19.1.1215.g8438c0b245-goog
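
A small sketch of the alignment reasoning (plain C with stand-in types, not
the kernel structs): a 16-byte object at 16-byte alignment can start only at
offsets 0, 16, 32 or 48 within a 64-byte cache line, so it can never span
two lines.

	#include <stdint.h>

	/* stand-in for pcpu_lstats when u64_stats_sync is empty
	 * (64-bit build without lockdep) */
	struct lstats_like {
		uint64_t packets;
		uint64_t bytes;
	} __attribute__((aligned(2 * sizeof(uint64_t))));

	/* start % 64 is 0/16/32/48, and start + 16 never crosses a
	 * 64-byte boundary */
	_Static_assert(sizeof(struct lstats_like) == 16,
		       "fits within one cache line");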



[PATCH net] ip_tunnel: don't force DF when MTU is locked

2018-11-16 Thread Sabrina Dubroca
The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 net/ipv4/ip_tunnel_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index f45b96d715f0..c857ec6b9784 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -80,7 +80,7 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
 
iph->version=   4;
iph->ihl=   sizeof(struct iphdr) >> 2;
-   iph->frag_off   =   df;
+   iph->frag_off   =   ip_mtu_locked(&rt->dst) ? 0 : df;
iph->protocol   =   proto;
iph->tos=   tos;
iph->daddr  =   dst;
-- 
2.19.1



Re: [PATCH net-next 6/8] net: eth: altera: tse: add support for ptp and timestamping

2018-11-16 Thread Richard Cochran
On Fri, Nov 16, 2018 at 06:48:15AM -0800, Dalon Westergreen wrote:
> For naming, how about intel_fpga_tod ?

Fine by me.

Thanks,
Richard


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu  wrote:
>
> On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
> >
> > > Normally if the hardware's partial checksum is valid then we just
> > > trust it and send the packet along.  However, if the partial
> > > checksum is invalid we don't trust it and we will compute the
> > > whole checksum manually which is what ends up in sum.
> >
> > Not sure if I understand partial checksum here, but it is the
> > CHECKSUM_COMPLETE case which I am trying to fix, not
> > CHECKSUM_PARTIAL.
>
> What I meant by partial checksum is the checksum produced by the
> hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
> CHECKSUM_PARTIAL is the absence of the substantial part of the
> checksum which is something we use in the kernel primarily for TX.
>
> Yes the names are confusing :)

Yeah, understood. The hardware provides skb->csum in this case, but
we keep adjusting it each time we change skb->data.


>
> > So, in other word, a checksum *match* is the intended to detect
> > this HW RX checksum fault?
>
> Correct.  Or more likely it's probably a bug in either the driver
> or if there are overlaying code such as VLAN then in that code.
>
> Basically if the RX checksum is buggy, it's much more likely to
> cause a valid packet to be rejected than to cause an invalid packet
> to be accepted, because we still verify that checksum against the
> pseudoheader.  So we only attempt to catch buggy hardware/drivers
> by doing a second manual verification for the case where the packet
> is flagged as invalid.

Hmm, now I see how it works. Actually it uses the difference between
these two checks as the difference between the hardware checksum and
skb_checksum().

I will send a patch to add a comment there to avoid confusion.


>
> > Sure, my case is nearly the same as Pawel's, except I have no vlan:
> > https://marc.info/?l=linux-netdev&m=154086647601721&w=2
>
> Can you please provide your backtrace?

I already did:
https://marc.info/?l=linux-netdev&m=154092211305599&w=2

Note, the offending commit has been backported to 4.14, which
is why I saw this warning. I have no idea why it was backported
in the first place; it is just an optimization and doesn't fix any bug,
IMHO.

Also, it is much harder for me to reproduce it than Pawel who
saw the warning every second. Sometimes I need 1 hour to trigger
it, sometimes other people here need 10+ hours to trigger it.

Let me see if I can add vlan on my side to make it more reproducible,
it seems hard as our switch doesn't use vlan either.

We have warnings with conntrack involved too, I can provide it too
if you are interested.

I tend to revert it for -stable, at least that is what I plan to do
on my side unless there is a fix coming soon.

Thanks.
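
A condensed sketch of the detection idea discussed above (not the exact
kernel code; skb_checksum() and csum_fold() are real helpers, the wrapper
function is hypothetical): recompute the sum in software over the same
bytes the NIC claims to have covered and compare the folded results; a
mismatch points at buggy hardware or an overlaying layer.

	static bool rx_csum_fault_suspected(struct sk_buff *skb)
	{
		__wsum sw = skb_checksum(skb, 0, skb->len, 0);

		return skb->ip_summed == CHECKSUM_COMPLETE &&
		       csum_fold(sw) != csum_fold(skb->csum);
	}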


Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference

2018-11-16 Thread Steffen Klassert
On Fri, Nov 16, 2018 at 08:48:00PM +0200, Lennert Buytenhek wrote:
> On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote:
> 
> > we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using
> > IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4.
> 
> I looked into this with Jean-Philippe, and it appears to be crashing
> on a NULL pointer dereference in the inlined xfrm_policy_check() call
> in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in
> __xfrm_policy_check2():
> 
>   return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
>   (skb_dst(skb)->flags & DST_NOPOLICY) || <=
>   __xfrm_policy_check(sk, ndir, skb, family);
> 
> Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when
> skb_dst_force clears the dst_entry.") fixes a very similar problem on
> the output and forward paths, but our issue seems to be triggering on
> the input path.

Yes, this is the same problem. skb_dst_force() does not
really force a refcount anymore, it might clear the dst
pointer instead (maybe this function should be renamed).

Want to submit a fix? If not I'll go fix that.

Thanks!


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Fri, Nov 16, 2018 at 12:06 PM Cong Wang  wrote:
>
> Hmm, now I see how it works. Actually it uses the difference between
> these two checks as the difference between the hardware checksum and
> skb_checksum().
>

Well...

This is true only when there is a skb_checksum_init*() or
skb_checksum_validate*() prior to it; it seems not true for
nf_ip_checksum(), where skb->csum is correctly set to the pseudo header
checksum but there is no validation of the original skb->csum.
So this check should still be inverted there??

Or am I still missing anything here?


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Eric Dumazet



On 11/16/2018 12:15 PM, Cong Wang wrote:
> On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet  wrote:
>>
>> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
>> case non zero trailer bytes were added by a buggy switch (or host)
>>
>> Saeed can comment/confirm, but the theory is that the NIC does header 
>> analysis and
>> computes a checksum only on the bytes of the IP frame, not including the 
>> tail bytes
>> that were added by a switch.
> 
> 
> This theory doesn't seem to explain why Pawel saw this warning so often,
> which is beyond the probability of a buggy switch. I don't know.

Well the bug here would be the receiver NIC, not really respecting the
CHECKSUM_COMPLETE premise (provide a checksum over all the bytes,
regardless of how smart header parsing can be on the NIC).

'Buggy switch' would add random bytes after IP frames, but as I mentioned, any 
AF_PACKET user
can cook arbitrary padding after a valid IP (or IPv6) frame.

> 
> I will try it.
> 
> Thanks.
> 
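
For reference, a minimal user-space sketch of what Eric describes: an
AF_PACKET sender that appends non-zero trailer bytes after a valid IP
header's tot_len. Interface name, addresses and checksum handling are
placeholders, not a polished reproducer.

	#include <arpa/inet.h>
	#include <linux/if_ether.h>
	#include <linux/if_packet.h>
	#include <net/if.h>
	#include <netinet/ip.h>
	#include <string.h>
	#include <sys/socket.h>

	static int send_padded_frame(const char *ifname)
	{
		unsigned char frame[ETH_HLEN + sizeof(struct iphdr) + 10];
		struct sockaddr_ll sll = {0};
		struct iphdr *iph;
		int fd;

		fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
		if (fd < 0)
			return -1;

		memset(frame, 0, sizeof(frame));
		frame[12] = 0x08;	/* ethertype IPv4 (0x0800), high byte */

		iph = (struct iphdr *)(frame + ETH_HLEN);
		iph->version = 4;
		iph->ihl = 5;
		iph->tot_len = htons(sizeof(*iph));	/* trailer not counted */
		iph->ttl = 64;
		iph->protocol = IPPROTO_RAW;
		/* iph->check computation omitted in this sketch */

		/* non-zero padding beyond what the IP header declares */
		memset(frame + ETH_HLEN + sizeof(*iph), 0xa5, 10);

		sll.sll_family = AF_PACKET;
		sll.sll_ifindex = if_nametoindex(ifname);
		sll.sll_halen = ETH_ALEN;

		return sendto(fd, frame, sizeof(frame), 0,
			      (struct sockaddr *)&sll, sizeof(sll)) < 0 ? -1 : 0;
	}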


[PATCH mlx5-next 10/12] net/mlx5: EQ, Generic EQ

2018-11-16 Thread Saeed Mahameed
Add mlx5_eq_{create/destroy}_generic APIs and EQE access methods, for
mlx5 core consumers generic EQs.

This API will be used in downstream patch to move page fault (RDMA ODP)
EQ logic into mlx5_ib rdma driver, hence it will use a generic EQ.

Current mlx5 EQ allocation scheme:
On load mlx5 allocates 4 (for async) + #cores (for data completions)
MSIX vectors, mlx5 core will assign 3 MSIX vectors for internal async
EQs and will use all of the #cores MSIX vectors for completion EQs,
(One vector is going to be reserved for a generic EQ).

After this patch an external user (e.g mlx5_ib) of mlx5_core
can use this new API to create new generic EQs with the reserved msix
vector index for that eq.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 243 +-
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |  12 +-
 include/linux/mlx5/eq.h   |  39 +++
 3 files changed, 221 insertions(+), 73 deletions(-)
 create mode 100644 include/linux/mlx5/eq.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 252c9f0569b1..ec1f5018546e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include <linux/mlx5/eq.h>
 #include 
 #ifdef CONFIG_RFS_ACCEL
 #include 
@@ -69,6 +70,7 @@ enum {
 struct mlx5_irq_info {
cpumask_var_t mask;
char name[MLX5_MAX_IRQ_NAME];
+   void *context; /* dev_id provided to request_irq */
 };
 
 struct mlx5_eq_table {
@@ -81,7 +83,6 @@ struct mlx5_eq_table {
struct mlx5_eq_pagefault pfault_eq;
 #endif
	struct mutex		lock; /* sync async eqs creations */
-   u8  num_async_eqs;
int num_comp_vectors;
struct mlx5_irq_info*irq_info;
 #ifdef CONFIG_RFS_ACCEL
@@ -229,19 +230,19 @@ static void eqe_pf_action(struct work_struct *work)
 work);
struct mlx5_eq_pagefault *eq = pfault->eq;
 
-   mlx5_core_page_fault(eq->core.dev, pfault);
+   mlx5_core_page_fault(eq->core->dev, pfault);
mempool_free(pfault, eq->pool);
 }
 
 static void eq_pf_process(struct mlx5_eq_pagefault *eq)
 {
-   struct mlx5_core_dev *dev = eq->core.dev;
+   struct mlx5_core_dev *dev = eq->core->dev;
struct mlx5_eqe_page_fault *pf_eqe;
struct mlx5_pagefault *pfault;
struct mlx5_eqe *eqe;
int set_ci = 0;
 
-   while ((eqe = next_eqe_sw(&eq->core))) {
+   while ((eqe = next_eqe_sw(eq->core))) {
pfault = mempool_alloc(eq->pool, GFP_ATOMIC);
if (!pfault) {
			schedule_work(&eq->work);
@@ -316,16 +317,16 @@ static void eq_pf_process(struct mlx5_eq_pagefault *eq)
		INIT_WORK(&pfault->work, eqe_pf_action);
		queue_work(eq->wq, &pfault->work);
 
-   ++eq->core.cons_index;
+   ++eq->core->cons_index;
++set_ci;
 
if (unlikely(set_ci >= MLX5_NUM_SPARE_EQE)) {
-   eq_update_ci(&eq->core, 0);
+   eq_update_ci(eq->core, 0);
set_ci = 0;
}
}
 
-   eq_update_ci(&eq->core, 1);
+   eq_update_ci(eq->core, 1);
 }
 
 static irqreturn_t mlx5_eq_pf_int(int irq, void *eq_ptr)
@@ -368,6 +369,7 @@ static void eq_pf_action(struct work_struct *work)
 static int
 create_pf_eq(struct mlx5_core_dev *dev, struct mlx5_eq_pagefault *eq)
 {
+   struct mlx5_eq_param param = {};
int err;
 
	spin_lock_init(&eq->lock);
@@ -386,11 +388,19 @@ create_pf_eq(struct mlx5_core_dev *dev, struct 
mlx5_eq_pagefault *eq)
goto err_mempool;
}
 
-   err = mlx5_create_async_eq(dev, &eq->core, MLX5_NUM_ASYNC_EQE,
-  1 << MLX5_EVENT_TYPE_PAGE_FAULT,
-  "mlx5_page_fault_eq", mlx5_eq_pf_int);
-   if (err)
+   param = (struct mlx5_eq_param) {
+   .index = MLX5_EQ_PFAULT_IDX,
+   .mask = 1 << MLX5_EVENT_TYPE_PAGE_FAULT,
+   .nent = MLX5_NUM_ASYNC_EQE,
+   .context = eq,
+   .handler = mlx5_eq_pf_int
+   };
+
+   eq->core = mlx5_eq_create_generic(dev, "mlx5_page_fault_eq", &param);
+   if (IS_ERR(eq->core)) {
+   err = PTR_ERR(eq->core);
goto err_wq;
+   }
 
return 0;
 err_wq:
@@ -404,7 +414,7 @@ static int destroy_pf_eq(struct mlx5_core_dev *dev, struct 
mlx5_eq_pagefault *eq
 {
int err;
 
-   err = mlx5_destroy_async_eq(dev, &eq->core);
+   err = mlx5_eq_destroy_generic(dev, eq->core);
	cancel_work_sync(&eq->work);
destroy_workqueue(eq->wq);
mempool_destroy(eq->pool);
@@ -710,25 +720,29 @@ static void init_eq_buf(struct mlx5_eq *eq)
 }
 
 static int

[PATCH mlx5-next 09/12] net/mlx5: EQ, Different EQ types

2018-11-16 Thread Saeed Mahameed
In mlx5 we have three types of usages for EQs,
1. Asynchronous EQs, used internally by mlx5 core for
 a. FW command completions
 b. FW page requests
 c. one EQ for all other Asynchronous events

2. Completion EQs, used for CQ completion (we create one per core)

3. *Special type of EQ (page fault) used for RDMA on demand paging
(ODP).

*The 3rd type shouldn't be special at least in mlx5 core, it is yet
another async events EQ with specific use case, it will be removed in
the next two patches, and will completely move its logic to mlx5_ib,
as it is rdma specific.

In this patch we move the use case (eq type) specific fields out of
struct mlx5_eq into new eq type specific structures:

struct mlx5_eq_async;
struct mlx5_eq_comp;
struct mlx5_eq_pagefault;

Separate their type specific flows.

In the future we will allow users to create their own generic EQs.
For now we will allow only one for ODP in the next patches.

We will introduce an event listener registration API for those who
want to receive mlx5 async events.
After that mlx5 eq handling will be clean from feature/user specific
handling.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c  |  10 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 376 +++---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |  53 ++-
 .../net/ethernet/mellanox/mlx5/core/main.c|   2 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   4 -
 include/linux/mlx5/cq.h   |   2 +-
 include/linux/mlx5/driver.h   |  10 +-
 10 files changed, 270 insertions(+), 199 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 6e55d2f37c6d..713a17ee3751 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -93,10 +93,10 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
u32 dout[MLX5_ST_SZ_DW(destroy_cq_out)];
u32 out[MLX5_ST_SZ_DW(create_cq_out)];
u32 din[MLX5_ST_SZ_DW(destroy_cq_in)];
-   struct mlx5_eq *eq;
+   struct mlx5_eq_comp *eq;
int err;
 
-   eq = mlx5_eqn2eq(dev, eqn);
+   eq = mlx5_eqn2comp_eq(dev, eqn);
if (IS_ERR(eq))
return PTR_ERR(eq);
 
@@ -120,7 +120,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
	INIT_LIST_HEAD(&cq->tasklet_ctx.list);
 
/* Add to comp EQ CQ tree to recv comp events */
-   err = mlx5_eq_add_cq(eq, cq);
+   err = mlx5_eq_add_cq(&eq->core, cq);
if (err)
goto err_cmd;
 
@@ -140,7 +140,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
return 0;
 
 err_cq_add:
-   mlx5_eq_del_cq(eq, cq);
+   mlx5_eq_del_cq(&eq->core, cq);
 err_cmd:
memset(din, 0, sizeof(din));
memset(dout, 0, sizeof(dout));
@@ -162,7 +162,7 @@ int mlx5_core_destroy_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq)
if (err)
return err;
 
-   err = mlx5_eq_del_cq(cq->eq, cq);
+   err = mlx5_eq_del_cq(&cq->eq->core, cq);
if (err)
return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c23caade31bf..0d495a6b3949 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -320,7 +320,7 @@ static void mlx5e_enable_async_events(struct mlx5e_priv 
*priv)
 static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
 {
	clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLED, &priv->state);
-   synchronize_irq(pci_irq_vector(priv->mdev->pdev, MLX5_EQ_VEC_ASYNC));
+   mlx5_eq_synchronize_async_irq(priv->mdev);
 }
 
 static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
@@ -4117,17 +4117,17 @@ static netdev_features_t mlx5e_features_check(struct 
sk_buff *skb,
 static bool mlx5e_tx_timeout_eq_recover(struct net_device *dev,
struct mlx5e_txqsq *sq)
 {
-   struct mlx5_eq *eq = sq->cq.mcq.eq;
+   struct mlx5_eq_comp *eq = sq->cq.mcq.eq;
u32 eqe_count;
 
netdev_err(dev, "EQ 0x%x: Cons = 0x%x, irqn = 0x%x\n",
-  eq->eqn, eq->cons_index, eq->irqn);
+  eq->core.eqn, eq->core.cons_index, eq->core.irqn);
 
eqe_count = mlx5_eq_poll_irq_disabled(eq);
if (!eqe_count)
return false;
 
-   netdev_err(dev, "Recover %d eqes on EQ 0x%x\n", eqe_count, eq->eqn);
+   netdev_err(dev, "Recover %d eqes on EQ 0x%x\n", eqe_count, 
eq->core.eqn);
sq->channel->stats->eq_rearm++;
return true;
 }
diff --git 

[PATCH mlx5-next 03/12] net/mlx5: EQ, No need to store eq index as a field

2018-11-16 Thread Saeed Mahameed
eq->index is used only for completion EQs and is assigned to be the
completion eq index; it is used only when traversing the completion
eqs list, and it can be calculated dynamically, thus remove the
eq->index field.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ++--
 include/linux/mlx5/driver.h| 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index d5cea0a36e6a..f5e6d375a8cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -702,10 +702,11 @@ int mlx5_vector2eqn(struct mlx5_core_dev *dev, int 
vector, int *eqn,
	struct mlx5_eq_table *table = &dev->priv.eq_table;
struct mlx5_eq *eq, *n;
int err = -ENOENT;
+   int i = 0;
 
	spin_lock(&table->lock);
	list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
-   if (eq->index == vector) {
+   if (i++ == vector) {
*eqn = eq->eqn;
*irqn = eq->irqn;
err = 0;
@@ -797,7 +798,6 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
goto clean;
}
mlx5_core_dbg(dev, "allocated completion EQN %d\n", eq->eqn);
-   eq->index = i;
		spin_lock(&table->lock);
		list_add_tail(&eq->list, &table->comp_eqs_list);
		spin_unlock(&table->lock);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 15cf6727a62d..4b62d71825c1 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -399,7 +399,6 @@ struct mlx5_eq {
u8  eqn;
int nent;
struct list_headlist;
-   int index;
struct mlx5_rsc_debug   *dbg;
enum mlx5_eq_type   type;
union {
-- 
2.19.1



[PATCH mlx5-next 02/12] net/mlx5: EQ, Remove unused fields and structures

2018-11-16 Thread Saeed Mahameed
Some fields and structures are not referenced nor used by the driver,
remove them.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 11 ---
 include/linux/mlx5/driver.h  |  3 ---
 2 files changed, 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index aeab0c4f60f4..fd5926daa0a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -78,17 +78,6 @@ enum {
   (1ull << MLX5_EVENT_TYPE_SRQ_LAST_WQE)   | \
   (1ull << MLX5_EVENT_TYPE_SRQ_RQ_LIMIT))
 
-struct map_eq_in {
-   u64 mask;
-   u32 reserved;
-   u32 unmap_eqn;
-};
-
-struct cre_des_eq {
-   u8  reserved[15];
-   u8  eqn;
-};
-
 static int mlx5_cmd_destroy_eq(struct mlx5_core_dev *dev, u8 eqn)
 {
u32 out[MLX5_ST_SZ_DW(destroy_eq_out)] = {0};
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 7d4ed995b4ce..15cf6727a62d 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -398,7 +398,6 @@ struct mlx5_eq {
unsigned intirqn;
u8  eqn;
int nent;
-   u64 mask;
struct list_headlist;
int index;
struct mlx5_rsc_debug   *dbg;
@@ -478,8 +477,6 @@ struct mlx5_core_srq {
 };
 
 struct mlx5_eq_table {
-   void __iomem   *update_ci;
-   void __iomem   *update_arm_ci;
struct list_headcomp_eqs_list;
struct mlx5_eq  pages_eq;
struct mlx5_eq  async_eq;
-- 
2.19.1



[PATCH mlx5-next 06/12] net/mlx5: EQ, Create all EQs in one place

2018-11-16 Thread Saeed Mahameed
Instead of creating the EQ table in three steps at driver load,
 - allocate irq vectors
 - allocate async EQs
 - allocate completion EQs
Gather all of the procedures into one function in eq.c and call it from
driver load.

This will help us reduce the EQ and EQ table private structures
visibility to eq.c in downstream refactoring.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 .../net/ethernet/mellanox/mlx5/core/debugfs.c |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 121 ++
 .../net/ethernet/mellanox/mlx5/core/main.c|  81 ++--
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   8 +-
 4 files changed, 116 insertions(+), 104 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
index 90fabd612b6c..b76766fb6c67 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
@@ -349,6 +349,16 @@ static u64 qp_read_field(struct mlx5_core_dev *dev, struct 
mlx5_core_qp *qp,
return param;
 }
 
+static int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
+ u32 *out, int outlen)
+{
+   u32 in[MLX5_ST_SZ_DW(query_eq_in)] = {};
+
+   MLX5_SET(query_eq_in, in, opcode, MLX5_CMD_OP_QUERY_EQ);
+   MLX5_SET(query_eq_in, in, eq_number, eq->eqn);
+   return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
+}
+
 static u64 eq_read_field(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
 int index)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 4d79a4ccb758..44ccd4206104 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -822,7 +822,7 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev)
 
 /* Async EQs */
 
-int mlx5_start_eqs(struct mlx5_core_dev *dev)
+static int create_async_eqs(struct mlx5_core_dev *dev)
 {
	struct mlx5_eq_table *table = &dev->priv.eq_table;
u64 async_event_mask = MLX5_ASYNC_EVENT_MASK;
@@ -914,7 +914,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
return err;
 }
 
-void mlx5_stop_eqs(struct mlx5_core_dev *dev)
+static void destroy_async_eqs(struct mlx5_core_dev *dev)
 {
	struct mlx5_eq_table *table = &dev->priv.eq_table;
int err;
@@ -945,19 +945,9 @@ void mlx5_stop_eqs(struct mlx5_core_dev *dev)
  err);
 }
 
-int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
-  u32 *out, int outlen)
-{
-   u32 in[MLX5_ST_SZ_DW(query_eq_in)] = {0};
-
-   MLX5_SET(query_eq_in, in, opcode, MLX5_CMD_OP_QUERY_EQ);
-   MLX5_SET(query_eq_in, in, eq_number, eq->eqn);
-   return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
-}
-
 /* Completion EQs */
 
-static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
+static int set_comp_irq_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
	struct mlx5_priv *priv  = &mdev->priv;
int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
@@ -978,7 +968,7 @@ static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev 
*mdev, int i)
return 0;
 }
 
-static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
+static void clear_comp_irq_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
	struct mlx5_priv *priv  = &mdev->priv;
@@ -988,13 +978,13 @@ static void mlx5_irq_clear_affinity_hint(struct 
mlx5_core_dev *mdev, int i)
free_cpumask_var(priv->irq_info[vecidx].mask);
 }
 
-static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
+static int set_comp_irq_affinity_hints(struct mlx5_core_dev *mdev)
 {
int err;
int i;
 
for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
-   err = mlx5_irq_set_affinity_hint(mdev, i);
+   err = set_comp_irq_affinity_hint(mdev, i);
if (err)
goto err_out;
}
@@ -1003,25 +993,25 @@ static int mlx5_irq_set_affinity_hints(struct 
mlx5_core_dev *mdev)
 
 err_out:
for (i--; i >= 0; i--)
-   mlx5_irq_clear_affinity_hint(mdev, i);
+   clear_comp_irq_affinity_hint(mdev, i);
 
return err;
 }
 
-static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
+static void clear_comp_irqs_affinity_hints(struct mlx5_core_dev *mdev)
 {
int i;
 
for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
-   mlx5_irq_clear_affinity_hint(mdev, i);
+   clear_comp_irq_affinity_hint(mdev, i);
 }
 
-void mlx5_free_comp_eqs(struct mlx5_core_dev *dev)
+static void destroy_comp_eqs(struct mlx5_core_dev *dev)
 {
	struct mlx5_eq_table *table = &dev->priv.eq_table;
struct mlx5_eq *eq, *n;
 
-   mlx5_irq_clear_affinity_hints(dev);
+   

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:52 PM Eric Dumazet  wrote:
>
> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> case non zero trailer bytes were added by a buggy switch (or host)
>
> Saeed can comment/confirm, but the theory is that the NIC does header 
> analysis and
> computes a checksum only on the bytes of the IP frame, not including the tail 
> bytes
> that were added by a switch.


This theory doesn't seem to explain why Pawel saw this warning so often,
which is beyond the probability of a buggy switch. I don't know.


>
> You could use trafgen to cook such a frame and confirm the theory.
>
> Something like :

I will try it.

Thanks.


[PATCH net-next] add part of TCP counts explanations in snmp_counters.rst

2018-11-16 Thread yupeng
Add explanations of some generic TCP counters, fast path related
counters and TCP abort related counters, along with several
examples.

Signed-off-by: yupeng 
---
 Documentation/networking/snmp_counter.rst | 525 +-
 1 file changed, 524 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/snmp_counter.rst 
b/Documentation/networking/snmp_counter.rst
index e0d588fcb67f..a5b8dc0c7c4a 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -40,7 +40,7 @@ multicast packets, and would always be updated together with
 IpExtOutOctets.
 
 * IpExtInOctets and IpExtOutOctets
-They are linux kernel extensions, no RFC definitions. Please note,
+They are Linux kernel extensions, no RFC definitions. Please note,
 RFC1213 indeed defines ifInOctets  and ifOutOctets, but they
 are different things. The ifInOctets and ifOutOctets include the MAC
 layer header size but IpExtInOctets and IpExtOutOctets don't, they
@@ -174,6 +174,163 @@ IcmpMsgOutType[N]. If the errors occur in both step (2) 
and step (4),
 IcmpInMsgs should be less than the sum of IcmpMsgOutType[N] plus
 IcmpInErrors.
 
+General TCP counters
+====================
+* TcpInSegs
+Defined in `RFC1213 tcpInSegs`_
+
+.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
+
+The number of packets received by the TCP layer. As mentioned in
+RFC1213, it includes the packets received in error, such as checksum
+error, invalid TCP header and so on. Only one error won't be included:
+if the layer 2 destination address is not the NIC's layer 2
+address. It might happen if the packet is a multicast or broadcast
+packet, or the NIC is in promiscuous mode. In these situations, the
+packets would be delivered to the TCP layer, but the TCP layer will discard
+these packets before increasing TcpInSegs. The TcpInSegs counter
+isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
+counter would only increase by 1.
+
+* TcpOutSegs
+Defined in `RFC1213 tcpOutSegs`_
+
+.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
+
+The number of packets sent by the TCP layer. As mentioned in RFC1213,
+it excludes the retransmitted packets. But it includes the SYN, ACK
+and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
+GSO, so if a packet is split into 2 by GSO, TcpOutSegs will
+increase by 2.
+
+* TcpActiveOpens
+Defined in `RFC1213 tcpActiveOpens`_
+
+.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
+
+It means the TCP layer sends a SYN and comes into the SYN-SENT
+state. Every time TcpActiveOpens increases by 1, TcpOutSegs should
+always increase by 1.
+
+* TcpPassiveOpens
+Defined in `RFC1213 tcpPassiveOpens`_
+
+.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
+
+It means the TCP layer receives a SYN, replies with a SYN+ACK, and
+comes into the SYN-RCVD state.
+
+TCP Fast Path
+=============
+When the kernel receives a TCP packet, it has two paths to handle the
+packet: one is the fast path, the other is the slow path. The comment
+in the kernel code provides a good explanation of them; I pasted it
+below::
+
+  It is split into a fast path and a slow path. The fast path is
+  disabled when:
+
+  - A zero window was announced from us
+  - zero window probing
+is only handled properly on the slow path.
+  - Out of order segments arrived.
+  - Urgent data is expected.
+  - There is no buffer space left
+  - Unexpected TCP flags/window values/header lengths are received
+(detected by checking the TCP header against pred_flags)
+  - Data is sent in both directions. The fast path only supports pure senders
+or pure receivers (this means either the sequence number or the ack
+value must stay constant)
+  - Unexpected TCP option.
+
+The kernel will try to use the fast path unless any of the above
+conditions are satisfied. If the packets are out of order, the kernel
+will handle them in the slow path, which means the performance might
+not be very good. The kernel would also come into the slow path if
+"Delayed ack" is used, because when using "Delayed ack", the data is
+sent in both directions. When the TCP window scale option is not used,
+the kernel will try to enable the fast path immediately when the
+connection comes into the established state, but if the TCP window
+scale option is used, the kernel will disable the fast path at first,
+and try to enable it after the kernel receives packets.
+
+* TcpExtTCPPureAcks and TcpExtTCPHPAcks
+If a packet sets the ACK flag and has no data, it is a pure ACK
+packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks
+will increase by 1; if the kernel handles it in the slow path,
+TcpExtTCPPureAcks will increase by 1.
+
+* TcpExtTCPHPHits
+If a TCP packet has data (which means it is not a pure ACK packet),
+and this packet is handled in the fast path, TcpExtTCPHPHits will
+increase by 1.
+
+
+TCP abort
+=========
+
+
+* TcpExtTCPAbortOnData
+It means the TCP layer has data in flight, but needs to close the
+connection. So the TCP layer sends a 

[PATCH net 1/2] tc-testing: tdc.py: ignore errors when decoding stdout/stderr

2018-11-16 Thread Lucas Bates
Prevent exceptions from being raised while decoding output
from an executed command. There is no impact on tdc's
execution and the verify command phase would fail the pattern
match.

Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/tdc.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index 87a04a8..9b3f414 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -134,9 +134,9 @@ def exec_cmd(args, pm, stage, command):
 (rawout, serr) = proc.communicate()
 
 if proc.returncode != 0 and len(serr) > 0:
-foutput = serr.decode("utf-8")
+foutput = serr.decode("utf-8", errors="ignore")
 else:
-foutput = rawout.decode("utf-8")
+foutput = rawout.decode("utf-8", errors="ignore")
 
 proc.stdout.close()
 proc.stderr.close()
-- 
2.7.4



[PATCH net 0/2] Prevent uncaught exceptions in tdc

2018-11-16 Thread Lucas Bates
This patch series addresses two potential bugs in tdc that can
cause exceptions to be raised in certain circumstances.  These
exceptions are generally not handled, so instead we will prevent
them from being raised.

Brenda J. Butler (1):
  tc-testing: tdc.py: Guard against lack of returncode in executed
command

Lucas Bates (1):
  tc-testing: tdc.py: ignore errors when decoding stdout/stderr

 tools/testing/selftests/tc-testing/tdc.py | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

--
2.7.4



[PATCH net 2/2] tc-testing: tdc.py: Guard against lack of returncode in executed command

2018-11-16 Thread Lucas Bates
From: "Brenda J. Butler" 

Add some defensive coding in case one of the subprocesses created by tdc
returns nothing. If no object is returned from exec_cmd, then tdc will
halt with an unhandled exception.

Signed-off-by: Brenda J. Butler 
Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/tdc.py | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index 9b3f414..7607ba3 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -169,6 +169,8 @@ def prepare_env(args, pm, stage, prefix, cmdlist, output = 
None):
   file=sys.stderr)
 print("\n{} *** Error message: \"{}\"".format(prefix, foutput),
   file=sys.stderr)
+print("returncode {}; expected {}".format(proc.returncode,
+  exit_codes))
 print("\n{} *** Aborting test run.".format(prefix), 
file=sys.stderr)
 print("\n\n{} *** stdout ***".format(proc.stdout), file=sys.stderr)
 print("\n\n{} *** stderr ***".format(proc.stderr), file=sys.stderr)
@@ -195,12 +197,18 @@ def run_one_test(pm, args, index, tidx):
 print('-> execute stage')
 pm.call_pre_execute()
 (p, procout) = exec_cmd(args, pm, 'execute', tidx["cmdUnderTest"])
-exit_code = p.returncode
+if p:
+exit_code = p.returncode
+else:
+exit_code = None
+
 pm.call_post_execute()
 
-if (exit_code != int(tidx["expExitCode"])):
+if (exit_code is None or exit_code != int(tidx["expExitCode"])):
 result = False
-print("exit:", exit_code, int(tidx["expExitCode"]))
+print("exit: {!r}".format(exit_code))
+print("exit: {}".format(int(tidx["expExitCode"])))
+#print("exit: {!r} {}".format(exit_code, int(tidx["expExitCode"])))
 print(procout)
 else:
 if args.verbose > 0:
-- 
2.7.4



Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference

2018-11-16 Thread Lennert Buytenhek
On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote:

> we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using
> IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4.

I looked into this with Jean-Philippe, and it appears to be crashing
on a NULL pointer dereference in the inlined xfrm_policy_check() call
in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in
__xfrm_policy_check2():

return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
(skb_dst(skb)->flags & DST_NOPOLICY) || <=
__xfrm_policy_check(sk, ndir, skb, family);

Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when
skb_dst_force clears the dst_entry.") fixes a very similar problem on
the output and forward paths, but our issue seems to be triggering on
the input path.

This hack patch seems to make the crashes go away, and the printk added
triggers with approximately the same regularity as the crashes used
to occur, so the fix from 9e1437937807 probably needs to be extended
to the input path somewhat like this.

Thanks!


diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..c666e29441b4 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -381,6 +381,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
XFRM_SKB_CB(skb)->seq.input.hi = seq_hi;
 
skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   if (net_ratelimit())
+   printk(KERN_CRIT "OH CRAP\n");
+   goto drop;
+   }
+
dev_hold(skb->dev);
 
if (crypto_done)



> Attached, the offending oops against 4.18.
> 
> Output of decodecode:
> 
> [ 37.134864] Code: 8b 44 24 70 0f c8 89 87 b4 00 00 00 48 8b 86 20 05 00 00
> 8b 80 f8 14 00 00 85 c0 75 05 48 85 d2 74 0e 48 8b 43 58 48 83 e0 fe <f6> 40
> 38 04 74 7d 44 89 b3 b4 00 00 00 49 8b 44 24 20 48 39 86 20
> All code
> 
>    0:   8b 44 24 70             mov    0x70(%rsp),%eax
>    4:   0f c8                   bswap  %eax
>    6:   89 87 b4 00 00 00       mov    %eax,0xb4(%rdi)
>    c:   48 8b 86 20 05 00 00    mov    0x520(%rsi),%rax
>   13:   8b 80 f8 14 00 00       mov    0x14f8(%rax),%eax
>   19:   85 c0                   test   %eax,%eax
>   1b:   75 05                   jne    0x22
>   1d:   48 85 d2                test   %rdx,%rdx
>   20:   74 0e                   je     0x30
>   22:   48 8b 43 58             mov    0x58(%rbx),%rax
>   26:   48 83 e0 fe             and    $0xfffffffffffffffe,%rax
>   2a:*  f6 40 38 04             testb  $0x4,0x38(%rax)  <-- trapping instruction
>   2e:   74 7d                   je     0xad
>   30:   44 89 b3 b4 00 00 00    mov    %r14d,0xb4(%rbx)
>   37:   49 8b 44 24 20          mov    0x20(%r12),%rax
>   3c:   48                      rex.W
>   3d:   39                      .byte 0x39
>   3e:   86 20                   xchg   %ah,(%rax)
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   f6 40 38 04             testb  $0x4,0x38(%rax)
>    4:   74 7d                   je     0x83
>    6:   44 89 b3 b4 00 00 00    mov    %r14d,0xb4(%rbx)
>    d:   49 8b 44 24 20          mov    0x20(%r12),%rax
>   12:   48                      rex.W
>   13:   39                      .byte 0x39
>   14:   86 20                   xchg   %ah,(%rax)
> 
> 
> if my understanding is correct, we fail here:
> 
> /build/linux-hwe-edge-yHKLQJ/linux-hwe-edge-4.18.0/include/net/xfrm.h:
> 1169return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
>0x0b19 <+185>:   testb  $0x4,0x38(%rax)
>0x0b1d <+189>:   je 0xb9c 
> 
> (gdb) list *0x0b19
> 0xb19 is in vti_rcv_cb
> (/build/linux-hwe-edge-yHKLQJ/linux-hwe-edge-4.18.0/include/net/xfrm.h:1169).
> 1164int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
> 1165
> 1166if (sk && sk->sk_policy[XFRM_POLICY_IN])
> 1167return __xfrm_policy_check(sk, ndir, skb, family);
> 1168
> 1169return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
> 1170(skb_dst(skb)->flags & DST_NOPOLICY) ||
> 1171__xfrm_policy_check(sk, ndir, skb, family);
> 1172}
> 1173
> 
> I really have a hard time understanding why the skb seems to be freed twice.
> 
> I'm not able to repeat the bug in the lab, but it happens regularly in
> prod; it seems to depend on the workload.
> 
> Any help will be appreciated.
> 
> Let me know if you need further informations.
> 
> Regards,
> 
> Jean-Philippe

> [   31.154360] BUG: unable to handle kernel NULL pointer dereference at 
> 0038
> [   31.162233] PGD 0 P4D 0
> [   31.164786] Oops: 0000 [#1] SMP PTI
> [   31.168291] CPU: 5 PID: 42 Comm: ksoftirqd/5 Not tainted 4.18.0-11-generic 
> #12~18.04.1-Ubuntu
> [   31.176854] Hardware name: Supermicro 

Re: [PATCH] [PATCH net-next] tun: fix multiqueue rx

2018-11-16 Thread Matt Cover
On Fri, Nov 16, 2018 at 1:10 PM Michael S. Tsirkin  wrote:
>
> On Fri, Nov 16, 2018 at 12:00:15AM -0700, Matthew Cover wrote:
> > When writing packets to a descriptor associated with a combined queue, the
> > packets should end up on that queue.
> >
> > Before this change all packets written to any descriptor associated with a
> > tap interface end up on rx-0, even when the descriptor is associated with a
> > different queue.
> >
> > The rx traffic can be generated by either of the following.
> >   1. a simple tap program which spins up multiple queues and writes packets
> >  to each of the file descriptors
> >   2. tx from a qemu vm with a tap multiqueue netdev
> >
> > The queue for rx traffic can be observed by either of the following (done
> > on the hypervisor in the qemu case).
> >   1. a simple netmap program which opens and reads from per-queue
> >  descriptors
> >   2. configuring RPS and doing per-cpu captures with rxtxcpu
> >
> > Alternatively, if you printk() the return value of skb_get_rx_queue() just
> > before each instance of netif_receive_skb() in tun.c, you will get 65535
> > for every skb.
> >
> > Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
> > the association between descriptor and rx queue.
> >
> > Signed-off-by: Matthew Cover 
>
> Acked-by: Michael S. Tsirkin 
>
> stable material?
>

Yes, I believe so.

The documentation below I think justifies classifying this as a fix.
https://github.com/torvalds/linux/blob/v4.19/Documentation/networking/tuntap.txt#L111

> > ---
> >  drivers/net/tun.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index a65779c6d72f..ce8620f3ea5e 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -1536,6 +1536,7 @@ static void tun_rx_batched(struct tun_struct *tun, 
> > struct tun_file *tfile,
> >
> >   if (!rx_batched || (!more && skb_queue_empty(queue))) {
> >   local_bh_disable();
> > + skb_record_rx_queue(skb, tfile->queue_index);
> >   netif_receive_skb(skb);
> >   local_bh_enable();
> >   return;
> > @@ -1555,8 +1556,11 @@ static void tun_rx_batched(struct tun_struct *tun, 
> > struct tun_file *tfile,
> >   struct sk_buff *nskb;
> >
> >   local_bh_disable();
> > - while ((nskb = __skb_dequeue(_queue)))
> > + while ((nskb = __skb_dequeue(_queue))) {
> > + skb_record_rx_queue(nskb, tfile->queue_index);
> >   netif_receive_skb(nskb);
> > + }
> > + skb_record_rx_queue(skb, tfile->queue_index);
> >   netif_receive_skb(skb);
> >   local_bh_enable();
> >   }
> > @@ -2452,6 +2456,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >   !tfile->detached)
> >   rxhash = __skb_get_hash_symmetric(skb);
> >
> > + skb_record_rx_queue(skb, tfile->queue_index);
> >   netif_receive_skb(skb);
> >
> >   stats = get_cpu_ptr(tun->pcpu_stats);
> > --
> > 2.15.2 (Apple Git-101.1)


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-16 Thread Cong Wang
On Thu, Nov 15, 2018 at 8:59 PM Herbert Xu  wrote:
>
> On Thu, Nov 15, 2018 at 08:52:23PM -0800, Eric Dumazet wrote:
> >
> > It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> > case non zero trailer bytes were added by a buggy switch (or host)
>
> We should probably change netdev_rx_csum_fault to print out at
> least one complete packet plus the hardware-generated checksum.
>
> That would make debugging these rare hardware faults much easier.

I have a patch as a starter:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7fe50ac83f4319c18ed7c634d85cad16bd0bf509

Let me know if you want to add more information there.

Dumping the hex of the skb data?

Thanks.
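
A minimal sketch of that idea, assuming we only dump the linear part of
the skb (print_hex_dump() is the stock kernel hex dumper; the prefix
string is made up):

#include <linux/printk.h>
#include <linux/skbuff.h>

/* Dump the directly-accessible packet bytes on a checksum fault;
 * skb_headlen() restricts the dump to the linear data area.
 */
static void dump_csum_fault_skb(const struct sk_buff *skb)
{
	print_hex_dump(KERN_ERR, "csum fault: ", DUMP_PREFIX_OFFSET,
		       16, 1, skb->data, skb_headlen(skb), false);
}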


RE: [PATCH iproute2-next v3] rdma: Document IB device renaming option

2018-11-16 Thread Ruhl, Michael J
>-Original Message-
>From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>ow...@vger.kernel.org] On Behalf Of Leon Romanovsky
>Sent: Sunday, November 4, 2018 2:11 PM
>To: David Ahern 
>Cc: Leon Romanovsky ; netdev
>; RDMA mailing list ;
>Stephen Hemminger 
>Subject: [PATCH iproute2-next v3] rdma: Document IB device renaming
>option
>
>From: Leon Romanovsky 

Hi Leon,

After looking at this and Steve Wise's changes for the ADDLINK/DELLINK,
it occurred to me that the driver that handed the name to ib_register_device()
might be interested in knowing that this name change occurred.

Are there plans to include a some kind of notify mechanism so drivers can
find out when things like this occur?

Is this something that should be done?

Thanks,

Mike
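
(For comparison, netdevices already expose renames through the notifier
chain; a minimal sketch of that existing mechanism follows, purely as an
analogy; it is not an RDMA core API:)

#include <linux/netdevice.h>
#include <linux/notifier.h>

/* NETDEV_CHANGENAME fires once the new name is in place */
static int rename_event(struct notifier_block *nb,
			unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	if (event == NETDEV_CHANGENAME)
		pr_info("netdev renamed, now %s\n", dev->name);
	return NOTIFY_DONE;
}

static struct notifier_block rename_nb = {
	.notifier_call = rename_event,
};

/* registered with register_netdevice_notifier(&rename_nb) at init */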

>[leonro@server /]$ lspci |grep -i Ether
>00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
>00:09.0 Ethernet controller: Mellanox Technologies MT27700 Family
>[ConnectX-4]
>[leonro@server /]$ sudo rdma dev
>1: mlx5_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455
>sys_image_guid 5254:00c0:fe12:3455
>[leonro@server /]$ sudo rdma dev set mlx5_0 name hfi1_0
>[leonro@server /]$ sudo rdma dev
>1: hfi1_0: node_type ca fw 3.8. node_guid 5254:00c0:fe12:3455
>sys_image_guid 5254:00c0:fe12:3455
>
>Signed-off-by: Leon Romanovsky 
>---
>Changelog:
>v2->v3:
> * Dropped "to be named" words from example section of man
>---
> man/man8/rdma-dev.8 | 15 ++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
>diff --git a/man/man8/rdma-dev.8 b/man/man8/rdma-dev.8
>index 461681b6..7c275180 100644
>--- a/man/man8/rdma-dev.8
>+++ b/man/man8/rdma-dev.8
>@@ -1,6 +1,6 @@
> .TH RDMA\-DEV 8 "06 Jul 2017" "iproute2" "Linux"
> .SH NAME
>-rdmak-dev \- RDMA device configuration
>+rdma-dev \- RDMA device configuration
> .SH SYNOPSIS
> .sp
> .ad l
>@@ -22,10 +22,18 @@ rdmak-dev \- RDMA device configuration
> .B rdma dev show
> .RI "[ " DEV " ]"
>
>+.ti -8
>+.B rdma dev set
>+.RI "[ " DEV " ]"
>+.BR name
>+.BR NEWNAME
>+
> .ti -8
> .B rdma dev help
>
> .SH "DESCRIPTION"
>+.SS rdma dev set - rename rdma device
>+
> .SS rdma dev show - display rdma device attributes
>
> .PP
>@@ -45,6 +53,11 @@ rdma dev show mlx5_3
> Shows the state of specified RDMA device.
> .RE
> .PP
>+rdma dev set mlx5_3 name rdma_0
>+.RS 4
>+Renames the mlx5_3 device to rdma_0.
>+.RE
>+.PP
>
> .SH SEE ALSO
> .BR rdma (8),
>--
>2.19.1



Re: [PATCH] [PATCH net-next] tun: fix multiqueue rx

2018-11-16 Thread Michael S. Tsirkin
On Fri, Nov 16, 2018 at 12:00:15AM -0700, Matthew Cover wrote:
> When writing packets to a descriptor associated with a combined queue, the
> packets should end up on that queue.
> 
> Before this change all packets written to any descriptor associated with a
> tap interface end up on rx-0, even when the descriptor is associated with a
> different queue.
> 
> The rx traffic can be generated by either of the following.
>   1. a simple tap program which spins up multiple queues and writes packets
>  to each of the file descriptors
>   2. tx from a qemu vm with a tap multiqueue netdev
> 
> The queue for rx traffic can be observed by either of the following (done
> on the hypervisor in the qemu case).
>   1. a simple netmap program which opens and reads from per-queue
>  descriptors
>   2. configuring RPS and doing per-cpu captures with rxtxcpu
> 
> Alternatively, if you printk() the return value of skb_get_rx_queue() just
> before each instance of netif_receive_skb() in tun.c, you will get 65535
> for every skb.
> 
> Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
> the association between descriptor and rx queue.
> 
> Signed-off-by: Matthew Cover 

Acked-by: Michael S. Tsirkin 

stable material?

> ---
>  drivers/net/tun.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index a65779c6d72f..ce8620f3ea5e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1536,6 +1536,7 @@ static void tun_rx_batched(struct tun_struct *tun, 
> struct tun_file *tfile,
>  
>   if (!rx_batched || (!more && skb_queue_empty(queue))) {
>   local_bh_disable();
> + skb_record_rx_queue(skb, tfile->queue_index);
>   netif_receive_skb(skb);
>   local_bh_enable();
>   return;
> @@ -1555,8 +1556,11 @@ static void tun_rx_batched(struct tun_struct *tun, 
> struct tun_file *tfile,
>   struct sk_buff *nskb;
>  
>   local_bh_disable();
> - while ((nskb = __skb_dequeue(&process_queue)))
> + while ((nskb = __skb_dequeue(&process_queue))) {
> + skb_record_rx_queue(nskb, tfile->queue_index);
>   netif_receive_skb(nskb);
> + }
> + skb_record_rx_queue(skb, tfile->queue_index);
>   netif_receive_skb(skb);
>   local_bh_enable();
>   }
> @@ -2452,6 +2456,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   !tfile->detached)
>   rxhash = __skb_get_hash_symmetric(skb);
>  
> + skb_record_rx_queue(skb, tfile->queue_index);
>   netif_receive_skb(skb);
>  
>   stats = get_cpu_ptr(tun->pcpu_stats);
> -- 
> 2.15.2 (Apple Git-101.1)


[PATCH mlx5-next 05/12] net/mlx5: EQ, Move all EQ logic to eq.c

2018-11-16 Thread Saeed Mahameed
Move completion EQs flows from main.c to eq.c, reasons:
1) It is where this logic belongs.
2) It will help centralize the EQ logic in one file for downstream
refactoring, and future extensions/updates.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 176 +
 .../net/ethernet/mellanox/mlx5/core/main.c| 179 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 3 files changed, 181 insertions(+), 176 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index e75272503027..4d79a4ccb758 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -820,6 +820,8 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev)
mlx5_eq_debugfs_cleanup(dev);
 }
 
+/* Async EQs */
+
 int mlx5_start_eqs(struct mlx5_core_dev *dev)
 {
struct mlx5_eq_table *table = &dev->priv.eq_table;
@@ -953,12 +955,186 @@ int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq,
return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 }
 
+/* Completion EQs */
+
+static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
+{
+   struct mlx5_priv *priv  = &mdev->priv;
+   int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
+   int irq = pci_irq_vector(mdev->pdev, vecidx);
+
+   if (!zalloc_cpumask_var(&priv->irq_info[vecidx].mask, GFP_KERNEL)) {
+   mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
+   return -ENOMEM;
+   }
+
+   cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
+   priv->irq_info[vecidx].mask);
+
+   if (IS_ENABLED(CONFIG_SMP) &&
+   irq_set_affinity_hint(irq, priv->irq_info[vecidx].mask))
+   mlx5_core_warn(mdev, "irq_set_affinity_hint failed, irq 
0x%.4x", irq);
+
+   return 0;
+}
+
+static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
+{
+   int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
+   struct mlx5_priv *priv  = &mdev->priv;
+   int irq = pci_irq_vector(mdev->pdev, vecidx);
+
+   irq_set_affinity_hint(irq, NULL);
+   free_cpumask_var(priv->irq_info[vecidx].mask);
+}
+
+static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
+{
+   int err;
+   int i;
+
+   for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
+   err = mlx5_irq_set_affinity_hint(mdev, i);
+   if (err)
+   goto err_out;
+   }
+
+   return 0;
+
+err_out:
+   for (i--; i >= 0; i--)
+   mlx5_irq_clear_affinity_hint(mdev, i);
+
+   return err;
+}
+
+static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
+{
+   int i;
+
+   for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
+   mlx5_irq_clear_affinity_hint(mdev, i);
+}
+
+void mlx5_free_comp_eqs(struct mlx5_core_dev *dev)
+{
+   struct mlx5_eq_table *table = &dev->priv.eq_table;
+   struct mlx5_eq *eq, *n;
+
+   mlx5_irq_clear_affinity_hints(dev);
+
+#ifdef CONFIG_RFS_ACCEL
+   if (dev->rmap) {
+   free_irq_cpu_rmap(dev->rmap);
+   dev->rmap = NULL;
+   }
+#endif
+   list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
+   list_del(&eq->list);
+   if (mlx5_destroy_unmap_eq(dev, eq))
+   mlx5_core_warn(dev, "failed to destroy EQ 0x%x\n",
+  eq->eqn);
+   kfree(eq);
+   }
+}
+
+int mlx5_alloc_comp_eqs(struct mlx5_core_dev *dev)
+{
+   struct mlx5_eq_table *table = &dev->priv.eq_table;
+   char name[MLX5_MAX_IRQ_NAME];
+   struct mlx5_eq *eq;
+   int ncomp_vec;
+   int nent;
+   int err;
+   int i;
+
+   INIT_LIST_HEAD(&table->comp_eqs_list);
+   ncomp_vec = table->num_comp_vectors;
+   nent = MLX5_COMP_EQ_SIZE;
+#ifdef CONFIG_RFS_ACCEL
+   dev->rmap = alloc_irq_cpu_rmap(ncomp_vec);
+   if (!dev->rmap)
+   return -ENOMEM;
+#endif
+   for (i = 0; i < ncomp_vec; i++) {
+   int vecidx = i + MLX5_EQ_VEC_COMP_BASE;
+
+   eq = kzalloc(sizeof(*eq), GFP_KERNEL);
+   if (!eq) {
+   err = -ENOMEM;
+   goto clean;
+   }
+
+#ifdef CONFIG_RFS_ACCEL
+   irq_cpu_rmap_add(dev->rmap, pci_irq_vector(dev->pdev, vecidx));
+#endif
+   snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", i);
+   err = mlx5_create_map_eq(dev, eq, vecidx, nent, 0,
+name, MLX5_EQ_TYPE_COMP);
+   if (err) {
+   kfree(eq);
+   goto clean;
+   }
+   mlx5_core_dbg(dev, "allocated completion EQN %d\n", eq->eqn);
+   /* add tail, to keep the list ordered, for mlx5_vector2eqn to 
work */
+

[PATCH mlx5-next 04/12] net/mlx5: EQ, Remove redundant completion EQ list lock

2018-11-16 Thread Saeed Mahameed
Completion EQs list is only modified on driver load/unload; locking is
not required, so remove it.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  2 --
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++--
 include/linux/mlx5/driver.h|  3 ---
 3 files changed, 3 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index fd5926daa0a6..e75272503027 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -810,8 +810,6 @@ int mlx5_eq_init(struct mlx5_core_dev *dev)
 {
int err;
 
-   spin_lock_init(&dev->priv.eq_table.lock);
-
err = mlx5_eq_debugfs_init(dev);
 
return err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f5e6d375a8cc..f692c2a42130 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -704,7 +704,6 @@ int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, 
int *eqn,
int err = -ENOENT;
int i = 0;
 
-   spin_lock(&table->lock);
list_for_each_entry_safe(eq, n, >comp_eqs_list, list) {
if (i++ == vector) {
*eqn = eq->eqn;
@@ -713,7 +712,6 @@ int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, 
int *eqn,
break;
}
}
-   spin_unlock(&table->lock);
 
return err;
 }
@@ -724,14 +722,11 @@ struct mlx5_eq *mlx5_eqn2eq(struct mlx5_core_dev *dev, 
int eqn)
struct mlx5_eq_table *table = &dev->priv.eq_table;
struct mlx5_eq *eq;
 
-   spin_lock(&table->lock);
-   list_for_each_entry(eq, &table->comp_eqs_list, list)
-   if (eq->eqn == eqn) {
-   spin_unlock(&table->lock);
+   list_for_each_entry(eq, &table->comp_eqs_list, list) {
+   if (eq->eqn == eqn)
return eq;
-   }
+   }
 
-   spin_unlock(&table->lock);
 
return ERR_PTR(-ENOENT);
 }
@@ -747,17 +742,13 @@ static void free_comp_eqs(struct mlx5_core_dev *dev)
dev->rmap = NULL;
}
 #endif
-   spin_lock(&table->lock);
list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
list_del(&eq->list);
-   spin_unlock(&table->lock);
if (mlx5_destroy_unmap_eq(dev, eq))
mlx5_core_warn(dev, "failed to destroy EQ 0x%x\n",
   eq->eqn);
kfree(eq);
-   spin_lock(&table->lock);
}
-   spin_unlock(&table->lock);
 }
 
 static int alloc_comp_eqs(struct mlx5_core_dev *dev)
@@ -798,9 +789,7 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
goto clean;
}
mlx5_core_dbg(dev, "allocated completion EQN %d\n", eq->eqn);
-   spin_lock(&table->lock);
list_add_tail(&eq->list, &table->comp_eqs_list);
-   spin_unlock(&table->lock);
}
 
return 0;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 4b62d71825c1..852e397c7624 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -484,9 +484,6 @@ struct mlx5_eq_table {
struct mlx5_eq  pfault_eq;
 #endif
int num_comp_vectors;
-   /* protect EQs list
-*/
-   spinlock_t  lock;
 };
 
 struct mlx5_uars_page {
-- 
2.19.1



[PATCH mlx5-next 07/12] net/mlx5: EQ, irq_info and rmap belong to eq_table

2018-11-16 Thread Saeed Mahameed
irq_info and rmap are EQ properties of the driver, and only needed for
EQ objects; move them to the eq_table EQ database structure.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 40 ++-
 include/linux/mlx5/driver.h   | 10 ++---
 3 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2839c30dd3a0..32ea47c28324 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1760,7 +1760,7 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
 
 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-   return cpumask_first(priv->mdev->priv.irq_info[ix + 
MLX5_EQ_VEC_COMP_BASE].mask);
+   return cpumask_first(priv->mdev->priv.eq_table.irq_info[ix + 
MLX5_EQ_VEC_COMP_BASE].mask);
 }
 
 static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
@@ -4960,7 +4960,7 @@ int mlx5e_netdev_init(struct net_device *netdev,
netif_carrier_off(netdev);
 
 #ifdef CONFIG_MLX5_EN_ARFS
-   netdev->rx_cpu_rmap = mdev->rmap;
+   netdev->rx_cpu_rmap = mdev->priv.eq_table.rmap;
 #endif
 
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 44ccd4206104..70f62f10065e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -694,7 +694,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq, u8 vecidx,
if (err)
goto err_in;
 
-   snprintf(priv->irq_info[vecidx].name, MLX5_MAX_IRQ_NAME, "%s@pci:%s",
+   snprintf(priv->eq_table.irq_info[vecidx].name, MLX5_MAX_IRQ_NAME, 
"%s@pci:%s",
 name, pci_name(dev->pdev));
 
eq->eqn = MLX5_GET(create_eq_out, out, eq_number);
@@ -702,7 +702,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq, u8 vecidx,
eq->dev = dev;
eq->doorbell = priv->uar->map + MLX5_EQ_DOORBEL_OFFSET;
err = request_irq(eq->irqn, handler, 0,
- priv->irq_info[vecidx].name, eq);
+ priv->eq_table.irq_info[vecidx].name, eq);
if (err)
goto err_eq;
 
@@ -952,17 +952,18 @@ static int set_comp_irq_affinity_hint(struct 
mlx5_core_dev *mdev, int i)
struct mlx5_priv *priv  = &mdev->priv;
int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
int irq = pci_irq_vector(mdev->pdev, vecidx);
+   struct mlx5_irq_info *irq_info = &priv->eq_table.irq_info[vecidx];

-   if (!zalloc_cpumask_var(&priv->irq_info[vecidx].mask, GFP_KERNEL)) {
+   if (!zalloc_cpumask_var(&irq_info->mask, GFP_KERNEL)) {
mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
return -ENOMEM;
}
 
cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
-   priv->irq_info[vecidx].mask);
+   irq_info->mask);
 
if (IS_ENABLED(CONFIG_SMP) &&
-   irq_set_affinity_hint(irq, priv->irq_info[vecidx].mask))
+   irq_set_affinity_hint(irq, irq_info->mask))
mlx5_core_warn(mdev, "irq_set_affinity_hint failed, irq 
0x%.4x", irq);
 
return 0;
@@ -973,9 +974,10 @@ static void clear_comp_irq_affinity_hint(struct 
mlx5_core_dev *mdev, int i)
int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
struct mlx5_priv *priv  = &mdev->priv;
int irq = pci_irq_vector(mdev->pdev, vecidx);
+   struct mlx5_irq_info *irq_info = &priv->eq_table.irq_info[vecidx];
 
irq_set_affinity_hint(irq, NULL);
-   free_cpumask_var(priv->irq_info[vecidx].mask);
+   free_cpumask_var(irq_info->mask);
 }
 
 static int set_comp_irq_affinity_hints(struct mlx5_core_dev *mdev)
@@ -1014,9 +1016,9 @@ static void destroy_comp_eqs(struct mlx5_core_dev *dev)
clear_comp_irqs_affinity_hints(dev);
 
 #ifdef CONFIG_RFS_ACCEL
-   if (dev->rmap) {
-   free_irq_cpu_rmap(dev->rmap);
-   dev->rmap = NULL;
+   if (table->rmap) {
+   free_irq_cpu_rmap(table->rmap);
+   table->rmap = NULL;
}
 #endif
list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list) {
@@ -1042,8 +1044,8 @@ static int create_comp_eqs(struct mlx5_core_dev *dev)
ncomp_vec = table->num_comp_vectors;
nent = MLX5_COMP_EQ_SIZE;
 #ifdef CONFIG_RFS_ACCEL
-   dev->rmap = alloc_irq_cpu_rmap(ncomp_vec);
-   if (!dev->rmap)
+   table->rmap = alloc_irq_cpu_rmap(ncomp_vec);
+   if (!table->rmap)
return -ENOMEM;
 #endif
for (i = 0; i < ncomp_vec; i++) {
@@ -1056,7 +1058,7 @@ static int create_comp_eqs(struct mlx5_core_dev *dev)
}
 
 #ifdef CONFIG_RFS_ACCEL
-   

[PATCH mlx5-next 11/12] {net,IB}/mlx5: Move Page fault EQ and ODP logic to RDMA

2018-11-16 Thread Saeed Mahameed
Use the new generic EQ API to move all ODP RDMA data structures and logic
from the mlx5 core driver into the mlx5_ib driver.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/infiniband/hw/mlx5/main.c |  10 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  15 +-
 drivers/infiniband/hw/mlx5/odp.c  | 281 +-
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |  34 ---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 252 
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |   8 -
 .../net/ethernet/mellanox/mlx5/core/main.c|  17 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 -
 include/linux/mlx5/driver.h   |  49 ---
 include/linux/mlx5/eq.h   |  21 ++
 10 files changed, 308 insertions(+), 381 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 6fbc0cba1bac..fcf4a0328a90 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6040,6 +6040,11 @@ static int mlx5_ib_stage_odp_init(struct mlx5_ib_dev 
*dev)
return mlx5_ib_odp_init_one(dev);
 }
 
+void mlx5_ib_stage_odp_cleanup(struct mlx5_ib_dev *dev)
+{
+   mlx5_ib_odp_cleanup_one(dev);
+}
+
 int mlx5_ib_stage_counters_init(struct mlx5_ib_dev *dev)
 {
if (MLX5_CAP_GEN(dev->mdev, max_qp_cnt)) {
@@ -6225,7 +6230,7 @@ static const struct mlx5_ib_profile pf_profile = {
 mlx5_ib_stage_dev_res_cleanup),
STAGE_CREATE(MLX5_IB_STAGE_ODP,
 mlx5_ib_stage_odp_init,
-NULL),
+mlx5_ib_stage_odp_cleanup),
STAGE_CREATE(MLX5_IB_STAGE_COUNTERS,
 mlx5_ib_stage_counters_init,
 mlx5_ib_stage_counters_cleanup),
@@ -6395,9 +6400,6 @@ static struct mlx5_interface mlx5_ib_interface = {
.add= mlx5_ib_add,
.remove = mlx5_ib_remove,
.event  = mlx5_ib_event,
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-   .pfault = mlx5_ib_pfault,
-#endif
.protocol   = MLX5_INTERFACE_PROTOCOL_IB,
 };
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index b651a7a6fde9..d01af2d829b8 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -880,6 +880,15 @@ struct mlx5_ib_lb_state {
boolenabled;
 };
 
+struct mlx5_ib_pf_eq {
+   struct mlx5_ib_dev  *dev;
+   struct mlx5_eq  *core;
+   struct work_struct   work;
+   spinlock_t   lock; /* Pagefaults spinlock */
+   struct workqueue_struct  *wq;
+   mempool_t*pool;
+};
+
 struct mlx5_ib_dev {
struct ib_deviceib_dev;
const struct uverbs_object_tree_def *driver_trees[7];
@@ -902,6 +911,8 @@ struct mlx5_ib_dev {
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
struct ib_odp_caps  odp_caps;
u64 odp_max_size;
+   struct mlx5_ib_pf_eqodp_pf_eq;
+
/*
 * Sleepable RCU that prevents destruction of MRs while they are still
 * being used by a page fault handler.
@@ -1158,9 +1169,8 @@ struct ib_mr *mlx5_ib_reg_dm_mr(struct ib_pd *pd, struct 
ib_dm *dm,
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev);
-void mlx5_ib_pfault(struct mlx5_core_dev *mdev, void *context,
-   struct mlx5_pagefault *pfault);
 int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev);
+void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev);
 int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long 
start,
@@ -1175,6 +1185,7 @@ static inline void mlx5_ib_internal_fill_odp_caps(struct 
mlx5_ib_dev *dev)
 }
 
 static inline int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev) { return 0; }
+static inline void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev) {}
 static inline int mlx5_ib_odp_init(void) { return 0; }
 static inline void mlx5_ib_odp_cleanup(void)   {}
 static inline void mlx5_odp_init_mr_cache_entry(struct mlx5_cache_ent *ent) {}
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 7d784b40e017..67b8fcd600c8 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -37,6 +37,46 @@
 #include "mlx5_ib.h"
 #include "cmd.h"
 
+#include 
+
+/* Contains the details of a pagefault. */
+struct mlx5_pagefault {
+   u32 bytes_committed;
+   u32 token;
+   u8  event_subtype;
+   u8  type;
+   union {
+   /* Initiator or send message responder pagefault details. */
+   struct {
+   /* Received 

[PATCH mlx5-next 08/12] net/mlx5: EQ, Privatize eq_table and friends

2018-11-16 Thread Saeed Mahameed
Move unnecessary EQ table structures and declarations from the
public include/linux/mlx5/driver.h into the private area of mlx5_core
and into eq.c/eq.h.

Introduce new mlx5 EQ APIs:

mlx5_comp_vectors_count(dev);
mlx5_comp_irq_get_affinity_mask(dev, vector);

And use them from mlx5_ib or mlx5e netdevice instead of direct access to
mlx5_core internal structures.
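
A short sketch of how an upper driver consumes the new accessors (the
helper below is illustrative; the two API names come from this patch):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/mlx5/driver.h>

/* Pick the first CPU in the affinity mask of completion vector ix */
static int pick_comp_vector_cpu(struct mlx5_core_dev *mdev, int ix)
{
	if (ix >= mlx5_comp_vectors_count(mdev))
		return -EINVAL;
	return cpumask_first(mlx5_comp_irq_get_affinity_mask(mdev, ix));
}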

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/infiniband/hw/mlx5/main.c |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/cq.c  |   5 +-
 .../net/ethernet/mellanox/mlx5/core/debugfs.c |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  10 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 102 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   1 +
 .../net/ethernet/mellanox/mlx5/core/health.c  |   1 +
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |  77 +
 .../net/ethernet/mellanox/mlx5/core/main.c|   7 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  15 ---
 include/linux/mlx5/driver.h   |  87 +--
 12 files changed, 179 insertions(+), 135 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index e9c428071df3..6fbc0cba1bac 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5337,7 +5337,7 @@ mlx5_ib_get_vector_affinity(struct ib_device *ibdev, int 
comp_vector)
 {
struct mlx5_ib_dev *dev = to_mdev(ibdev);
 
-   return mlx5_get_vector_affinity_hint(dev->mdev, comp_vector);
+   return mlx5_comp_irq_get_affinity_mask(dev->mdev, comp_vector);
 }
 
 /* The mlx5_ib_multiport_mutex should be held when calling this function */
@@ -5701,8 +5701,7 @@ int mlx5_ib_stage_init_init(struct mlx5_ib_dev *dev)
dev->ib_dev.node_type   = RDMA_NODE_IB_CA;
dev->ib_dev.local_dma_lkey  = 0 /* not supported for now */;
dev->ib_dev.phys_port_cnt   = dev->num_ports;
-   dev->ib_dev.num_comp_vectors=
-   dev->mdev->priv.eq_table.num_comp_vectors;
+   dev->ib_dev.num_comp_vectors= mlx5_comp_vectors_count(mdev);
dev->ib_dev.dev.parent  = &mdev->pdev->dev;

mutex_init(&dev->cap_mask_mutex);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 4b85abb5c9f7..6e55d2f37c6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include "mlx5_core.h"
+#include "lib/eq.h"
 
 #define TASKLET_MAX_TIME 2
 #define TASKLET_MAX_TIME_JIFFIES msecs_to_jiffies(TASKLET_MAX_TIME)
@@ -124,7 +125,7 @@ int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq,
goto err_cmd;
 
/* Add to async EQ CQ tree to recv async events */
-   err = mlx5_eq_add_cq(&dev->priv.eq_table.async_eq, cq);
+   err = mlx5_eq_add_cq(mlx5_get_async_eq(dev), cq);
if (err)
goto err_cq_add;
 
@@ -157,7 +158,7 @@ int mlx5_core_destroy_cq(struct mlx5_core_dev *dev, struct 
mlx5_core_cq *cq)
u32 in[MLX5_ST_SZ_DW(destroy_cq_in)] = {0};
int err;
 
-   err = mlx5_eq_del_cq(&dev->priv.eq_table.async_eq, cq);
+   err = mlx5_eq_del_cq(mlx5_get_async_eq(dev), cq);
if (err)
return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
index b76766fb6c67..a11e22d0b0cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include "mlx5_core.h"
+#include "lib/eq.h"
 
 enum {
QP_PID,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index d7fbd5b6ac95..aea74856c702 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -178,8 +178,7 @@ static inline int mlx5e_get_max_num_channels(struct 
mlx5_core_dev *mdev)
 {
return is_kdump_kernel() ?
MLX5E_MIN_NUM_CHANNELS :
-   min_t(int, mdev->priv.eq_table.num_comp_vectors,
- MLX5E_MAX_NUM_CHANNELS);
+   min_t(int, mlx5_comp_vectors_count(mdev), 
MLX5E_MAX_NUM_CHANNELS);
 }
 
 /* Use this function to get max num channels after netdev was created */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 32ea47c28324..c23caade31bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -49,6 +49,7 @@
 #include "lib/clock.h"
 #include "en/port.h"
 #include "en/xdp.h"
+#include "lib/eq.h"
 
 struct mlx5e_rq_param {
  

[PATCH mlx5-next 00/12] mlx5 core generic EQ API for RDMA ODP

2018-11-16 Thread Saeed Mahameed
Hi,

This patchset is for mlx5-next shared branch, and will be applied there
once the review is done.

This patchset introduces mostly refactoring work and EQ-related code updates to
allow moving the ODP RDMA-only logic from mlx5_core into mlx5_ib, where it
belongs, and will allow future updates and optimizations for the RDMA ODP
(On Demand Paging) feature to go only to the rdma tree.

Patch #1: Fixes the offsets of stored irq affinity hints inside mlx5
irq info array.

Patch #2,3,4: Remove unused fields, code and logic

Patch #5: Move all EQ related logic from main.c to eq.c to allow clear
and seamless refactoring for creating generic EQ management API.

Patch #6: Create mlx5 core EQs in one place, in order to have one entry
point to call from main file.

Patch #7,8: Move EQ related structures into eq_table mlx5 structure and
make eq_table fields and logic private to eq.c file.

Patch #9,10: Create one generic EQ struct and use it in different
EQ types (usages) e.g. (Async, Command, FW pages, completion and ODP)
Introduce generic EQ API to allow creating Generic EQs regardless of
their types; it will be used to create all mlx5 core EQs in mlx5_core and
the ODP EQ in mlx5_ib.

Patch #11: Move ODP logic out from mlx5_core eq.c into mlx5 rdma driver.
odp.c file.

Patch #12: Make the trivial EQE access methods inline.

Thanks,
Saeed.

---
 
Saeed Mahameed (12):
  net/mlx5: EQ, Use the right place to store/read IRQ affinity hint
  net/mlx5: EQ, Remove unused fields and structures
  net/mlx5: EQ, No need to store eq index as a field
  net/mlx5: EQ, Remove redundant completion EQ list lock
  net/mlx5: EQ, Move all EQ logic to eq.c
  net/mlx5: EQ, Create all EQs in one place
  net/mlx5: EQ, irq_info and rmap belong to eq_table
  net/mlx5: EQ, Privatize eq_table and friends
  net/mlx5: EQ, Different EQ types
  net/mlx5: EQ, Generic EQ
  {net,IB}/mlx5: Move Page fault EQ and ODP logic to RDMA
  net/mlx5: EQ, Make EQE access methods inline

 drivers/infiniband/hw/mlx5/main.c |  15 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  15 +-
 drivers/infiniband/hw/mlx5/odp.c  | 281 -
 drivers/net/ethernet/mellanox/mlx5/core/cq.c  |  15 +-
 .../net/ethernet/mellanox/mlx5/core/debugfs.c |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |  34 -
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  18 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 968 +++---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   3 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |  93 ++
 .../net/ethernet/mellanox/mlx5/core/main.c| 287 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  23 -
 include/linux/mlx5/cq.h   |   2 +-
 include/linux/mlx5/driver.h   | 151 +--
 include/linux/mlx5/eq.h   |  60 ++
 17 files changed, 1081 insertions(+), 901 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
 create mode 100644 include/linux/mlx5/eq.h

-- 
2.19.1



[PATCH mlx5-next 01/12] net/mlx5: EQ, Use the right place to store/read IRQ affinity hint

2018-11-16 Thread Saeed Mahameed
Currently the cpu affinity hint mask for completion EQs is stored and
read from the wrong place. Since reading and storing is done from the
same index, there is no actual issue with that, but internal irq_info
for completion EQs starts at the MLX5_EQ_VEC_COMP_BASE offset in the
irq_info array, so this patch changes the code to use the correct offset
to store and read the IRQ affinity hint.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c| 14 --
 include/linux/mlx5/driver.h   |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1243edbedc9e..2839c30dd3a0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1760,7 +1760,7 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
 
 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-   return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
+   return cpumask_first(priv->mdev->priv.irq_info[ix + 
MLX5_EQ_VEC_COMP_BASE].mask);
 }
 
 static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 28132c7dc05f..d5cea0a36e6a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -640,18 +640,19 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
 static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
struct mlx5_priv *priv  = &mdev->priv;
-   int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
+   int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
+   int irq = pci_irq_vector(mdev->pdev, vecidx);
 
-   if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
+   if (!zalloc_cpumask_var(&priv->irq_info[vecidx].mask, GFP_KERNEL)) {
mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
return -ENOMEM;
}
 
cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
-   priv->irq_info[i].mask);
+   priv->irq_info[vecidx].mask);
 
if (IS_ENABLED(CONFIG_SMP) &&
-   irq_set_affinity_hint(irq, priv->irq_info[i].mask))
+   irq_set_affinity_hint(irq, priv->irq_info[vecidx].mask))
mlx5_core_warn(mdev, "irq_set_affinity_hint failed, irq 
0x%.4x", irq);
 
return 0;
@@ -659,11 +660,12 @@ static int mlx5_irq_set_affinity_hint(struct 
mlx5_core_dev *mdev, int i)
 
 static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
+   int vecidx = MLX5_EQ_VEC_COMP_BASE + i;
struct mlx5_priv *priv  = >priv;
-   int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
+   int irq = pci_irq_vector(mdev->pdev, vecidx);
 
irq_set_affinity_hint(irq, NULL);
-   free_cpumask_var(priv->irq_info[i].mask);
+   free_cpumask_var(priv->irq_info[vecidx].mask);
 }
 
 static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index aa5963b5d38e..7d4ed995b4ce 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1309,7 +1309,7 @@ enum {
 static inline const struct cpumask *
 mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
 {
-   return dev->priv.irq_info[vector].mask;
+   return dev->priv.irq_info[vector + MLX5_EQ_VEC_COMP_BASE].mask;
 }
 
 #endif /* MLX5_DRIVER_H */
-- 
2.19.1



[PATCH mlx5-next 12/12] net/mlx5: EQ, Make EQE access methods inline

2018-11-16 Thread Saeed Mahameed
These are one/two-line generic EQ access methods; better to have them
declared static inline in eq.h.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 23 -
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  | 25 ++-
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 895401609c63..6ba8e401a0c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -46,7 +46,6 @@
 #include "diag/fw_tracer.h"
 
 enum {
-   MLX5_EQE_SIZE   = sizeof(struct mlx5_eqe),
MLX5_EQE_OWNER_INIT_VAL = 0x1,
 };
 
@@ -103,18 +102,6 @@ static int mlx5_cmd_destroy_eq(struct mlx5_core_dev *dev, 
u8 eqn)
return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
 
-static struct mlx5_eqe *get_eqe(struct mlx5_eq *eq, u32 entry)
-{
-   return mlx5_buf_offset(&eq->buf, entry * MLX5_EQE_SIZE);
-}
-
-static struct mlx5_eqe *next_eqe_sw(struct mlx5_eq *eq)
-{
-   struct mlx5_eqe *eqe = get_eqe(eq, eq->cons_index & (eq->nent - 1));
-
-   return ((eqe->owner & 1) ^ !!(eq->cons_index & eq->nent)) ? NULL : eqe;
-}
-
 static const char *eqe_type_str(u8 type)
 {
switch (type) {
@@ -202,16 +189,6 @@ static enum mlx5_dev_event port_subtype_event(u8 subtype)
return -1;
 }
 
-static void eq_update_ci(struct mlx5_eq *eq, int arm)
-{
-   __be32 __iomem *addr = eq->doorbell + (arm ? 0 : 2);
-   u32 val = (eq->cons_index & 0xffffff) | (eq->eqn << 24);
-
-   __raw_writel((__force u32)cpu_to_be32(val), addr);
-   /* We still want ordering, just not swabbing, so add a barrier */
-   mb();
-}
-
 static void general_event_handler(struct mlx5_core_dev *dev,
  struct mlx5_eqe *eqe)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
index 4cc2d442cef6..6d8c8a57d52b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
@@ -5,7 +5,8 @@
 #define __LIB_MLX5_EQ_H__
 #include 
 
-#define MLX5_MAX_IRQ_NAME   (32)
+#define MLX5_MAX_IRQ_NAME   (32)
+#define MLX5_EQE_SIZE   (sizeof(struct mlx5_eqe))
 
 struct mlx5_eq_tasklet {
struct list_head  list;
@@ -39,6 +40,28 @@ struct mlx5_eq_comp {
struct list_headlist;
 };
 
+static inline struct mlx5_eqe *get_eqe(struct mlx5_eq *eq, u32 entry)
+{
+   return mlx5_buf_offset(&eq->buf, entry * MLX5_EQE_SIZE);
+}
+
+static inline struct mlx5_eqe *next_eqe_sw(struct mlx5_eq *eq)
+{
+   struct mlx5_eqe *eqe = get_eqe(eq, eq->cons_index & (eq->nent - 1));
+
+   return ((eqe->owner & 1) ^ !!(eq->cons_index & eq->nent)) ? NULL : eqe;
+}
+
+static inline void eq_update_ci(struct mlx5_eq *eq, int arm)
+{
+   __be32 __iomem *addr = eq->doorbell + (arm ? 0 : 2);
+   u32 val = (eq->cons_index & 0xffffff) | (eq->eqn << 24);
+
+   __raw_writel((__force u32)cpu_to_be32(val), addr);
+   /* We still want ordering, just not swabbing, so add a barrier */
+   mb();
+}
+
 int mlx5_eq_table_init(struct mlx5_core_dev *dev);
 void mlx5_eq_table_cleanup(struct mlx5_core_dev *dev);
 int mlx5_eq_table_create(struct mlx5_core_dev *dev);
-- 
2.19.1



Re: [PATCH bpf-next v2] bpftool: make libbfd optional

2018-11-16 Thread Alexei Starovoitov
On Mon, Nov 12, 2018 at 1:44 PM Stanislav Fomichev  wrote:
>
> Make it possible to build bpftool without libbfd. libbfd and libopcodes are
> typically provided in dev/dbg packages (binutils-dev in debian) which we
> usually don't have installed on the fleet machines and we'd like a way to have
> bpftool version that works without installing any additional packages.
> This excludes support for disassembling jit-ted code and prints an error if
> the user tries to use these features.
>
> Tested by:
> cat > FEATURES_DUMP.bpftool <<EOF
> feature-libbfd=0
> feature-disassembler-four-args=1
> feature-reallocarray=0
> feature-libelf=1
> feature-libelf-mmap=1
> feature-bpf=1
> EOF
> FEATURES_DUMP=$PWD/FEATURES_DUMP.bpftool make
> ldd bpftool | grep libbfd
>
> Signed-off-by: Stanislav Fomichev 

applied, thanks


Re: [PATCH bpf-next v2] bpftool: make libbfd optional

2018-11-16 Thread Alexei Starovoitov
On Fri, Nov 16, 2018 at 08:52:23PM -0800, Stanislav Fomichev wrote:
> I actually wanted to follow up with a v2 when
> https://lkml.org/lkml/2018/11/16/243 reaches bpf-next (I got an ack
> already).

it will go via perf tree, so not related.

> Alternatively, I can follow up with another patch on top of that to fix
> libbfd feature detection (it's semi broken on ubuntu/fedora now).

I think you'd need to wait until all trees merge.

pls don't top post.

> On Fri, Nov 16, 2018 at 8:47 PM Alexei Starovoitov <
> alexei.starovoi...@gmail.com> wrote:
> 
> > On Mon, Nov 12, 2018 at 1:44 PM Stanislav Fomichev  wrote:
> > >
> > > Make it possible to build bpftool without libbfd. libbfd and libopcodes
> > are
> > > typically provided in dev/dbg packages (binutils-dev in debian) which we
> > > usually don't have installed on the fleet machines and we'd like a way
> > to have
> > > bpftool version that works without installing any additional packages.
> > > This excludes support for disassembling jit-ted code and prints an error
> > if
> > > the user tries to use these features.
> > >
> > > Tested by:
> > > cat > FEATURES_DUMP.bpftool <<EOF
> > > feature-libbfd=0
> > > feature-disassembler-four-args=1
> > > feature-reallocarray=0
> > > feature-libelf=1
> > > feature-libelf-mmap=1
> > > feature-bpf=1
> > > EOF
> > > FEATURES_DUMP=$PWD/FEATURES_DUMP.bpftool make
> > > ldd bpftool | grep libbfd
> > >
> > > Signed-off-by: Stanislav Fomichev 
> >
> > applied, thanks
> >


Re: [PATCH bpf-next v2] bpftool: make libbfd optional

2018-11-16 Thread Stanislav Fomichev
On 11/16, Alexei Starovoitov wrote:
> On Fri, Nov 16, 2018 at 08:52:23PM -0800, Stanislav Fomichev wrote:
> > I actually wanted to follow up with a v2 when
> > https://lkml.org/lkml/2018/11/16/243 reaches bpf-next (I got an ack
> > already).
> 
> it will go via perf tree, so not related.
My understanding was that you periodically merge whatever goes to Linus
back to bpf-next, so my plan was to wait for that event and propose a
v3.

> > Alternatively, I can follow up with another patch on top of that to fix
> > libbfd feature detection (it's semi broken on ubuntu/fedora now).
> 
> I think you'd need to wait until all trees merge.
Sure, that's no problem. I'll follow up with another patch whenever that
happens.

> pls don't top post.
Sorry, Gmail :-/

> > On Fri, Nov 16, 2018 at 8:47 PM Alexei Starovoitov <
> > alexei.starovoi...@gmail.com> wrote:
> > 
> > > On Mon, Nov 12, 2018 at 1:44 PM Stanislav Fomichev  
> > > wrote:
> > > >
> > > > Make it possible to build bpftool without libbfd. libbfd and libopcodes
> > > are
> > > > typically provided in dev/dbg packages (binutils-dev in debian) which we
> > > > usually don't have installed on the fleet machines and we'd like a way
> > > to have
> > > > bpftool version that works without installing any additional packages.
> > > > This excludes support for disassembling jit-ted code and prints an error
> > > if
> > > > the user tries to use these features.
> > > >
> > > > Tested by:
> > > > cat > FEATURES_DUMP.bpftool <<EOF
> > > > feature-libbfd=0
> > > > feature-disassembler-four-args=1
> > > > feature-reallocarray=0
> > > > feature-libelf=1
> > > > feature-libelf-mmap=1
> > > > feature-bpf=1
> > > > EOF
> > > > FEATURES_DUMP=$PWD/FEATURES_DUMP.bpftool make
> > > > ldd bpftool | grep libbfd
> > > >
> > > > Signed-off-by: Stanislav Fomichev 
> > >
> > > applied, thanks
> > >


Re: [PATCH net-next] selftests: add explicit test for multiple concurrent GRO sockets

2018-11-16 Thread David Miller
From: Paolo Abeni 
Date: Thu, 15 Nov 2018 03:24:05 +0100

> This covers for proper accounting of encap needed static keys
> 
> Signed-off-by: Paolo Abeni 

Applied.


Re: [PATCH net-next 1/8] net: eth: altera: tse_start_xmit ignores tx_buffer call response

2018-11-16 Thread David Miller
From: Dalon Westergreen 
Date: Wed, 14 Nov 2018 16:50:40 -0800

> @@ -202,7 +204,7 @@ int sgdma_tx_buffer(struct altera_tse_private *priv, 
> struct tse_buffer *buffer)
>   /* enqueue the request to the pending transmit queue */
>   queue_tx(priv, buffer);
>  
> - return 1;
> + return 0;

NETDEV_TX_OK.

And now you can make all of these functions properly return netdev_tx_t instead 
of int.


Re: [PATCH net-next v1 1/4] etf: Cancel timer if there are no pending skbs

2018-11-16 Thread David Miller
From: Vinicius Costa Gomes 
Date: Wed, 14 Nov 2018 17:26:32 -0800

> From: Jesus Sanchez-Palencia 
> 
> There is no point in firing the qdisc watchdog if there are no future
> skbs pending in the queue and the watchdog had been set previously.
> 
> Signed-off-by: Jesus Sanchez-Palencia 

Applied.


Re: [PATCH net-next v1 2/4] etf: Use cached rb_root

2018-11-16 Thread David Miller
From: Vinicius Costa Gomes 
Date: Wed, 14 Nov 2018 17:26:33 -0800

> From: Jesus Sanchez-Palencia 
> 
> ETF's peek() operation is heavily used so use an rb_root_cached instead
> and leverage rb_first_cached() which will run in O(1) instead of
> O(log n).
> 
> Even if on 'timesortedlist_clear()' we could be using rb_erase(), we
> choose to use rb_erase_cached(), because if in the future we allow
> runtime changes to ETF parameters, and need to do a '_clear()', this
> might cause some hard to debug issues.
> 
> Signed-off-by: Jesus Sanchez-Palencia 

Applied.
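
A minimal sketch of the rb_root_cached pattern referenced above, with
illustrative struct names: rb_first_cached() returns the cached leftmost
node in O(1), which is exactly what a hot peek() path wants.

#include <linux/rbtree.h>
#include <linux/types.h>

struct item {
	struct rb_node node;
	u64 key;
};

/* O(1) peek at the smallest element */
static struct item *peek_first(struct rb_root_cached *root)
{
	struct rb_node *n = rb_first_cached(root);

	return n ? rb_entry(n, struct item, node) : NULL;
}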


Re: [PATCH net-next v1 3/4] etf: Split timersortedlist_erase()

2018-11-16 Thread David Miller
From: Vinicius Costa Gomes 
Date: Wed, 14 Nov 2018 17:26:34 -0800

> From: Jesus Sanchez-Palencia 
> 
> This is just a refactor that will simplify the implementation of the
> next patch in this series which will drop all expired packets on the
> dequeue flow.
> 
> Signed-off-by: Jesus Sanchez-Palencia 

Applied.


Re: [PATCH net-next v1 4/4] etf: Drop all expired packets

2018-11-16 Thread David Miller
From: Vinicius Costa Gomes 
Date: Wed, 14 Nov 2018 17:26:35 -0800

> From: Jesus Sanchez-Palencia 
> 
> Currently on dequeue() ETF only drops the first expired packet, which
> causes a problem if the next packet is already expired. When this
> happens, the watchdog will be configured with a time in the past, fire
> straight away and the packet will finally be dropped once the dequeue()
> function of the qdisc is called again.
> 
> We can save quite a few cycles and improve the overall behavior of the
> qdisc if we drop all expired packets if the next packet is expired.
> This should allow ETF to recover faster from bad situations. But
> packet drops are still a very serious warning that the requirements
> imposed on the system aren't reasonable.
> 
> This was inspired by how the implementation of hrtimers use the
> rb_tree inside the kernel.
> 
> Signed-off-by: Jesus Sanchez-Palencia 

Applied.


Re: [PATCH net-next 0/3] dpaa2-eth: add bql support

2018-11-16 Thread David Miller
From: Ioana Ciocoi Radulescu 
Date: Wed, 14 Nov 2018 11:48:34 +

> The first two patches make minor tweaks to the driver to
> simplify bql implementation. The third patch adds the actual
> bql support.

Series applied, thanks!


Re: [PATCH] allow DSCP values in ip rulesB

2018-11-16 Thread David Miller
From: Pavel Balaev 
Date: Wed, 14 Nov 2018 17:30:37 +0300

> Hello, for now IP rules support only the old TOS values and we cannot use
> DSCP.
> 
> This patch adds support for DSCP values in IP rules:
> 
> $ ip r add default via 192.168.0.6 table test
> $ ip ru add tos 0x80 table test
> $ ip ru
> 0:from all lookup local 
> 32764:from all tos CS4 lookup test 
> 32766:from all lookup main 
> 32767:from all lookup default 
> $ ip r get fibmatch 8.8.8.9 tos 0x80
> default tos CS4 via 192.168.0.6 dev lan table test
> 
> Signed-off-by: Pavel Balaev 

Please repost this with all of your follow-up comments added to
the commit message.

And provide a proper subsystem prefix in your Subject line, such
as "ipv4: ".


Re: [PATCH bpf-next v2] filter: add BPF_ADJ_ROOM_DATA mode to bpf_skb_adjust_room()

2018-11-16 Thread Alexei Starovoitov
On Tue, Nov 13, 2018 at 05:35:17PM +0100, Nicolas Dichtel wrote:
> This new mode enables to add or remove an l2 header in a programmatic way
> with cls_bpf.
> For example, it enables to play with mpls headers.
> 
> Signed-off-by: Nicolas Dichtel 
> Acked-by: Martin KaFai Lau 

Acked-by: Alexei Starovoitov 

Daniel, thoughts?
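
A rough sketch of how a cls_bpf program might use the proposed mode
(BPF_ADJ_ROOM_DATA and its exact semantics come from this patch and are
not in any released header; bpf_helpers.h is the usual samples/selftests
header, and the label write itself is elided):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int push_mpls(struct __sk_buff *skb)
{
	/* grow 4 bytes at the start of the packet for an MPLS label */
	if (bpf_skb_adjust_room(skb, 4, BPF_ADJ_ROOM_DATA, 0))
		return TC_ACT_SHOT;
	/* ... write the label with bpf_skb_store_bytes() ... */
	return TC_ACT_OK;
}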



Re: [PATCH net-next] udp: fix jump label misuse

2018-11-16 Thread David Miller
From: Paolo Abeni 
Date: Thu, 15 Nov 2018 02:34:50 +0100

> The commit 60fb9567bf30 ("udp: implement complete book-keeping for
> encap_needed") introduced a severe misuse of jump label APIs, which
> syzbot, as reported by Eric, was able to exploit.
> 
> When multiple sockets/process can concurrently request (and than
> disable) the udp encap, we need to track the activation counter with
> *_inc()/*_dec() jump label variants, or we can experience bad things
> at disable time.
> 
> Fixes: 60fb9567bf30 ("udp: implement complete book-keeping for encap_needed")
> Reported-by: Eric Dumazet 
> Signed-off-by: Paolo Abeni 

Applied, thanks Paolo.
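
For reference, a minimal sketch of the counted jump-label pattern the
fix relies on (key name is illustrative):

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(demo_needed);

static void demo_enable(void)
{
	static_branch_inc(&demo_needed);	/* counted, nests safely */
}

static void demo_disable(void)
{
	static_branch_dec(&demo_needed);	/* off only after last user */
}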


Re: [PATCH v2 04/21] octeontx2-af: Relax resource lock into mutex

2018-11-16 Thread David Miller
From: sunil.kovv...@gmail.com
Date: Thu, 15 Nov 2018 16:29:29 +0530

> From: Stanislaw Kardach 
> 
> The resource locks does not need to be a spinlock as they are not
> used in any interrupt handling routines (only in bottom halves).
> Therefore relax them into a mutex so that later on we may use them
> in routines that might sleep.
> 
> Signed-off-by: Stanislaw Kardach 
> Signed-off-by: Sunil Goutham 

This is confusing because software interrupts are often called bottom
halves, and sleeping (and thus mutexes) is not allowed in them.


Re: [PATCH net-next 0/7] net: sched: gred: introduce per-virtual queue attributes

2018-11-16 Thread David Miller
From: Jakub Kicinski 
Date: Wed, 14 Nov 2018 22:23:44 -0800

> This series updates the GRED Qdisc.  The Qdisc matches nfp offload very
> well, but before we can offload it there are a number of improvements
> to make.
> 
> First few patches add extack messages to the Qdisc and pass extack
> to netlink validation.
> 
> Next a new netlink attribute group is added, to allow GRED to be
> extended more easily.  Currently GRED passes C structures as attributes,
> and even an array of C structs for virtual queue configuration.  User
> space has hard coded the expected length of that array, so adding new
> fields is not possible.
> 
> New two-level attribute group is added:
> 
>   [TCA_GRED_VQ_LIST]
> [TCA_GRED_VQ_ENTRY]
>   [TCA_GRED_VQ_DP]
>   [TCA_GRED_VQ_FLAGS]
>   [TCA_GRED_VQ_STAT_*]
> [TCA_GRED_VQ_ENTRY]
>   [TCA_GRED_VQ_DP]
>   [TCA_GRED_VQ_FLAGS]
>   [TCA_GRED_VQ_STAT_*]
> [TCA_GRED_VQ_ENTRY]
>...
> 
> Statistics are dump only. Patch 4 switches the byte counts to be 64 bit,
> and patch 5 introduces the new stats attributes for dump.  Patch 6
> switches RED flags to be per-virtual queue, and patch 7 allows them
> to be dumped and set at virtual queue granularity.

Nice work, series applied.
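
A hedged sketch of walking the two-level layout above with the standard
nested-attribute iterator (the TCA_GRED_VQ_* names come from this
series; validation and inner parsing are elided):

#include <net/netlink.h>
#include <linux/pkt_sched.h>

static void walk_vq_list(const struct nlattr *vq_list)
{
	struct nlattr *vq;
	int rem;

	nla_for_each_nested(vq, vq_list, rem) {
		if (nla_type(vq) != TCA_GRED_VQ_ENTRY)
			continue;
		/* nla_parse_nested() on vq yields TCA_GRED_VQ_DP etc. */
	}
}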


Re: [PATCH bpf-next] selftests/bpf: Fix uninitialized duration warning

2018-11-16 Thread Alexei Starovoitov
On Fri, Nov 9, 2018 at 6:20 PM Joe Stringer  wrote:
>
> Daniel Borkmann reports:
>
> test_progs.c: In function ‘main’:
> test_progs.c:81:3: warning: ‘duration’ may be used uninitialized in this 
> function [-Wmaybe-uninitialized]
>printf("%s:PASS:%s %d nsec\n", __func__, tag, duration);\
>^~
> test_progs.c:1706:8: note: ‘duration’ was declared here
>   __u32 duration;
> ^~~~
>
> Signed-off-by: Joe Stringer 

Applied, thanks.


Re: [PATCH net] ipv6: fix a dst leak when removing its exception

2018-11-16 Thread David Miller
From: Xin Long 
Date: Wed, 14 Nov 2018 00:48:28 +0800

> These is no need to hold dst before calling rt6_remove_exception_rt().
> The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(),
> which has been removed in Commit 93531c674315 ("net/ipv6: separate
> handling of FIB entries from dst based routes"). Otherwise, it will
> cause a dst leak.
> 
> This patch is to simply remove the dst_hold_safe() call before calling
> rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt().
> It's safe, because the removal of the exception that holds its dst's
> refcnt is protected by rt6_exception_lock.
> 
> Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst 
> based routes")
> Fixes: 23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling")
> Reported-by: Li Shuang 
> Signed-off-by: Xin Long 

Applied and queued up for -stable.


Re: [PATCH bpf-next v2 0/3] bpf: Support socket lookup in CGROUP_SOCK_ADDR progs

2018-11-16 Thread Alexei Starovoitov
On Fri, Nov 9, 2018 at 6:54 PM Andrey Ignatov  wrote:
>
> This patch set makes bpf_sk_lookup_tcp, bpf_sk_lookup_udp and
> bpf_sk_release helpers available in programs of type
> BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
>
> Patch 1 is a fix for bpf_sk_lookup_udp that was already merged to bpf
> (stable) tree. Here it's prerequisite for patch 3.
>
> Patch 2 is the main patch in the set, it makes the helpers available for
> BPF_PROG_TYPE_CGROUP_SOCK_ADDR and provides more details about use-case.
>
> Patch 3 adds selftest for new functionality.
>
> v1->v2:
> - remove "Split bpf_sk_lookup" patch since it was already split by:
>   commit c8123ead13a5 ("bpf: Extend the sk_lookup() helper to XDP
>   hookpoint.");
> - avoid unnecessary bpf_sock_addr_sk_lookup function.

applied, thanks


Re: [PATCH v3] net: Add trace events for all receive exit points

2018-11-16 Thread David Miller
From: Geneviève Bastien 
Date: Tue, 13 Nov 2018 15:13:26 -0500

> @@ -5222,9 +5228,14 @@ static void netif_receive_skb_list_internal(struct 
> list_head *head)
>   */
>  int netif_receive_skb(struct sk_buff *skb)
>  {
> + int ret;
> +
>   trace_netif_receive_skb_entry(skb);
>  
> - return netif_receive_skb_internal(skb);
> + ret = netif_receive_skb_internal(skb);
> + trace_netif_receive_skb_exit(skb, ret);

Every time I read this code from now on I'm going to say to myself
"oh crap, we reference 'skb' after it's potentially freed up"

I really don't like this.

I know only the pointer is used, but that pointer can be reallocated
to another SLAB object, even another SKB, by the time these exit
tracepoints execute.

Sorry, I can't really convince myself to apply this now.


Re: [patch net-next] net: 8021q: move vlan offload registrations into vlan_core

2018-11-16 Thread David Miller
From: Jiri Pirko 
Date: Tue, 13 Nov 2018 23:22:48 +0100

> From: Jiri Pirko 
> 
> Currently, the vlan packet offloads are registered only upon 8021q module
> load. However, even without this module loaded, the offloads could be
> utilized, for example by openvswitch datapath. As reported by Michael,
> that causes 2x to 5x performance improvement, depending on a testcase.
> 
> So move the vlan offload registrations into vlan_core and make this
> available even without 8021q module loaded.
> 
> Reported-by: Michael Shteinbok 
> Signed-off-by: Jiri Pirko 
> Tested-by: Michael Shteinbok 

Applied, thanks Jiri.


[next-queue PATCH v1 2/2] Documentation: igb: Add a section about CBS

2018-11-16 Thread Vinicius Costa Gomes
Add some pointers to the definition of the CBS algorithm, and some
notes about the limits of its implementation in the i210 family of
controllers.

Signed-off-by: Vinicius Costa Gomes 
---
 Documentation/networking/igb.rst | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/networking/igb.rst b/Documentation/networking/igb.rst
index ba16b86d5593..e87a4a72ea2d 100644
--- a/Documentation/networking/igb.rst
+++ b/Documentation/networking/igb.rst
@@ -177,6 +177,25 @@ rate limit using the IProute2 tool. Download the latest 
version of the
 IProute2 tool from Sourceforge if your version does not have all the features
 you require.
 
+Credit Based Shaper (Qav Mode)
+--
+When enabling the CBS qdisc in the hardware offload mode, traffic shaping using
+the CBS (described in the IEEE 802.1Q-2018 Section 8.6.8.2 and discussed in the
+Annex L) algorithm will run in the i210 controller, so it's more accurate and
+uses less CPU.
+
+When using offloaded CBS, and the traffic rate obeys the configured rate
+(doesn't go above it), CBS should have little to no effect on the latency.
+
+The offloaded version of the algorithm has some limits, caused by how the idle
+slope is expressed in the adapter's registers. It can only represent idle
+slopes in 16.38431 kbps units, which means that if an idle slope of 2576 kbps is
+requested, the controller will be configured to use an idle slope of ~2589 kbps,
+because the driver rounds the value up. For more details, see the comments on
+:c:func:`igb_config_tx_modes()`.
+
+NOTE: This feature is exclusive to i210 models.
+
 
 Support
 ===
-- 
2.19.1
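
To make the rounding above concrete, a small userspace calculation (the
16.38431 kbps unit comes from the text; everything else is illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double unit_kbps = 16.38431;	/* idle slope register unit */
	const double requested = 2576.0;	/* requested idle slope, kbps */
	int regval = (int)ceil(requested / unit_kbps);	/* driver rounds up */

	/* prints regval=158 effective=2588.7, i.e. the ~2589 kbps above */
	printf("regval=%d effective=%.1f kbps\n", regval, regval * unit_kbps);
	return 0;
}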



[next-queue PATCH v1 1/2] igb: Change RXPBSIZE size when setting Qav mode

2018-11-16 Thread Vinicius Costa Gomes
From: Jesus Sanchez-Palencia 

Section 4.5.9 of the datasheet says that the total size of all packet
buffers combined (TxPB 0 + 1 + 2 + 3 + RxPB + BMC2OS + OS2BMC) must not
exceed 60KB. Today we are configuring a total of 62KB, so reduce the
RxPB from 32KB to 30KB in order to respect that.

The choice of changing RxPBSIZE here is mainly because it seems more
correct to give more priority to the transmit packet buffers over the
receiver ones when running in Qav mode. Also, the BMC2OS and OS2BMC
sizes are already too short.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h | 1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 8a28f3388f69..01fcfc6f3415 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -334,6 +334,7 @@
 
 #define I210_RXPBSIZE_DEFAULT  0x000000A2 /* RXPBSIZE default */
 #define I210_RXPBSIZE_MASK 0x0000003F
+#define I210_RXPBSIZE_PB_30KB  0x0000001E
 #define I210_RXPBSIZE_PB_32KB  0x00000020
 #define I210_TXPBSIZE_DEFAULT  0x04000014 /* TXPBSIZE default */
 #define I210_TXPBSIZE_MASK 0xC0FFFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index bb4f3f64fbf0..e135adf46980 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1934,7 +1934,7 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
 
val = rd32(E1000_RXPBS);
val &= ~I210_RXPBSIZE_MASK;
-   val |= I210_RXPBSIZE_PB_32KB;
+   val |= I210_RXPBSIZE_PB_30KB;
wr32(E1000_RXPBS, val);
 
/* Section 8.12.9 states that MAX_TPKT_SIZE from DTXMXPKTSZ
-- 
2.19.1



Re: [PATCH bpf-next] bpf: libbpf: Fix bpf_program__next() API

2018-11-16 Thread Alexei Starovoitov
On Mon, Nov 12, 2018 at 03:44:53PM -0800, Martin KaFai Lau wrote:
> This patch restores the behavior in
> commit eac7d84519a3 ("tools: libbpf: don't return '.text' as a program for 
> multi-function programs")
> such that bpf_program__next() does not return pseudo programs in ".text".
> 
> Fixes: 0c19a9fbc9cd ("libbpf: cleanup after partial failure in 
> bpf_object__pin")
> Signed-off-by: Martin KaFai Lau 

Applied, Thanks



Re: [Patch net-next] net: remove unused skb_send_sock()

2018-11-16 Thread David Miller
From: Cong Wang 
Date: Mon, 12 Nov 2018 18:05:24 -0800

> Signed-off-by: Cong Wang 

Applied.


Re: Linux kernel hangs if using RV1108 with MSZ8863 switch with two ports connected

2018-11-16 Thread Andrew Lunn
On Fri, Nov 16, 2018 at 04:28:29PM -0200, Otavio Salvador wrote:
> Hi,
> 
> I have a custom design based on Rockchip RV1108 that uses an MSZ8863
> switch running kernel 4.19.
> 
> The dts part is as follows:
> 
>  {
> pinctrl-names = "default";
> pinctrl-0 = <_pins>;
> snps,reset-gpio = < RK_PC1 GPIO_ACTIVE_LOW>;
> snps,reset-active-low;
> clock_in_out = "output";
> status = "okay";
> };
> 
> RV1108 GMAC is connected to KSZ8863 port 3 and after kernel boots, I
> can put an Ethernet cable from my router to the uplink port of
> KSZ8863, which makes the RV1108 board and a Linux PC connected to the
> other KSZ8863 port to both get IP addresses.
> 
> So in this usecase the setup is working fine.
> 
> However, if the RV1108 board boots with both Ethernet cables to the
> KSZ8863 switch connected, then the kernel silently hangs.

Hi Otavio

By silently, you mean it prints nothing at all?

I would try building the kernel with all the lock debugging turned
on. That might find something even with your working case, if there is
a potential deadlock.

If the kernel dies very early, you might need to enable the "kernel
low-level debugging" option (DEBUG_LL on ARM) and EARLY_PRINTK, in order
to see anything.

  Andrew
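
For reference, a sketch of the debug options Andrew refers to. Option names
are from mainline Kconfig; exact availability depends on the tree and
architecture:

CONFIG_PROVE_LOCKING=y		# lockdep: reports potential deadlocks
CONFIG_DEBUG_ATOMIC_SLEEP=y	# catches sleeping in atomic context
CONFIG_DEBUG_LL=y		# ARM "kernel low-level debugging"
CONFIG_EARLY_PRINTK=y		# console output before the real console comes up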


Re: [PATCH net] net/sched: act_pedit: fix memory leak when IDR allocation fails

2018-11-16 Thread David Miller
From: Davide Caratti 
Date: Wed, 14 Nov 2018 12:17:25 +0100

> tcf_idr_check_alloc() can return a negative value, on allocation failures
> (-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.
> 
> Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable, thanks.
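
The shape of the leak and the fix, as a hedged sketch rather than the
actual act_pedit diff (variable names follow the commit message; the
surrounding code is approximated):

	err = tcf_idr_check_alloc(tn, &index, a, bind);
	if (err < 0) {			/* -ENOMEM or -ENOSPC */
		kfree(keys_ex);		/* previously leaked here */
		return err;
	}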


[PATCH net-next] net: align gnet_stats_basic_cpu struct

2018-11-16 Thread Eric Dumazet
This structure is small (12 or 16 bytes depending on 64bit
or 32bit kernels), but we do not want it spanning two cache lines.

Signed-off-by: Eric Dumazet 
---
 include/net/gen_stats.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 946bd53a9f81dff5946579514360a9e5eaf3489b..ca23860adbb956fcfff3605068fdedf59073ce1a 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -10,7 +10,7 @@
 struct gnet_stats_basic_cpu {
struct gnet_stats_basic_packed bstats;
struct u64_stats_sync syncp;
-};
+} __aligned(2 * sizeof(u64));
 
 struct net_rate_estimator;
 
-- 
2.19.1.1215.g8438c0b245-goog
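
A quick userspace check of the changelog's reasoning, using a 16-byte
stand-in struct rather than the kernel type: with 16-byte alignment, an
object of size 16 starts at offset 0, 16, 32 or 48 within a 64-byte cache
line, so it can never straddle two lines.

/* align_check.c -- illustrative only */
#include <assert.h>
#include <stdalign.h>

struct demo {				/* stand-in for gnet_stats_basic_cpu */
	unsigned long long bytes;
	unsigned long long packets;
} __attribute__((aligned(16)));

int main(void)
{
	static_assert(alignof(struct demo) == 16, "aligned to 16");
	static_assert(sizeof(struct demo) == 16, "16 bytes");
	return 0;
}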



Re: [RFC v1 2/3] vxlan: add support for underlay in non-default VRF

2018-11-16 Thread David Ahern
On 11/16/18 2:41 AM, Alexis Bauvin wrote:
> The case I am trying to cover here is the user creating a VXLAN device
> with eth0 as its lower device (ip link add vxlan0 type vxlan ... dev eth0),
> thus ignoring the fact that it should be br0 (the actual L3 interface). In
> this case, the only information available from the module's point of view
> is eth0. I may be wrong, but eth0 is indirectly "part" of vrf-blue (even
> if it is only L2), as packets flowing in from it would land in vrf-blue
> if L3.

for routing lookups, yes.

> 
> As for the device stacking, I am only interested in the VXLAN underlay: the
> VXLAN device itself could be in a specific VRF or not, it should not influence
> its underlay. 
> 
> +--+ +-+
> |  | | |
> | vrf-blue | | vrf-red |
> |  | | |
> ++-+ +++
>  ||
>  ||
> ++-+ +++
> |  | | |
> | br-blue  | | br-red  |
> |  | | |
> ++-+ +---+-+---+
>  |   | |
>  | +-+ +-+
>  | | |
> ++-++--++   +++
> |  |  lower device  |   |   | |
> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
> |  ||   |   | |
> +--++---+   +-+
> 
> 
> While I don't see any use case for having a bridged uplink when using
> VXLAN, someone may, and they would see different behavior depending on the
> lower device. In the above example, vxlan-red's lower device should be
> br-blue, but a user would expect the underlay VRF (vrf-blue) to still be
> taken into account if eth0 was used as the lower device.
> 
> A different approach would be to check if the lower device is a bridge.
> If not, fetch a potential master bridge. Then, with this L3/router
> interface, we fetch the l3mdev with l3mdev_master_ifindex_by_index (if any).

ok. got it. Add the above diagram to the commit message to document the
use case.

> 
>>
>>> This is because the underlying l3mdev_master_dev_rcu function fetches
>>> the master (br0 in this case), checks whether it is an l3mdev (which it
>>> is not), and returns its index if so.
>>>
>>> So if using l3mdev_master_dev_rcu, using eth0 as a lower device will
>>> still bind to no specific device, thus in the default VRF.
>>>
>>> Maybe I should have patched l3mdev_master_dev_rcu to do a recursive
>>> resolution (as vxlan_get_l3mdev does), but I don’t know the impact of
>>> such a change.
>>
>> no, that is definitely the wrong the approach.
> 
> Ok! What is the best approach in your opinion?
> 

Add the new function to l3mdev.c. The name should be consistent with the
others -- so something like l3mdev_master_upper_by_index (l3mdev for the
namespace, you are passing an index and wanting the master device but in
this case want to walk upper devices).

Also, annotate with expected locking.
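
A hedged sketch of what such a helper could look like. The name follows
David's suggestion, and the upper-device walk and RCU locking are
assumptions to be validated, not merged code:

/* Resolve @ifindex to the l3mdev it is (possibly indirectly) enslaved
 * to, walking master/upper devices; returns 0 if there is none.
 * Caller is assumed to hold rcu_read_lock(). */
static int l3mdev_master_upper_ifindex_by_index_rcu(struct net *net,
						    int ifindex)
{
	struct net_device *dev = dev_get_by_index_rcu(net, ifindex);

	while (dev && !netif_is_l3_master(dev))
		dev = netdev_master_upper_dev_get_rcu(dev);

	return dev ? dev->ifindex : 0;
}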