Re: [PATCH net] net: fib_rules: add protocol check in rule_find

2018-06-28 Thread Roopa Prabhu
On Wed, Jun 27, 2018 at 6:27 PM, Roopa Prabhu  wrote:
> From: Roopa Prabhu 
>
> After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
> delrule msgs into fib_nl2rule"), rule_find is strict about checking
> for an existing rule. rule_find must check against all
> user given attributes, else it may match against a subset
> of attributes and return an existing rule.
>
> In the below case, without support for protocol match, rule_find
> will match only against 'table main' and return an existing rule.
>
> $ip -4 rule add table main protocol boot
> RTNETLINK answers: File exists
>
> This patch adds protocol support to rule_find, forcing it to
> check for a protocol match when one is given by the user.
>
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs 
> into fib_nl2rule")
> Signed-off-by: Roopa Prabhu 
> ---
> I spent some time looking at all match keys today and protocol
> was the only missing one (protocol is not in a released kernel yet).
> The only way this could be avoided is to move back to the old loose
> rule_find. I am worried about this new strict checking surprising users,
> but going back to the previous loose checking does not seem right either.
> If there is a reason to believe that users did rely on the previous
> behaviour, I will be happy to revert. Here is another example of old and
> new behaviour.
>
> old rule_find behaviour:
> $ip -4 rule add table main protocol boot
> $ip -4 rule add table main protocol boot
> $ip -4 rule add table main protocol boot
> $ip rule show
> 0:  from all lookup local
> 32763:  from all lookup main  proto boot
> 32764:  from all lookup main  proto boot
> 32765:  from all lookup main  proto boot
> 32766:  from all lookup main
> 32767:  from all lookup default
>
> new rule_find behaviour (after this patch):
> $ip -4 rule add table main protocol boot
> $ip -4 rule add table main protocol boot
> RTNETLINK answers: File exists
>

I found the case where the new rule_find breaks for add.
$ip -4 rule add table main tos 10 fwmark 1
$ip -4 rule add table main tos 10
RTNETLINK answers: File exists

The key masks in the new and old rule need to be compared.
They cannot be easily compared today without an elaborate if-else block.
It's best to introduce key masks for easier and more accurate rule
comparison, but that is best done in net-next. I will submit an
incremental patch tomorrow to restore the previous rule_exists for the
add case, and the rest should be good.

The current patch in context is needed regardless.
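
To make the matching rules concrete, here is a small user-space sketch
(illustrative field names, not the kernel's) of the strict-match
principle, and of why comparing values alone falls short without key
masks:

#include <stdbool.h>
#include <stdint.h>

/* Sketch: every attribute the request supplies must match exactly;
 * an attribute left unset (0 here) matches anything. */
struct rule {
        uint32_t table;
        uint8_t  proto;   /* e.g. RTPROT_BOOT */
        uint8_t  tos;
        uint32_t fwmark;
};

static bool rule_matches_request(const struct rule *req,
                                 const struct rule *r)
{
        if (req->table != r->table)
                return false;
        if (req->proto && req->proto != r->proto)
                return false;
        if (req->tos && req->tos != r->tos)
                return false;
        /* Without a key mask, a request that omits fwmark (0) still
         * matches a rule that carries one: the add breakage above. */
        if (req->fwmark && req->fwmark != r->fwmark)
                return false;
        return true;
}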

Thanks (and sorry about the iterations).


Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus

2018-06-28 Thread David Miller
From: Carlos Munoz 
Date: Thu, 28 Jun 2018 14:20:05 -0700

> 
> 
> On 06/28/2018 01:41 AM, Andrew Lunn wrote:
>> External Email
>>
>>> +static char *mix_port;
>>> +module_param(mix_port, charp, 0444);
>>> +MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX 
>>> interfaces.");
>>> +
>>> +static char *pki_port;
>>> +module_param(pki_port, charp, 0444);
>>> +MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
>> Module parameters are generally not liked. Can you do without them?
> 
> These parameters change the kernel port assignment required by user
> space applications. We would rather keep them as they simplify the
> process.

This is actually a terrible user experience.

Please provide a way to do this by performing operations on a device object
after the driver loads.

Use something like devlink or similar if you have to.


Re: [PATCH net-next 0/4] ila: Cleanup

2018-06-28 Thread David Miller
From: Tom Herbert 
Date: Wed, 27 Jun 2018 14:38:58 -0700

> Perform some cleanup in ILA code. This includes:
> 
> - Fix rhashtable walk for cases where nl dumps are done with multiple
>   function calls. Add a skip index to skip over entries in
>   a node that have been previously visited. Call rhashtable_walk_peek
>   to avoid dropping items between calls to ila_nl_dump.
> - Call alloc_bucket_spinlocks to create bucket locks.
> - Split out module initialization and netlink definitions into
>   separate files.
> - Add ILA_CMD_FLUSH netlink command to clear the ILA translation table.

Series applied.


Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus

2018-06-28 Thread Chavva, Chandrakala
David,

How can we support NFS boot if we pass the parameters via devlink? Basically 
this determines what PHY to use from the device tree.

Chandra


From: David Miller 
Sent: Thursday, June 28, 2018 7:19:05 PM
To: Munoz, Carlos
Cc: and...@lunn.ch; Hill, Steven; netdev@vger.kernel.org; Chavva, Chandrakala
Subject: Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX 
Ethernet Nexus

External Email

From: Carlos Munoz 
Date: Thu, 28 Jun 2018 14:20:05 -0700

>
>
> On 06/28/2018 01:41 AM, Andrew Lunn wrote:
>> External Email
>>
>>> +static char *mix_port;
>>> +module_param(mix_port, charp, 0444);
>>> +MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX 
>>> interfaces.");
>>> +
>>> +static char *pki_port;
>>> +module_param(pki_port, charp, 0444);
>>> +MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
>> Module parameters are generally not liked. Can you do without them?
>
> These parameters change the kernel port assignment required by user
> space applications. We would rather keep them as they simplify the
> process.

This is actually a terrible user experience.

Please provide a way to do this by performing operations on a device object
after the driver loads.

Use something like devlink or similar if you have to.


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig  wrote:
>
> Note that for this removes the possibility of actually returning an
> error before waiting in poll.  We could still do this with an ERR_PTR
> in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> I'd like to defer that until actually required.

I'm still going to just revert the whole poll mess for now.

It's still completely broken. This helps things, but it doesn't fix
the fundamental issue: the new interface is strictly worse than the
old interface ever was.

So Christoph, if you don't like the traditional ->poll() method, and
you want something else for aio polling, I seriously will suggest that
you introduce a new f_op for *that*. Don't mess with the existing
->poll() function, don't make select() and poll() use a slower and
less capable function just because aio wants something else.

In other words, you need to see AIO as the less important case, not as
the driver for this.

I also want to understand what the AIO race was, and what the problem
with the poll() thing was. You claimed it was racy. I don't see it,
and it was never ever explained in the whole series. I should never
have pulled it in the first place if only for that reason, but I tend
to trust Al when it comes to the VFS layer, so I did. My bad.

So before we try this again, I most definitely want _explanations_.
And I want the whole approach to be very clear that AIO is the ugly
step-sister, not the driving force.

 Linus


[PATCH net-next 3/4] selftests: forwarding: Tweak tc filters for mirror-to-gretap tests

2018-06-28 Thread Petr Machata
When running mirror_gre_bridge_1d_vlan tests on veth, several issues
cause spurious failures:

- vlan_ethtype should be ip, not ipv6, even in the mirror-to-ip6gretap
  case, because the overlay packet is still IPv4.
- Similarly, ip_proto matches the innermost IP protocol, so it can't be
  used to filter out the GRE packet. Drop the corresponding condition.
- Because the above fixes the filters to match in slow path as well,
  they need to be made skip_hw so as not to double-count packets.

Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh | 6 --
 tools/testing/selftests/net/forwarding/mirror_gre_lib.sh| 2 +-
 tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh | 6 --
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git 
a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh 
b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
index 3bb4c2ba7b14..197e769c2ed1 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d_vlan.sh
@@ -74,12 +74,14 @@ test_vlan_match()
 
 test_gretap()
 {
-   test_vlan_match gt4 'vlan_id 555 vlan_ethtype ip' "mirror to gretap"
+   test_vlan_match gt4 'skip_hw vlan_id 555 vlan_ethtype ip' \
+   "mirror to gretap"
 }
 
 test_ip6gretap()
 {
-   test_vlan_match gt6 'vlan_id 555 vlan_ethtype ipv6' "mirror to 
ip6gretap"
+   test_vlan_match gt6 'skip_hw vlan_id 555 vlan_ethtype ip' \
+   "mirror to ip6gretap"
 }
 
 test_gretap_stp()
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh 
b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
index 619b469365be..1c18e332cd4f 100644
--- a/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
@@ -62,7 +62,7 @@ full_test_span_gre_dir_vlan_ips()
  "$backward_type" "$ip1" "$ip2"
 
tc filter add dev $h3 ingress pref 77 prot 802.1q \
-   flower $vlan_match ip_proto 0x2f \
+   flower $vlan_match \
action pass
mirror_test v$h1 $ip1 $ip2 $h3 77 10
tc filter del dev $h3 ingress pref 77
diff --git 
a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh 
b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
index 1ac5038ae256..d3e75bb6a2d8 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
@@ -88,12 +88,14 @@ test_vlan_match()
 
 test_gretap()
 {
-   test_vlan_match gt4 'vlan_id 555 vlan_ethtype ip' "mirror to gretap"
+   test_vlan_match gt4 'skip_hw vlan_id 555 vlan_ethtype ip' \
+   "mirror to gretap"
 }
 
 test_ip6gretap()
 {
-   test_vlan_match gt6 'vlan_id 555 vlan_ethtype ipv6' "mirror to 
ip6gretap"
+   test_vlan_match gt6 'skip_hw vlan_id 555 vlan_ethtype ip' \
+   "mirror to ip6gretap"
 }
 
 test_span_gre_forbidden_cpu()
-- 
2.4.11



[PATCH net-next 2/4] selftests: forwarding: lib: Avoid trapping soft devices

2018-06-28 Thread Petr Machata
There are several cases where traffic that would normally be forwarded
in silicon needs to be observed in slow path. That's achieved by
trapping such traffic, and the functions trap_install() and
trap_uninstall() realize that. However, such treatment is obviously
wrong if the device in question is actually a soft device not backed by
an ASIC.

Therefore try to trap if possible, but fall back to inserting a continue
if not.

Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/forwarding/lib.sh | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index ac1df4860fbe..d1f14f83979e 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -479,9 +479,15 @@ trap_install()
local dev=$1; shift
local direction=$1; shift
 
-   # For slow-path testing, we need to install a trap to get to
-   # slow path the packets that would otherwise be switched in HW.
-   tc filter add dev $dev $direction pref 1 flower skip_sw action trap
+   # Some devices may not support or need in-hardware trapping of traffic
+   # (e.g. the veth pairs that this library creates for non-existent
+   # loopbacks). Use continue instead, so that there is a filter in there
+   # (some tests check counters), and so that other filters are still
+   # processed.
+   tc filter add dev $dev $direction pref 1 \
+   flower skip_sw action trap 2>/dev/null \
+   || tc filter add dev $dev $direction pref 1 \
+  flower action continue
 }
 
 trap_uninstall()
@@ -489,11 +495,13 @@ trap_uninstall()
local dev=$1; shift
local direction=$1; shift
 
-   tc filter del dev $dev $direction pref 1 flower skip_sw
+   tc filter del dev $dev $direction pref 1 flower
 }
 
 slow_path_trap_install()
 {
+   # For slow-path testing, we need to install a trap to get to
+   # slow path the packets that would otherwise be switched in HW.
if [ "${tcflags/skip_hw}" != "$tcflags" ]; then
trap_install "$@"
fi
-- 
2.4.11



[PATCH net-next 0/4] Fixes for running mirror-to-gretap tests on veth

2018-06-28 Thread Petr Machata
The forwarding selftests infrastructure makes it possible to run the
individual tests on purely software netdevices. Names of interfaces to
run the test with can be passed as command line arguments to a test.
lib.sh then creates veth pairs backing the interfaces if none exist in
the system.

However, the tests need to recognize that they might be run on a soft
device. Many mirror-to-gretap tests are buggy in this regard. This patch
set aims to fix the problems in running mirror-to-gretap tests on veth
devices.

In patch #1, a service function is split out of setup_wait().
In patch #2, installing a trap is made optional.
In patch #3, tc filters in several tests are tweaked to work with veth.
In patch #4, the logic for waiting for a neighbor is fixed for veth.

Petr Machata (4):
  selftests: forwarding: lib: Split out setup_wait_dev()
  selftests: forwarding: lib: Avoid trapping soft devices
  selftests: forwarding: Tweak tc filters for mirror-to-gretap tests
  selftests: forwarding: mirror_gre_changes: Fix waiting for neighbor

 tools/testing/selftests/net/forwarding/lib.sh  | 41 +++---
 .../net/forwarding/mirror_gre_bridge_1d_vlan.sh|  6 ++--
 .../selftests/net/forwarding/mirror_gre_changes.sh | 11 ++
 .../selftests/net/forwarding/mirror_gre_lib.sh |  2 +-
 .../net/forwarding/mirror_gre_vlan_bridge_1q.sh|  6 ++--
 5 files changed, 39 insertions(+), 27 deletions(-)

-- 
2.4.11



[PATCH net-next 4/4] selftests: forwarding: mirror_gre_changes: Fix waiting for neighbor

2018-06-28 Thread Petr Machata
When running the test on soft devices, there's no mechanism to
gratuitously start resolving the neighbor for the remote tunnel endpoint.
So instead of passively waiting, wait for the device to be up, and then
probe the neighbor with a ping.

Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/forwarding/mirror_gre_changes.sh | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh 
b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
index aa29d46186a8..135902aa8b11 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
@@ -122,15 +122,8 @@ test_span_gre_egress_up()
# After setting the device up, wait for neighbor to get resolved so that
# we can expect mirroring to work.
ip link set dev $swp3 up
-   while true; do
-   ip neigh sh dev $swp3 $remote_ip nud reachable |
-   grep -q ^
-   if [[ $? -ne 0 ]]; then
-   sleep 1
-   else
-   break
-   fi
-   done
+   setup_wait_dev $swp3
+   ping -c 1 -I $swp3 $remote_ip &>/dev/null
 
quick_test_span_gre_dir $tundev ingress
mirror_uninstall $swp1 ingress
-- 
2.4.11



[PATCH net-next 1/4] selftests: forwarding: lib: Split out setup_wait_dev()

2018-06-28 Thread Petr Machata
Split out of setup_wait() a function setup_wait_dev() that waits for a
single device. This gives tests the opportunity to wait for a selected
device after they tinkered with its upness.

Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/forwarding/lib.sh | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 1dfdf14894e2..ac1df4860fbe 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -185,18 +185,25 @@ log_info()
echo "INFO: $msg"
 }
 
+setup_wait_dev()
+{
+   local dev=$1; shift
+
+   while true; do
+   ip link show dev $dev up \
+   | grep 'state UP' &> /dev/null
+   if [[ $? -ne 0 ]]; then
+   sleep 1
+   else
+   break
+   fi
+   done
+}
+
 setup_wait()
 {
for i in $(eval echo {1..$NUM_NETIFS}); do
-   while true; do
-   ip link show dev ${NETIFS[p$i]} up \
-   | grep 'state UP' &> /dev/null
-   if [[ $? -ne 0 ]]; then
-   sleep 1
-   else
-   break
-   fi
-   done
+   setup_wait_dev ${NETIFS[p$i]}
done
 
# Make sure links are ready.
-- 
2.4.11



Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-28 Thread Jiri Benc
On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote:
> Hmm... in practice we could steal top bits of the size parameter for
> some flags, since it seems to be limited to values < 256 today?  Is it
> worth it?
> 
> It would look something along the lines of:

Something like that, yes. I'll leave it to Daniel to review how much
sense it makes from the BPF side.

Thanks!

 Jiri


Re: [PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests

2018-06-28 Thread Davide Caratti
hello Lucas,

On Wed, 2018-06-27 at 14:50 -0400, Lucas Bates wrote:
> On Tue, Jun 26, 2018 at 10:51 AM, Davide Caratti  wrote:
> > On Tue, 2018-06-26 at 09:17 -0400, Keara Leibovitz wrote:
> > > Create unittests for the tc tunnel_key action.
> > > 
> > > 
> > > Signed-off-by: Keara Leibovitz 
> > > ---
> > >  .../tc-testing/tc-tests/actions/tunnel_key.json| 676 
> > > +
> > >  1 file changed, 676 insertions(+)
> > >  create mode 100644 
> > > tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > 
> > > diff --git 
> > > a/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json 
> > > b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > new file mode 100644
> > > index ..bfe522ac8177
> > 
> > hello Keara!
> > 
> > I think the 'teardown' stage in some of these tests should be reviewed.
> > Those that are meant to test invalid configurations (like dc6b) should
> > allow non-zero exit codes in the teardown stage, if the wrong
> configuration is caught by the userspace TC tool, before talking to the
> > kernel.
> > 
> > Otherwise, those tests will fail when they are invoked one by one with the
> > act_tunnel_key module unloaded.
> > 
> 
> Hi Davide, I thought I'd weigh in here.

glad to hear your feedback!

> In the short term, I think this is reasonable, but it's not a feasible
> long-term solution.  Here's why:
> 
> Allowing non-zero exit codes on setup and teardown was a precaution
> that needed to be implemented, as flushing actions in a freshly-booted
> kernel returned errors - certain actions would only allow you to flush
> after that action had been added.

I guess this is a desired behavior, and it's common to all TC actions:

# grep bpf /proc/modules
# tc actions flush action bpf
RTNETLINK answers: Invalid argument
We have an error flushing
# modprobe act_bpf
# tc actions flush action bpf
# echo $?
0

> But, doing this on so many test cases means that we can lose control
> of the test environment, especially since a lot of commands get copied
> between test cases.  One test's command under test becomes the next
> test case's setup command, etc.  This can cause false results and
> potentially waste a lot of time for someone trying to track down a
> bug... Or cause bugs to be missed.

I understand, you want to ensure that 'teardown' leaves the scenario in
the same state as before the 'setup' phase. Whether or not this
happened successfully, it's sane not to ignore the error code: otherwise,
test X will perturb test X+1.

> So, how to fix: we've had some discussions about it already.  Jiri had
> requested the addition of a config file (like the one at
> tools/testing/selftests/net/forwarding/config), and maybe an addition
> to the README for tdc for explanation.  People would then possibly be
> restricted to running one test case file at a time based on what
> options they had loaded...  This is still not ideal.

All this depends on where the error condition is caught. Some parameters
(like the invalid 'index' in act_bpf) are rejected within userspace TC,
some others (like the invalid bytecode for test f84a) in the kernel.

> I think the best possible fix is to add a new plugin for tdc to
> exclude tests based on the kernel config.  This would require the
> addition of a new optional field to the test case format, where any
> and all included modules required for the test to work would be
> listed.  The plugin would look at this information, do its best to
> determine if the currently running kernel supports it, and allows the
> test to run or be skipped as a result.
> 
> Let me show an example of the new field:
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
> > > @@ -0,0 +1,676 @@
> > > 
> > 
> > ...
> > 
> > > +{
> > > +"id": "dc6b",
> > > +"name": "Add tunnel_key set action with missing mandatory src_ip 
> > > parameter",
> > > +"category": [
> > > +"actions",
> > > +"tunnel_key"
> > > +],
> 
>"reqModules": [
>"CONFIG_NET_ACT_TUNNEL_KEY"
>],
> > > +"setup": [
> > > +[
> > > +"$TC actions flush action tunnel_key",
> > > +0,
> > > +1,
> > > +255
> > > +]
> > > +],
> > > +"cmdUnderTest": "$TC actions add action tunnel_key set dst_ip 
> > > 20.20.20.2 id 100",
> > > +"expExitCode": "255",
> > > +"verifyCmd": "$TC actions list action tunnel_key",
> > > +"matchPattern": "action order [0-9]+: tunnel_key set.*dst_ip 
> > > 20.20.20.2.*key_id 100",
> > > +"matchCount": "0",
> > > +"teardown": [
> > > +"$TC actions flush action tunnel_key"
> > > +]
> > > +},
> 
> As we venture into more and more complicated tests, where different
> modules would start getting mixed 

[PATCH bpf-net 06/14] bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE

2018-06-28 Thread Roman Gushchin
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in that
access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL check is required.

This patch introduces the BPF_PTR_TO_MAP_VALUE return type,
which is required to make the verifier accept programs
that do not check the map value pointer for NULL.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf.h   | 1 +
 kernel/bpf/verifier.c | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 709354a0608a..6d7e0dfc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -154,6 +154,7 @@ enum bpf_arg_type {
 enum bpf_return_type {
RET_INTEGER,/* function returns integer */
RET_VOID,   /* function doesn't return anything */
+   RET_PTR_TO_MAP_VALUE,   /* returns a pointer to map elem value 
*/
RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value 
or NULL */
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index de097a642c3f..cc0c7990f849 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2545,8 +2545,12 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
mark_reg_unknown(env, regs, BPF_REG_0);
} else if (fn->ret_type == RET_VOID) {
regs[BPF_REG_0].type = NOT_INIT;
-   } else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL) {
-   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
+   } else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL ||
+  fn->ret_type == RET_PTR_TO_MAP_VALUE) {
+   if (fn->ret_type == RET_PTR_TO_MAP_VALUE)
+   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
+   else
+   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
/* There is no offset yet applied, variable or fixed */
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].off = 0;
-- 
2.14.4



[PATCH bpf-net 05/14] bpf: extend bpf_prog_array to store pointers to the cgroup storage

2018-06-28 Thread Roman Gushchin
This patch converts bpf_prog_array from an array of prog pointers
to the array of struct bpf_prog_array_item elements.

This allows efficiently saving a cgroup storage pointer for each
bpf program attached to a cgroup.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf.h | 19 +-
 kernel/bpf/cgroup.c | 24 ++---
 kernel/bpf/core.c   | 76 +++--
 3 files changed, 66 insertions(+), 53 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4b3e42e5b6d0..709354a0608a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -348,9 +348,14 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const 
union bpf_attr *kattr,
  * The 'struct bpf_prog_array *' should only be replaced with xchg()
  * since other cpus are walking the array of pointers in parallel.
  */
+struct bpf_prog_array_item {
+   struct bpf_prog *prog;
+   struct bpf_cgroup_storage *cgroup_storage;
+};
+
 struct bpf_prog_array {
struct rcu_head rcu;
-   struct bpf_prog *progs[0];
+   struct bpf_prog_array_item items[0];
 };
 
 struct bpf_prog_array __rcu *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
@@ -371,7 +376,8 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu 
*old_array,
 
 #define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null) \
({  \
-   struct bpf_prog **_prog, *__prog;   \
+   struct bpf_prog_array_item *_item;  \
+   struct bpf_prog *_prog; \
struct bpf_prog_array *_array;  \
u32 _ret = 1;   \
preempt_disable();  \
@@ -379,10 +385,11 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu 
*old_array,
_array = rcu_dereference(array);\
if (unlikely(check_non_null && !_array))\
goto _out;  \
-   _prog = _array->progs;  \
-   while ((__prog = READ_ONCE(*_prog))) {  \
-   _ret &= func(__prog, ctx);  \
-   _prog++;\
+   _item = &_array->items[0];  \
+   while ((_prog = READ_ONCE(_item->prog))) {  \
+   bpf_cgroup_storage_set(_item->cgroup_storage);  \
+   _ret &= func(_prog, ctx);   \
+   _item++;\
}   \
 _out:  \
rcu_read_unlock();  \
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index f0a809868f92..14a1f6c94592 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,16 +117,20 @@ static int compute_effective_progs(struct cgroup *cgrp,
cnt = 0;
p = cgrp;
do {
-   if (cnt == 0 || (p->bpf.flags[type] & BPF_F_ALLOW_MULTI))
-   list_for_each_entry(pl,
-   &p->bpf.progs[type], node) {
-   if (!pl->prog)
-   continue;
-   rcu_dereference_protected(progs, 1)->
-   progs[cnt++] = pl->prog;
-   }
-   p = cgroup_parent(p);
-   } while (p);
+   if (cnt > 0 && !(p->bpf.flags[type] & BPF_F_ALLOW_MULTI))
+   continue;
+
+   list_for_each_entry(pl, &p->bpf.progs[type], node) {
+   if (!pl->prog)
+   continue;
+
+   rcu_dereference_protected(progs, 1)->
+   items[cnt].prog = pl->prog;
+   rcu_dereference_protected(progs, 1)->
+   items[cnt].cgroup_storage = pl->storage;
+   cnt++;
+   }
+   } while ((p = cgroup_parent(p)));
 
*array = progs;
return 0;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a9e6c04d0f4a..145f44cb0cad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1570,7 +1570,8 @@ struct bpf_prog_array __rcu *bpf_prog_array_alloc(u32 
prog_cnt, gfp_t flags)
 {
if (prog_cnt)
return kzalloc(sizeof(struct bpf_prog_array) +
-  sizeof(struct bpf_prog *) * (prog_cnt + 1),
+  sizeof(struct bpf_prog_array_item) *
+  (prog_cnt + 1),
   flags);
 
return &empty_prog_array.hdr;
@@ -1584,43 +1585,45 @@ void bpf_prog_array_free(struct bpf_prog_array __rcu 
*progs)
kfree_rcu(progs, rcu);
 }
 
-int 

[PATCH bpf-net 02/14] bpf: introduce cgroup storage maps

2018-06-28 Thread Roman Gushchin
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.

From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.

The only difference is that some operations are restricted:
  1) a user can't create new entries,
  2) a user can't remove existing entries.

The lookup from userspace is O(log n).
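
For reference, the (cgroup inode id, attachment type) key described
above amounts to a pair along these lines (a sketch; the authoritative
definition is in this patch's include/uapi/linux/bpf.h hunk, which is
truncated below):

/* Sketch of the two-part map key, as it would appear in the uapi
 * header. */
struct bpf_cgroup_storage_key {
        __u64 cgroup_inode_id;  /* cgroup inode id */
        __u32 attach_type;      /* program attach type */
};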

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf-cgroup.h |  38 +
 include/linux/bpf.h|   1 +
 include/linux/bpf_types.h  |   3 +
 include/uapi/linux/bpf.h   |   6 +
 kernel/bpf/Makefile|   1 +
 kernel/bpf/local_storage.c | 367 +
 kernel/bpf/verifier.c  |  12 ++
 7 files changed, 428 insertions(+)
 create mode 100644 kernel/bpf/local_storage.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 975fb4cf1bb7..b4e2e42c1d2a 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -3,19 +3,39 @@
 #define _BPF_CGROUP_H
 
 #include 
+#include 
 #include 
 
 struct sock;
 struct sockaddr;
 struct cgroup;
 struct sk_buff;
+struct bpf_map;
+struct bpf_prog;
 struct bpf_sock_ops_kern;
+struct bpf_cgroup_storage;
 
 #ifdef CONFIG_CGROUP_BPF
 
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+struct bpf_cgroup_storage_map;
+
+struct bpf_storage_buffer {
+   struct rcu_head rcu;
+   char data[0];
+};
+
+struct bpf_cgroup_storage {
+   struct bpf_storage_buffer *buf;
+   struct bpf_cgroup_storage_map *map;
+   struct bpf_cgroup_storage_key key;
+   struct list_head list;
+   struct rb_node node;
+   struct rcu_head rcu;
+};
+
 struct bpf_prog_list {
struct list_head node;
struct bpf_prog *prog;
@@ -76,6 +96,15 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
  short access, enum bpf_attach_type type);
 
+struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog);
+void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage);
+void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
+struct cgroup *cgroup,
+enum bpf_attach_type type);
+void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage);
+int bpf_cgroup_storage_assign(struct bpf_prog *prog, struct bpf_map *map);
+void bpf_cgroup_storage_release(struct bpf_prog *prog, struct bpf_map *map);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)\
 ({   \
@@ -194,6 +223,15 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog,
+   struct bpf_map *map) { return 0; }
+static inline void bpf_cgroup_storage_release(struct bpf_prog *prog,
+ struct bpf_map *map) {}
+static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(
+   struct bpf_prog *prog) { return 0; }
+static inline void bpf_cgroup_storage_free(
+   struct bpf_cgroup_storage *storage) {}
+
 #define cgroup_bpf_enabled (0)
 #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4d684ce3f5e..4b3e42e5b6d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -281,6 +281,7 @@ struct bpf_prog_aux {
struct bpf_prog *prog;
struct user_struct *user;
u64 load_time; /* ns since boottime */
+   struct bpf_map *cgroup_storage;
char name[BPF_OBJ_NAME_LEN];
 #ifdef CONFIG_SECURITY
void *security;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2d5549..add08be53b6f 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -37,6 +37,9 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_PERF_EVENT_ARRAY, 
perf_event_array_map_ops)
 #ifdef CONFIG_CGROUPS
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
 #endif
+#ifdef CONFIG_CGROUP_BPF
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
+#endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..7aa135e4c2f3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -75,6 

[PATCH bpf-net 00/14] bpf: cgroup local storage

2018-06-28 Thread Roman Gushchin
This patchset implements cgroup local storage for bpf programs.
The main idea is to provide a fast accessible memory for storing
various per-cgroup data, e.g. number of transmitted packets.

Cgroup local storage looks like a special type of map to userspace,
and is accessible using generic bpf maps API for reading and
updating of the data. The (cgroup inode id, attachment type) pair
is used as a map key.

A user can't create new entries or destroy existing entries;
it happens automatically when a user attaches/detaches a bpf program
to/from a cgroup.

From a bpf program's point of view, cgroup storage is accessible
without lookup using the special get_local_storage() helper function.
It takes a map fd as an argument. It always returns a valid pointer
to the corresponding memory area.
To implement such lookup-free access, a pointer to the cgroup
storage is saved for an attachment of a bpf program to a cgroup,
if required by the program. Before running the program, it's saved
in a special global per-cpu variable, which is accessible from the
get_local_storage() helper.

This patchset implements only cgroup local storage; however, the API
is intentionally made extensible to support other local storage types
in the future: e.g. thread local storage, socket local storage, etc.
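
As an illustration of that flow, here is a minimal BPF-C sketch in the
style of this set's selftests (it assumes the map type and the
bpf_get_local_storage() declaration this set adds to bpf_helpers.h; it
is not code from the set itself):

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Sketch: count packets per (cgroup, attach type) in cgroup storage.
 * No NULL check is needed; the helper returns a valid pointer. */
struct bpf_map_def SEC("maps") pkt_cnt = {
        .type = BPF_MAP_TYPE_CGROUP_STORAGE,
        .key_size = sizeof(struct bpf_cgroup_storage_key),
        .value_size = sizeof(__u64),
        .max_entries = 0, /* entries are managed by the kernel */
};

SEC("cgroup/skb")
int count_egress(struct __sk_buff *skb)
{
        __u64 *cnt = bpf_get_local_storage(&pkt_cnt, 0);

        __sync_fetch_and_add(cnt, 1);
        return 1; /* allow the packet */
}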

Patch (1) adds an ability to charge bpf maps for consuming memory
dynamically.
Patch (2) introduces cgroup storage maps.
Patch (3) implements a mechanism to pass cgroup storage pointer
to a bpf program.
Patch (4) implements allocation/releasing of cgroup local storage
on attaching/detaching of a bpf program to/from a cgroup.
Patch (5) extends bpf_prog_array to store cgroup storage pointers.
Patch (6) introduces BPF_PTR_TO_MAP_VALUE, required to skip
non-necessary NULL-check in bpf programs.
Patch (7) disables creation of maps of cgroup storage maps.
Patch (8) introduces the get_local_storage() helper.
Patch (9) syncs bpf.h to tools/.
Patch (10) adds cgroup storage maps support to bpftool.
Patch (11) adds support for testing programs which are using
cgroup storage without actually attaching them to cgroups.
Patches (12), (13) and (14) are adding necessary tests.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Martin KaFai Lau 

Roman Gushchin (14):
  bpf: add ability to charge bpf maps memory dynamically
  bpf: introduce cgroup storage maps
  bpf: pass a pointer to a cgroup storage using pcpu variable
  bpf: allocate cgroup storage entries on attaching bpf programs
  bpf: extend bpf_prog_array to store pointers to the cgroup storage
  bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE
  bpf: don't allow create maps of cgroup local storages
  bpf: introduce the bpf_get_local_storage() helper function
  bpf: sync bpf.h to tools/
  bpftool: add support for CGROUP_STORAGE maps
  bpf/test_run: support cgroup local storage
  selftests/bpf: add verifier cgroup storage tests
  selftests/bpf: add a cgroup storage test
  samples/bpf: extend test_cgrp2_attach2 test to use cgroup storage

 include/linux/bpf-cgroup.h|  53 
 include/linux/bpf.h   |  25 +-
 include/linux/bpf_types.h |   3 +
 include/uapi/linux/bpf.h  |  19 +-
 kernel/bpf/Makefile   |   1 +
 kernel/bpf/cgroup.c   |  54 +++-
 kernel/bpf/core.c |  76 ++---
 kernel/bpf/helpers.c  |  20 ++
 kernel/bpf/local_storage.c| 369 ++
 kernel/bpf/map_in_map.c   |   3 +-
 kernel/bpf/syscall.c  |  53 +++-
 kernel/bpf/verifier.c |  38 ++-
 net/bpf/test_run.c|  13 +-
 net/core/filter.c |  23 +-
 samples/bpf/test_cgrp2_attach2.c  |  27 +-
 tools/bpf/bpftool/map.c   |   1 +
 tools/include/uapi/linux/bpf.h|   9 +-
 tools/testing/selftests/bpf/Makefile  |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 tools/testing/selftests/bpf/test_cgroup_storage.c | 130 
 tools/testing/selftests/bpf/test_verifier.c   | 123 +++-
 21 files changed, 965 insertions(+), 81 deletions(-)
 create mode 100644 kernel/bpf/local_storage.c
 create mode 100644 tools/testing/selftests/bpf/test_cgroup_storage.c

-- 
2.14.4



[PATCH bpf-net 12/14] selftests/bpf: add verifier cgroup storage tests

2018-06-28 Thread Roman Gushchin
Add the following verifier tests to cover the cgroup storage
functionality:
1) valid access to the cgroup storage
2) invalid access: use regular hashmap instead of cgroup storage map
3) invalid access: use invalid map fd
4) invalid access: try access memory after the cgroup storage
5) invalid access: try access memory before the cgroup storage
6) invalid access: call get_local_storage() with non-zero flags

For tests 2)-6), check the returned error strings.

Expected output:
  $ ./test_verifier
  #0/u add+sub+mul OK
  #0/p add+sub+mul OK
  #1/u DIV32 by 0, zero check 1 OK
  ...
  #280/p valid cgroup storage access OK
  #281/p invalid cgroup storage access 1 OK
  #282/p invalid cgroup storage access 2 OK
  #283/p invalid per-cgroup storage access 3 OK
  #284/p invalid cgroup storage access 4 OK
  #285/p invalid cgroup storage access 5 OK
  ...
  #649/p pass modified ctx pointer to helper, 2 OK
  #650/p pass modified ctx pointer to helper, 3 OK
  Summary: 901 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/test_verifier.c | 123 +++-
 2 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index f2f28b6c8915..ccd959fd940e 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -133,6 +133,8 @@ static int (*bpf_rc_keydown)(void *ctx, unsigned int 
protocol,
(void *) BPF_FUNC_rc_keydown;
 static unsigned long long (*bpf_get_current_cgroup_id)(void) =
(void *) BPF_FUNC_get_current_cgroup_id;
+static void *(*bpf_get_local_storage)(void *map, unsigned long long flags) =
+   (void *) BPF_FUNC_get_local_storage;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 2ecd27b670d7..7016fb2964a1 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -50,7 +50,7 @@
 
 #define MAX_INSNS  BPF_MAXINSNS
 #define MAX_FIXUPS 8
-#define MAX_NR_MAPS    7
+#define MAX_NR_MAPS    8
 #define POINTER_VALUE  0xcafe4all
 #define TEST_DATA_LEN  64
 
@@ -70,6 +70,7 @@ struct bpf_test {
int fixup_prog1[MAX_FIXUPS];
int fixup_prog2[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
+   int fixup_cgroup_storage[MAX_FIXUPS];
const char *errstr;
const char *errstr_unpriv;
uint32_t retval;
@@ -4630,6 +4631,104 @@ static struct bpf_test tests[] = {
.result = REJECT,
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
+   {
+   "valid cgroup storage access",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_get_local_storage),
+   BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_cgroup_storage = { 1 },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "invalid cgroup storage access 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_get_local_storage),
+   BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map1 = { 1 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 1 into func 
bpf_get_local_storage",
+   .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   },
+   {
+   "invalid cgroup storage access 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_LD_MAP_FD(BPF_REG_1, 1),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_get_local_storage),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "fd 1 is not pointing to valid bpf_map",
+   .prog_type = 

[PATCH bpf-net 01/14] bpf: add ability to charge bpf maps memory dynamically

2018-06-28 Thread Roman Gushchin
This commit extends the existing bpf maps memory charging API
to support dynamic charging/uncharging.

This is required to account for memory used by maps
whose entries are all created dynamically after
map initialization.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/syscall.c | 53 +---
 2 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7df32a3200f7..e4d684ce3f5e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -434,6 +434,8 @@ struct bpf_map * __must_check bpf_map_inc(struct bpf_map 
*map, bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_precharge_memlock(u32 pages);
+int bpf_map_charge_memlock(struct bpf_map *map, u32 pages);
+void bpf_map_uncharge_memlock(struct bpf_map *map, u32 pages);
 void *bpf_map_area_alloc(size_t size, int numa_node);
 void bpf_map_area_free(void *base);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35dc466641f2..e03aeeec01e0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -181,32 +181,55 @@ int bpf_map_precharge_memlock(u32 pages)
return 0;
 }
 
-static int bpf_map_charge_memlock(struct bpf_map *map)
+static int bpf_charge_memlock(struct user_struct *user, u32 pages)
 {
-   struct user_struct *user = get_current_user();
-   unsigned long memlock_limit;
+   unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-   memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
+   atomic_long_sub(pages, &user->locked_vm);
+   return -EPERM;
+   }
+   return 0;
+}
 
-   atomic_long_add(map->pages, &user->locked_vm);
+static int bpf_map_init_memlock(struct bpf_map *map)
+{
+   struct user_struct *user = get_current_user();
+   int ret;
 
-   if (atomic_long_read(&user->locked_vm) > memlock_limit) {
-   atomic_long_sub(map->pages, &user->locked_vm);
+   ret = bpf_charge_memlock(user, map->pages);
+   if (ret) {
free_uid(user);
-   return -EPERM;
+   return ret;
}
map->user = user;
-   return 0;
+   return ret;
 }
 
-static void bpf_map_uncharge_memlock(struct bpf_map *map)
+static void bpf_map_release_memlock(struct bpf_map *map)
 {
struct user_struct *user = map->user;
-
-   atomic_long_sub(map->pages, &user->locked_vm);
+   atomic_long_sub(map->pages, &map->user->locked_vm);
free_uid(user);
 }
 
+int bpf_map_charge_memlock(struct bpf_map *map, u32 pages)
+{
+   int ret;
+
+   ret = bpf_charge_memlock(map->user, pages);
+   if (ret)
+   return ret;
+   map->pages += pages;
+   return ret;
+}
+
+void bpf_map_uncharge_memlock(struct bpf_map *map, u32 pages)
+{
+   atomic_long_sub(pages, &map->user->locked_vm);
+   map->pages -= pages;
+}
+
 static int bpf_map_alloc_id(struct bpf_map *map)
 {
int id;
@@ -256,7 +279,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
struct bpf_map *map = container_of(work, struct bpf_map, work);
 
-   bpf_map_uncharge_memlock(map);
+   bpf_map_release_memlock(map);
security_bpf_map_free(map);
/* implementation dependent freeing */
map->ops->map_free(map);
@@ -492,7 +515,7 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_nouncharge;
 
-   err = bpf_map_charge_memlock(map);
+   err = bpf_map_init_memlock(map);
if (err)
goto free_map_sec;
 
@@ -515,7 +538,7 @@ static int map_create(union bpf_attr *attr)
return err;
 
 free_map:
-   bpf_map_uncharge_memlock(map);
+   bpf_map_release_memlock(map);
 free_map_sec:
security_bpf_map_free(map);
 free_map_nouncharge:
-- 
2.14.4



Re: [PATCH net-next 0/4] net: Geneve options support for TC act_tunnel_key

2018-06-28 Thread Jakub Kicinski
On Thu, 28 Jun 2018 16:17:31 +0900 (KST), David Miller wrote:
> From: Jakub Kicinski 
> Date: Tue, 26 Jun 2018 11:53:04 -0700
> 
> > Hi,
> > 
> > Simon & Pieter say:
> > 
> > This set adds Geneve Options support to the TC tunnel key action.
> > It provides the plumbing required to configure Geneve variable length
> > options.  The options can be configured in the form CLASS:TYPE:DATA,
> > where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit
> > hexadecimal value and DATA as a variable length hexadecimal value.
> > Additionally multiple options may be listed using a comma delimiter.  
> 
> Looks like there are some sparse endianness warnings to fix up as
> per kbuild robot.

Sorry about that!


Re: [PATCH bpf-net 00/14] bpf: cgroup local storage

2018-06-28 Thread Roman Gushchin
On Thu, Jun 28, 2018 at 09:34:44AM -0700, Roman Gushchin wrote:
> This patchset implements cgroup local storage for bpf programs.
> The main idea is to provide a fast accessible memory for storing
> various per-cgroup data, e.g. number of transmitted packets.

Just noticed a typo in the subject: "bpf-net" :)
Will resend the patchset.
Sorry for confusion.

Thanks,
Roman


Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-28 Thread Jakub Kicinski
On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:
> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:
> > Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
> > right approach since this is opaque info and solely defined by the BPF prog
> > that is using the generic helper.  
> 
> Wouldn't it make sense to introduce some safeguards here (in a backward
> compatible way, of course)? It's easy to mistakenly set data for a
> different tunnel type in a BPF program and then be surprised by the
> result. It might help users if such usage was detected by the kernel,
> one way or another.

Well, that's how it works today ;)

> I'm thinking about something like the BPF program voluntarily
> specifying the type of the data; if not specified, the wildcard would be
> used as it is now.

Hmm... in practice we could steal top bits of the size parameter for
some flags, since it seems to be limited to values < 256 today?  Is it
worth it?

It would look something along the lines of:

---

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..194b40efa8e8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2213,6 +2213,13 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
 #define BPF_F_CTXLEN_MASK  (0xfffffULL << 32)
 
+#define BPF_F_TUN_VXLAN        (1U << 31)
+#define BPF_F_TUN_GENEVE       (1U << 30)
+#define BPF_F_TUN_ERSPAN       (1U << 29)
+#define BPF_F_TUN_FLAGS_ALL    (BPF_F_TUN_VXLAN | \
+                                BPF_F_TUN_GENEVE | \
+                                BPF_F_TUN_ERSPAN)
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
diff --git a/net/core/filter.c b/net/core/filter.c
index dade922678f6..cc592a1e8945 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3576,6 +3576,22 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
 {
struct ip_tunnel_info *info = skb_tunnel_info(skb);
const struct metadata_dst *md = this_cpu_ptr(md_dst);
+   __be16 tun_flags = 0;
+   u32 flags;
+
+   BUILD_BUG_ON(BPF_F_TUN_FLAGS_ALL & IP_TUNNEL_OPTS_MAX);
+
+   flags = size & BPF_F_TUN_FLAGS_ALL;
+   size &= ~flags;
+   if (flags & BPF_F_TUN_VXLAN)
+   tun_flags |= TUNNEL_VXLAN_OPT;
+   if (flags & BPF_F_TUN_GENEVE)
+   tun_flags |= TUNNEL_GENEVE_OPT;
+   if (flags & BPF_F_TUN_ERSPAN)
+   tun_flags |= TUNNEL_ERSPAN_OPT;
+   /* User didn't specify the tunnel type, for backward compat set all */
+   if (!(tun_flags & TUNNEL_OPTIONS_PRESENT))
+   tun_flags |= TUNNEL_OPTIONS_PRESENT;
 
if (unlikely(info != &md->u.tun_info || (size & (sizeof(u32) - 1))))
return -EINVAL;


[PATCH net-next 05/10] net/smc: add pnetid support for SMC-D and ISM

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

SMC-D relies on PNETIDs to find usable SMC-D/ISM devices for an SMC
connection. This patch adds SMC-D/ISM support to the current PNETID
implementation.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 include/net/smc.h  |  1 +
 net/smc/smc_ism.c  |  2 ++
 net/smc/smc_pnet.c | 41 +
 net/smc/smc_pnet.h |  2 ++
 4 files changed, 46 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 824a7af8d654..9ef49f8b1002 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -73,6 +73,7 @@ struct smcd_dev {
struct smc_connection **conn;
struct list_head vlan;
struct workqueue_struct *event_wq;
+   u8 pnetid[SMC_MAX_PNETID_LEN];
 };
 
 struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index ca1ce42fd49f..f44e4dff244a 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -13,6 +13,7 @@
 #include "smc.h"
 #include "smc_core.h"
 #include "smc_ism.h"
+#include "smc_pnet.h"
 
 struct smcd_dev_list smcd_dev_list = {
.list = LIST_HEAD_INIT(smcd_dev_list.list),
@@ -227,6 +228,7 @@ struct smcd_dev *smcd_alloc_dev(struct device *parent, 
const char *name,
device_initialize(&smcd->dev);
dev_set_name(&smcd->dev, name);
smcd->ops = ops;
+   smc_pnetid_by_dev_port(parent, 0, smcd->pnetid);
 
spin_lock_init(&smcd->lock);
INIT_LIST_HEAD(&smcd->vlan);
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index cdc6e23b6ce1..1b6c066d3495 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -22,6 +22,7 @@
 
 #include "smc_pnet.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
 
 static struct nla_policy smc_pnet_policy[SMC_PNETID_MAX + 1] = {
[SMC_PNETID_NAME] = {
@@ -564,6 +565,27 @@ static void smc_pnet_find_roce_by_pnetid(struct net_device 
*ndev,
spin_unlock(&smc_ib_devices.lock);
 }
 
+static void smc_pnet_find_ism_by_pnetid(struct net_device *ndev,
+   struct smcd_dev **smcismdev)
+{
+   u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
+   struct smcd_dev *ismdev;
+
+   ndev = pnet_find_base_ndev(ndev);
+   if (smc_pnetid_by_dev_port(ndev->dev.parent, ndev->dev_port,
+  ndev_pnetid))
+   return; /* pnetid could not be determined */
+
+   spin_lock(&smcd_dev_list.lock);
+   list_for_each_entry(ismdev, &smcd_dev_list.list, list) {
+   if (!memcmp(ismdev->pnetid, ndev_pnetid, SMC_MAX_PNETID_LEN)) {
+   *smcismdev = ismdev;
+   break;
+   }
+   }
+   spin_unlock(&smcd_dev_list.lock);
+}
+
 /* Lookup of coupled ib_device via SMC pnet table */
 static void smc_pnet_find_roce_by_table(struct net_device *netdev,
struct smc_ib_device **smcibdev,
@@ -615,3 +637,22 @@ void smc_pnet_find_roce_resource(struct sock *sk,
 out:
return;
 }
+
+void smc_pnet_find_ism_resource(struct sock *sk, struct smcd_dev **smcismdev)
+{
+   struct dst_entry *dst = sk_dst_get(sk);
+
+   *smcismdev = NULL;
+   if (!dst)
+   goto out;
+   if (!dst->dev)
+   goto out_rel;
+
+   /* if possible, lookup via hardware-defined pnetid */
+   smc_pnet_find_ism_by_pnetid(dst->dev, smcismdev);
+
+out_rel:
+   dst_release(dst);
+out:
+   return;
+}
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index ad4455cde9e7..1e94fd4df7bc 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -17,6 +17,7 @@
 #endif
 
 struct smc_ib_device;
+struct smcd_dev;
 
 static inline int smc_pnetid_by_dev_port(struct device *dev,
 unsigned short port, u8 *pnetid)
@@ -33,5 +34,6 @@ void smc_pnet_exit(void);
 int smc_pnet_remove_by_ibdev(struct smc_ib_device *ibdev);
 void smc_pnet_find_roce_resource(struct sock *sk,
 struct smc_ib_device **smcibdev, u8 *ibport);
+void smc_pnet_find_ism_resource(struct sock *sk, struct smcd_dev **smcismdev);
 
 #endif
-- 
2.16.4



[PATCH net-next 04/10] net/smc: add base infrastructure for SMC-D and ISM

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

SMC supports two variants: SMC-R and SMC-D. For data transport, SMC-R
uses RDMA devices, SMC-D uses so-called Internal Shared Memory (ISM)
devices. An ISM device only allows shared memory communication between
SMC instances on the same machine. For example, this allows virtual
machines on the same host to communicate via SMC without RDMA devices.

This patch adds the base infrastructure for SMC-D and ISM devices to
the existing SMC code. It contains the following:

* ISM driver interface:
  This interface allows an ISM driver to register ISM devices in SMC. In
  the process, the driver provides a set of device ops for each device.
  SMC uses these ops to execute SMC specific operations on or transfer
  data over the device.

* Core SMC-D link group, connection, and buffer support:
  Link groups, SMC connections and SMC buffers (in smc_core) are
  extended to support SMC-D.

* SMC type checks:
  Some type checks are added to prevent using SMC-R specific code for
  SMC-D and vice versa.

To actually use SMC-D, additional changes to pnetid, CLC, CDC, etc. are
required. These are added in follow-up patches.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 include/net/smc.h  |  62 +++
 net/smc/Makefile   |   2 +-
 net/smc/af_smc.c   |  11 +-
 net/smc/smc_core.c | 270 +++
 net/smc/smc_core.h |  71 +
 net/smc/smc_diag.c |   3 +-
 net/smc/smc_ism.c  | 304 +
 net/smc/smc_ism.h  |  48 +
 8 files changed, 679 insertions(+), 92 deletions(-)
 create mode 100644 net/smc/smc_ism.c
 create mode 100644 net/smc/smc_ism.h

diff --git a/include/net/smc.h b/include/net/smc.h
index 2173932fab9d..824a7af8d654 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -20,4 +20,66 @@ struct smc_hashinfo {
 
 int smc_hash_sk(struct sock *sk);
 void smc_unhash_sk(struct sock *sk);
+
+/* SMCD/ISM device driver interface */
+struct smcd_dmb {
+   u64 dmb_tok;
+   u64 rgid;
+   u32 dmb_len;
+   u32 sba_idx;
+   u32 vlan_valid;
+   u32 vlan_id;
+   void *cpu_addr;
+   dma_addr_t dma_addr;
+};
+
+#define ISM_EVENT_DMB  0
+#define ISM_EVENT_GID  1
+#define ISM_EVENT_SWR  2
+
+struct smcd_event {
+   u32 type;
+   u32 code;
+   u64 tok;
+   u64 time;
+   u64 info;
+};
+
+struct smcd_dev;
+
+struct smcd_ops {
+   int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
+   u32 vid);
+   int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+   int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+   int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+   int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+   int (*set_vlan_required)(struct smcd_dev *dev);
+   int (*reset_vlan_required)(struct smcd_dev *dev);
+   int (*signal_event)(struct smcd_dev *dev, u64 rgid, u32 trigger_irq,
+   u32 event_code, u64 info);
+   int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx,
+bool sf, unsigned int offset, void *data,
+unsigned int size);
+};
+
+struct smcd_dev {
+   const struct smcd_ops *ops;
+   struct device dev;
+   void *priv;
+   u64 local_gid;
+   struct list_head list;
+   spinlock_t lock;
+   struct smc_connection **conn;
+   struct list_head vlan;
+   struct workqueue_struct *event_wq;
+};
+
+struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
+   const struct smcd_ops *ops, int max_dmbs);
+int smcd_register_dev(struct smcd_dev *smcd);
+void smcd_unregister_dev(struct smcd_dev *smcd);
+void smcd_free_dev(struct smcd_dev *smcd);
+void smcd_handle_event(struct smcd_dev *dev, struct smcd_event *event);
+void smcd_handle_irq(struct smcd_dev *dev, unsigned int bit);
 #endif /* _SMC_H */
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 188104654b54..4df96b4b8130 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -1,4 +1,4 @@
 obj-$(CONFIG_SMC)  += smc.o
 obj-$(CONFIG_SMC_DIAG) += smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
-smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o
+smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index da7f02edcd37..8ce48799cf68 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -475,8 +475,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
int reason_code = 0;
 
mutex_lock(&smc_create_lgr_pending);
-   local_contact = smc_conn_create(smc, ibdev, ibport, &aclc->lcl,
-   aclc->hdr.flag);
+   local_contact = smc_conn_create(smc, false, aclc->hdr.flag, ibdev,
+   ibport, &aclc->lcl, NULL, 0);
if 

[PATCH net-next 03/10] net/smc: optimize consumer cursor updates

2018-06-28 Thread Ursula Braun
From: Ursula Braun 

The SMC protocol requires to send a separate consumer cursor update,
if it cannot be piggybacked to updates of the producer cursor.
Currently the decision to send a separate consumer cursor update
just considers the amount of data already received by the socket
program. It does not consider the amount of data already arrived, but
not yet consumed by the receiver. Basing the decision on the
difference between already confirmed and already arrived data
(instead of difference between already confirmed and already consumed
data), may lead to a somewhat earlier consumer cursor update being sent in
fast unidirectional traffic scenarios, and thus to better throughput.
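
As an illustration (numbers invented): with an RMB of 65536 bytes and an
rmbe_update_limit of 8192, assume the peer has produced 40000 bytes since
the last confirmed cursor while the receiving application has consumed
only 10000 of them. The old check compares the consumed amount (10000)
against len/2 and stays silent; the new check computes
sender_free = 65536 - 40000 = 25536 <= 32768 and sends the cursor update,
so the sender's view of free buffer space is refreshed sooner.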

Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 net/smc/smc_tx.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index cee666400752..f82886b7d1d8 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -495,7 +495,8 @@ void smc_tx_work(struct work_struct *work)
 
 void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 {
-   union smc_host_cursor cfed, cons;
+   union smc_host_cursor cfed, cons, prod;
+   int sender_free = conn->rmb_desc->len;
int to_confirm;
 
 	smc_curs_write(&cons,
@@ -505,11 +506,18 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force)
 		       smc_curs_read(&conn->rx_curs_confirmed, conn),
 		       conn);
 	to_confirm = smc_curs_diff(conn->rmb_desc->len, &cfed, &cons);
+	if (to_confirm > conn->rmbe_update_limit) {
+		smc_curs_write(&prod,
+			       smc_curs_read(&conn->local_rx_ctrl.prod, conn),
+			       conn);
+		sender_free = conn->rmb_desc->len -
+			      smc_curs_diff(conn->rmb_desc->len, &prod, &cfed);
+	}
 
if (conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
force ||
((to_confirm > conn->rmbe_update_limit) &&
-((to_confirm > (conn->rmb_desc->len / 2)) ||
+((sender_free <= (conn->rmb_desc->len / 2)) ||
  conn->local_rx_ctrl.prod_flags.write_blocked))) {
if ((smc_cdc_get_slot_and_msg_send(conn) < 0) &&
conn->alert_token_local) { /* connection healthy */
-- 
2.16.4



[PATCH net-next 06/10] net/smc: add SMC-D support in CLC messages

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

There are two types of SMC: SMC-R and SMC-D. These types are signaled
within the CLC messages during the CLC handshake. This patch adds
support for and checks of the SMC type.

Also, SMC-R and SMC-D need to exchange different information during the
CLC handshake. So, this patch extends the current message formats to
support the SMC-D header fields. The Proposal message can contain both
SMC-R and SMC-D information. The Accept and Confirm messages contain
either SMC-R or SMC-D information.
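
To illustrate the intended semantics (a sketch only; the SMC_TYPE_*
values are taken from smc_clc.h of this series, the helper itself is not
part of the patch):

	enum smc_type { SMC_TYPE_R = 0, SMC_TYPE_D = 1, SMC_TYPE_B = 3 };

	/* server side: derive the mode from the proposal's path field */
	static int smc_pick_type(enum smc_type path, bool have_ism,
				 bool have_rdma)
	{
		if ((path == SMC_TYPE_D || path == SMC_TYPE_B) && have_ism)
			return SMC_TYPE_D;	/* SMC-D preferred */
		if ((path == SMC_TYPE_R || path == SMC_TYPE_B) && have_rdma)
			return SMC_TYPE_R;
		return -1;			/* decline, fall back to TCP */
	}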

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 net/smc/af_smc.c  |   9 +--
 net/smc/smc_clc.c | 193 ++
 net/smc/smc_clc.h |  81 ++-
 3 files changed, 205 insertions(+), 78 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8ce48799cf68..20afa94be8bb 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -451,14 +451,14 @@ static int smc_check_rdma(struct smc_sock *smc, struct smc_ib_device **ibdev,
 }
 
 /* CLC handshake during connect */
-static int smc_connect_clc(struct smc_sock *smc,
+static int smc_connect_clc(struct smc_sock *smc, int smc_type,
   struct smc_clc_msg_accept_confirm *aclc,
   struct smc_ib_device *ibdev, u8 ibport)
 {
int rc = 0;
 
/* do inband token exchange */
-   rc = smc_clc_send_proposal(smc, ibdev, ibport);
+   rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, NULL);
if (rc)
return rc;
/* receive SMC Accept CLC message */
@@ -564,7 +564,7 @@ static int __smc_connect(struct smc_sock *smc)
return smc_connect_decline_fallback(smc, SMC_CLC_DECL_CNFERR);
 
/* perform CLC handshake */
-	rc = smc_connect_clc(smc, &aclc, ibdev, ibport);
+	rc = smc_connect_clc(smc, SMC_TYPE_R, &aclc, ibdev, ibport);
if (rc)
return smc_connect_decline_fallback(smc, rc);
 
@@ -1008,7 +1008,8 @@ static void smc_listen_work(struct work_struct *work)
smc_tx_init(new_smc);
 
/* check if RDMA is available */
-	if (smc_check_rdma(new_smc, &ibdev, &ibport) ||
+	if ((pclc->hdr.path != SMC_TYPE_R && pclc->hdr.path != SMC_TYPE_B) ||
+	    smc_check_rdma(new_smc, &ibdev, &ibport) ||
 	    smc_listen_rdma_check(new_smc, pclc) ||
 	    smc_listen_rdma_init(new_smc, pclc, ibdev, ibport,
 				 &local_contact) ||
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 717449b1da0b..038d70ef7892 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -23,9 +23,15 @@
 #include "smc_core.h"
 #include "smc_clc.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
+
+#define SMCR_CLC_ACCEPT_CONFIRM_LEN 68
+#define SMCD_CLC_ACCEPT_CONFIRM_LEN 48
 
 /* eye catcher "SMCR" EBCDIC for CLC messages */
 static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
+/* eye catcher "SMCD" EBCDIC for CLC messages */
+static const char SMCD_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xc4'};
 
 /* check if received message has a correct header length and contains valid
  * heading and trailing eyecatchers
@@ -38,10 +44,14 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm)
struct smc_clc_msg_decline *dclc;
struct smc_clc_msg_trail *trl;
 
-   if (memcmp(clcm->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)))
+   if (memcmp(clcm->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)) &&
+   memcmp(clcm->eyecatcher, SMCD_EYECATCHER, sizeof(SMCD_EYECATCHER)))
return false;
switch (clcm->type) {
case SMC_CLC_PROPOSAL:
+   if (clcm->path != SMC_TYPE_R && clcm->path != SMC_TYPE_D &&
+   clcm->path != SMC_TYPE_B)
+   return false;
pclc = (struct smc_clc_msg_proposal *)clcm;
pclc_prfx = smc_clc_proposal_get_prefix(pclc);
if (ntohs(pclc->hdr.length) !=
@@ -56,10 +66,16 @@ static bool smc_clc_msg_hdr_valid(struct smc_clc_msg_hdr *clcm)
break;
case SMC_CLC_ACCEPT:
case SMC_CLC_CONFIRM:
+   if (clcm->path != SMC_TYPE_R && clcm->path != SMC_TYPE_D)
+   return false;
clc = (struct smc_clc_msg_accept_confirm *)clcm;
-   if (ntohs(clc->hdr.length) != sizeof(*clc))
+   if ((clcm->path == SMC_TYPE_R &&
+ntohs(clc->hdr.length) != SMCR_CLC_ACCEPT_CONFIRM_LEN) ||
+   (clcm->path == SMC_TYPE_D &&
+ntohs(clc->hdr.length) != SMCD_CLC_ACCEPT_CONFIRM_LEN))
return false;
-	trl = &clc->trl;
+   trl = (struct smc_clc_msg_trail *)
+   ((u8 *)clc + ntohs(clc->hdr.length) - sizeof(*trl));
break;
case SMC_CLC_DECLINE:
dclc = (struct smc_clc_msg_decline *)clcm;
@@ -70,7 +86,8 @@ static bool 

Re: [PATCH bpf 1/4] xsk: fix potential lost completion message in SKB path

2018-06-28 Thread Song Liu
On Wed, Jun 27, 2018 at 7:02 AM, Magnus Karlsson wrote:
> The code in xskq_produce_addr erroneously checked if there
> was up to LAZY_UPDATE_THRESHOLD amount of space in the completion
> queue. It only needs to check if there is one slot left in the
> queue. This bug could under some circumstances lead to a WARN_ON_ONCE
> being triggered and the completion message to user space being lost.
>
> Fixes: 35fcde7f8deb ("xsk: support for Tx")
> Signed-off-by: Magnus Karlsson 
> Reported-by: Pavel Odintsov 

Acked-by: Song Liu 
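
To make the arithmetic concrete (illustrative numbers): with
nentries = 64, prod_tail = 100 and cons_tail = 40, the completion queue
still has 64 - (100 - 40) = 4 free slots. xskq_produce_addr() writes
exactly one entry per call, so one free slot (dcnt = 1) is the precise
requirement; the old dcnt of LAZY_UPDATE_THRESHOLD tied this availability
check to the lazy-refresh batching instead.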

> ---
>  net/xdp/xsk_queue.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index ef6a6f0ec949..52ecaf770642 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -62,14 +62,9 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
> return (entries > dcnt) ? dcnt : entries;
>  }
>
> -static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
> -{
> -   return q->nentries - (producer - q->cons_tail);
> -}
> -
>  static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
>  {
> -   u32 free_entries = xskq_nb_free_lazy(q, producer);
> +   u32 free_entries = q->nentries - (producer - q->cons_tail);
>
> if (free_entries >= dcnt)
> return free_entries;
> @@ -129,7 +124,7 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
>  {
> struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>
> -   if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
> +   if (xskq_nb_free(q, q->prod_tail, 1) == 0)
> return -ENOSPC;
>
> ring->desc[q->prod_tail++ & q->ring_mask] = addr;
> --
> 2.7.4
>


Re: [PATCH v1 net-next 13/14] net/sched: Enforce usage of CLOCK_TAI for sch_etf

2018-06-28 Thread Jesus Sanchez-Palencia



On 06/28/2018 07:26 AM, Willem de Bruijn wrote:
> On Wed, Jun 27, 2018 at 8:45 PM Jesus Sanchez-Palencia wrote:
>>
>> The qdisc and the SO_TXTIME ABIs allow for a clockid to be configured,
>> but it's been decided that usage of CLOCK_TAI should be enforced until
>> we decide to allow for other clockids to be used. The rationale here is
>> that PTP times are usually in the TAI scale, thus no other clocks should
>> be necessary.
>>
>> For now, the qdisc will return EINVAL if any clocks other than
>> CLOCK_TAI are used.
>>
>> Signed-off-by: Jesus Sanchez-Palencia 
>> ---
>>  net/sched/sch_etf.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>> index cd6cb5b69228..5514a8aa3bd5 100644
>> --- a/net/sched/sch_etf.c
>> +++ b/net/sched/sch_etf.c
>> @@ -56,8 +56,8 @@ static inline int validate_input_params(struct tc_etf_qopt *qopt,
>> return -ENOTSUPP;
>> }
>>
>> -   if (qopt->clockid >= MAX_CLOCKS) {
>> -   NL_SET_ERR_MSG(extack, "Invalid clockid");
>> +   if (qopt->clockid != CLOCK_TAI) {
>> +   NL_SET_ERR_MSG(extack, "Invalid clockid. CLOCK_TAI must be used");
> 
> Similar to the comment in patch 12, this should be squashed (into
> patch 6) to avoid incorrect behavior in a range of SHA1s.


Ok. Fixed for v2.

Thanks,
Jesus


Re: [PATCH v1 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready

2018-06-28 Thread Jesus Sanchez-Palencia



On 06/27/2018 04:56 PM, Eric Dumazet wrote:
> 
> 
> On 06/27/2018 02:59 PM, Jesus Sanchez-Palencia wrote:
>> Currently, skb_tx_timestamp() is being called before the DMA
>> descriptors are prepared in igb_xmit_frame_ring(), which happens
>> during either the igb_tso() or igb_tx_csum() calls.
>>
>> Given that now the skb->tstamp might be used to carry the timestamp
>> for SO_TXTIME, we must only call skb_tx_timestamp() after the
>> information has been copied into the DMA tx_ring.
> 
> 
> Since when this skb->tstamp use happened ?
> 
> If this is in patch 11/14 (igb: Add support for ETF offload), then you should 
> either :
> 
> 1) Squash this into 11/14
> 
> 2) swap 11 and 12 patch, so that this change is done before "igb: Add support 
> for ETF offload"  
> 
> Otherwise a bisection could fail badly.


OK. Fixed for v2 by swapping patches 11 and 12.

Thanks,
Jesus


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Jiri Pirko
Thu, Jun 28, 2018 at 05:50:08PM CEST, dsah...@gmail.com wrote:
>On 6/28/18 9:37 AM, Jiri Pirko wrote:
>
>>>>> Why this restriction? It's a template, so why can't it be removed
>>>>> regardless of whether there are filters?
>>>>
>>>> That means you could start to insert filters that does not match the
>>>> original template. I wanted to avoid it. The chain is utilized in hw for
>>>> the original template, the filter insertion would have to be sanitized
>>>> in driver. With this restriction, drivers can depend on filters always
>>>> be fitting.
>>>>
>>>
>>> Then the hardware driver should have that restriction not the core tc code.
>> 
>> But why? The same restriction would be in all drivers. I believe it is
>> better to have in in tc in single place. Drivers can then depend on it.
>> Do you have a usecase where you need to remove template for non-empty
>> chain?
>> 
>
>If the hardware has the limitation then the driver should be rejecting a
>change.

The behaviour I defend is symmetrical with "template add". It is also
only possible to add a template if the chain is empty.



[PATCH bpf-net 09/14] bpf: sync bpf.h to tools/

2018-06-28 Thread Roman Gushchin
Sync cgroup storage related changes:
1) new BPF_MAP_TYPE_CGROUP_STORAGE map type
2) struct bpf_cgroup_storage_key definition
3) get_local_storage() helper

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 tools/include/uapi/linux/bpf.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e0b06784f227..06e81dda 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -75,6 +75,11 @@ struct bpf_lpm_trie_key {
 	__u8	data[0];	/* Arbitrary size */
 };
 
+struct bpf_cgroup_storage_key {
+	__u64	cgroup_inode_id;	/* cgroup inode id */
+	__u32	attach_type;		/* program attach type */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -120,6 +125,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
+   BPF_MAP_TYPE_CGROUP_STORAGE,
 };
 
 enum bpf_prog_type {
@@ -2157,7 +2163,8 @@ union bpf_attr {
FN(rc_repeat),  \
FN(rc_keydown), \
FN(skb_cgroup_id),  \
-   FN(get_current_cgroup_id),
+   FN(get_current_cgroup_id),  \
+   FN(get_local_storage),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.14.4



[PATCH bpf-net 04/14] bpf: allocate cgroup storage entries on attaching bpf programs

2018-06-28 Thread Roman Gushchin
If a bpf program is using cgroup local storage, allocate
a bpf_cgroup_storage structure automatically on attaching the program
to a cgroup and save the pointer into the corresponding bpf_prog_list
entry.
Analogously, release the cgroup local storage on detaching
the bpf program.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf-cgroup.h |  1 +
 kernel/bpf/cgroup.c| 28 ++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 128fb0e39b4d..25ba744d2364 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -41,6 +41,7 @@ struct bpf_cgroup_storage {
 struct bpf_prog_list {
struct list_head node;
struct bpf_prog *prog;
+   struct bpf_cgroup_storage *storage;
 };
 
 struct bpf_prog_array;
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index f7c00bd6f8e4..f0a809868f92 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -34,6 +34,8 @@ void cgroup_bpf_put(struct cgroup *cgrp)
list_for_each_entry_safe(pl, tmp, progs, node) {
 		list_del(&pl->node);
 		bpf_prog_put(pl->prog);
+		bpf_cgroup_storage_unlink(pl->storage);
+		bpf_cgroup_storage_free(pl->storage);
 		kfree(pl);
 		static_branch_dec(&cgroup_bpf_enabled_key);
}
@@ -189,6 +191,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 {
 	struct list_head *progs = &cgrp->bpf.progs[type];
struct bpf_prog *old_prog = NULL;
+   struct bpf_cgroup_storage *storage, *old_storage = NULL;
struct cgroup_subsys_state *css;
struct bpf_prog_list *pl;
bool pl_was_allocated;
@@ -211,6 +214,10 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
if (prog_list_length(progs) >= BPF_CGROUP_MAX_PROGS)
return -E2BIG;
 
+   storage = bpf_cgroup_storage_alloc(prog);
+   if (IS_ERR(storage))
+   return -ENOMEM;
+
if (flags & BPF_F_ALLOW_MULTI) {
list_for_each_entry(pl, progs, node)
if (pl->prog == prog)
@@ -218,24 +225,33 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
return -EINVAL;
 
pl = kmalloc(sizeof(*pl), GFP_KERNEL);
-   if (!pl)
+   if (!pl) {
+   bpf_cgroup_storage_free(storage);
return -ENOMEM;
+   }
+
pl_was_allocated = true;
pl->prog = prog;
+   pl->storage = storage;
 		list_add_tail(&pl->node, progs);
} else {
if (list_empty(progs)) {
pl = kmalloc(sizeof(*pl), GFP_KERNEL);
-   if (!pl)
+   if (!pl) {
+   bpf_cgroup_storage_free(storage);
return -ENOMEM;
+   }
pl_was_allocated = true;
 			list_add_tail(&pl->node, progs);
} else {
pl = list_first_entry(progs, typeof(*pl), node);
old_prog = pl->prog;
+   old_storage = pl->storage;
+   bpf_cgroup_storage_unlink(old_storage);
pl_was_allocated = false;
}
pl->prog = prog;
+   pl->storage = storage;
}
 
cgrp->bpf.flags[type] = flags;
@@ -258,10 +274,13 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
}
 
 	static_branch_inc(&cgroup_bpf_enabled_key);
+	if (old_storage)
+		bpf_cgroup_storage_free(old_storage);
 	if (old_prog) {
 		bpf_prog_put(old_prog);
 		static_branch_dec(&cgroup_bpf_enabled_key);
}
+   bpf_cgroup_storage_link(storage, cgrp, type);
return 0;
 
 cleanup:
@@ -277,6 +296,9 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 
/* and cleanup the prog list */
pl->prog = old_prog;
+   bpf_cgroup_storage_free(pl->storage);
+   pl->storage = old_storage;
+   bpf_cgroup_storage_link(old_storage, cgrp, type);
if (pl_was_allocated) {
 		list_del(&pl->node);
kfree(pl);
@@ -357,6 +379,8 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
 
/* now can actually delete it from this cgroup list */
 	list_del(&pl->node);
+   bpf_cgroup_storage_unlink(pl->storage);
+   bpf_cgroup_storage_free(pl->storage);
kfree(pl);
if (list_empty(progs))
/* last program was detached, reset flags to zero */
-- 
2.14.4



[PATCH bpf-net 10/14] bpftool: add support for CGROUP_STORAGE maps

2018-06-28 Thread Roman Gushchin
Add BPF_MAP_TYPE_CGROUP_STORAGE maps to the list
of map types which bpftool recognizes.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Jakub Kicinski 
Acked-by: Martin KaFai Lau 
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 097b1a5e046b..154d258cdde3 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -67,6 +67,7 @@ static const char * const map_type_name[] = {
[BPF_MAP_TYPE_SOCKMAP]  = "sockmap",
[BPF_MAP_TYPE_CPUMAP]   = "cpumap",
[BPF_MAP_TYPE_SOCKHASH] = "sockhash",
+   [BPF_MAP_TYPE_CGROUP_STORAGE]   = "cgroup_storage",
 };
 
 static bool map_is_per_cpu(__u32 type)
-- 
2.14.4



[PATCH bpf-net 14/14] samples/bpf: extend test_cgrp2_attach2 test to use cgroup storage

2018-06-28 Thread Roman Gushchin
The test_cgrp2_attach test covers bpf cgroup attachment code well,
so let's re-use it for testing allocation/releasing of cgroup storage.

The extension is pretty straightforward: the bpf program will use
the cgroup storage to save the number of transmitted bytes.

Expected output:
  $ ./test_cgrp2_attach2
  Attached DROP prog. This ping in cgroup /foo should fail...
  ping: sendmsg: Operation not permitted
  Attached DROP prog. This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS prog. This ping in cgroup /foo/bar should pass...
  Detached PASS from /foo/bar while DROP is attached to /foo.
  This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS from /foo/bar and detached DROP from /foo.
  This ping in cgroup /foo/bar should pass...
  ### override:PASS
  ### multi:PASS

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 samples/bpf/test_cgrp2_attach2.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c
index b453e6a161be..f682e0b8aa83 100644
--- a/samples/bpf/test_cgrp2_attach2.c
+++ b/samples/bpf/test_cgrp2_attach2.c
@@ -8,7 +8,8 @@
  *   information. The number of invocations of the program, which maps
  *   to the number of packets received, is stored to key 0. Key 1 is
  *   incremented on each iteration by the number of bytes stored in
- *   the skb.
+ *   the skb. The program also stores the number of received bytes
+ *   in the cgroup storage.
  *
  * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
  *
@@ -21,12 +22,15 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #include 
 #include 
 
 #include "bpf_insn.h"
+#include "bpf_rlimit.h"
 #include "cgroup_helpers.h"
 
 #define FOO"/foo"
@@ -205,6 +209,8 @@ static int map_fd = -1;
 
 static int prog_load_cnt(int verdict, int val)
 {
+   int cgroup_storage_fd;
+
if (map_fd < 0)
map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, 4, 8, 1, 0);
if (map_fd < 0) {
@@ -212,6 +218,13 @@ static int prog_load_cnt(int verdict, int val)
return -1;
}
 
+   cgroup_storage_fd = bpf_create_map(BPF_MAP_TYPE_CGROUP_STORAGE,
+   sizeof(struct bpf_cgroup_storage_key), 8, 0, 0);
+   if (cgroup_storage_fd < 0) {
+   printf("failed to create map '%s'\n", strerror(errno));
+   return -1;
+   }
+
struct bpf_insn prog[] = {
BPF_MOV32_IMM(BPF_REG_0, 0),
 	BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
@@ -222,6 +235,11 @@ static int prog_load_cnt(int verdict, int val)
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
BPF_MOV64_IMM(BPF_REG_1, val), /* r1 = 1 */
 	BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+   BPF_LD_MAP_FD(BPF_REG_1, cgroup_storage_fd),
+   BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_local_storage),
+   BPF_MOV64_IMM(BPF_REG_1, val),
+	BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_W, BPF_REG_0, BPF_REG_1, 0, 0),
BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
BPF_EXIT_INSN(),
};
@@ -237,6 +255,7 @@ static int prog_load_cnt(int verdict, int val)
printf("Output from verifier:\n%s\n---\n", bpf_log_buf);
return 0;
}
+   close(cgroup_storage_fd);
return ret;
 }
 
@@ -414,6 +433,12 @@ static int test_multiprog(void)
 int main(int argc, char **argv)
 {
int rc = 0;
+   struct rlimit r = {1024*1024, RLIM_INFINITY};
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+   log_err("Setrlimit(RLIMIT_MEMLOCK) failed");
+   return 1;
+   }
 
rc = test_foo_bar();
if (rc)
-- 
2.14.4



[PATCH bpf-net 13/14] selftests/bpf: add a cgroup storage test

2018-06-28 Thread Roman Gushchin
Implement a test to cover the cgroup storage functionality.
The test implements a bpf program which drops every second packet
by using the cgroup storage as a persistent storage.

The test also uses the userspace API to check the data
in the cgroup storage, alter it, and check that the loaded
and attached bpf program sees the update.

Expected output:
  $ ./test_cgroup_storage
  test_cgroup_storage:PASS
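
For readers who prefer C to raw instructions, the loaded program is
roughly equivalent to the following sketch (illustrative only; the real
test below emits BPF instructions directly and patches the map fd into
the first instruction at load time):

	__u64 *counter = bpf_get_local_storage(&cgroup_storage_map, 0);

	*counter += 1;			/* persists across invocations */
	return *counter & 1;		/* 1 = pass, 0 = drop the packet */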

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/Makefile  |   4 +-
 tools/testing/selftests/bpf/test_cgroup_storage.c | 130 ++
 2 files changed, 133 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_cgroup_storage.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 7a6214e9ae58..81f38623fc9f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -22,7 +22,8 @@ $(TEST_CUSTOM_PROGS): $(OUTPUT)/%: %.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user
+	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
+	test_cgroup_storage
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -63,6 +64,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
+$(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_cgroup_storage.c b/tools/testing/selftests/bpf/test_cgroup_storage.c
new file mode 100644
index ..0597943ce34b
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_cgroup_storage.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <assert.h>
+#include <bpf/bpf.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "cgroup_helpers.h"
+
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
+#define TEST_CGROUP "/test-bpf-cgroup-storage-buf/"
+
+int main(int argc, char **argv)
+{
+   struct bpf_insn prog[] = {
+   BPF_LD_MAP_FD(BPF_REG_1, 0), /* map fd */
+   BPF_MOV64_IMM(BPF_REG_2, 0), /* flags, not used */
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_get_local_storage),
+   BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
+   BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0x1),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),
+   BPF_EXIT_INSN(),
+   };
+   size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
+   int error = EXIT_FAILURE;
+   int map_fd, prog_fd, cgroup_fd;
+   struct bpf_cgroup_storage_key key;
+   unsigned long long value;
+
+   map_fd = bpf_create_map(BPF_MAP_TYPE_CGROUP_STORAGE, sizeof(key),
+   sizeof(value), 0, 0);
+   if (map_fd < 0) {
+   printf("Failed to create map: %s\n", strerror(errno));
+   goto out;
+   }
+
+   prog[0].imm = map_fd;
+   prog_fd = bpf_load_program(BPF_PROG_TYPE_CGROUP_SKB,
+  prog, insns_cnt, "GPL", 0,
+  bpf_log_buf, BPF_LOG_BUF_SIZE);
+   if (prog_fd < 0) {
+   printf("Failed to load bpf program: %s\n", bpf_log_buf);
+   goto out;
+   }
+
+   if (setup_cgroup_environment()) {
+   printf("Failed to setup cgroup environment\n");
+   goto err;
+   }
+
+   /* Create a cgroup, get fd, and join it */
+   cgroup_fd = create_and_get_cgroup(TEST_CGROUP);
+   if (!cgroup_fd) {
+   printf("Failed to create test cgroup\n");
+   goto err;
+   }
+
+   if (join_cgroup(TEST_CGROUP)) {
+   printf("Failed to join cgroup\n");
+   goto err;
+   }
+
+   /* Attach the bpf program */
+   if (bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, 0)) {
+   printf("Failed to attach bpf program\n");
+   goto err;
+   }
+
+	if (bpf_map_get_next_key(map_fd, NULL, &key)) {
+   printf("Failed to get the first key in cgroup storage\n");
+   goto err;
+   }
+
+	if (bpf_map_lookup_elem(map_fd, &key, &value)) {
+   printf("Failed to lookup cgroup storage\n");
+   goto err;
+   }
+
+   /* Every second packet should be dropped */
+   assert(system("ping localhost -c 1 -W 1 -q > /dev/null") 

[PATCH bpf-net 08/14] bpf: introduce the bpf_get_local_storage() helper function

2018-06-28 Thread Roman Gushchin
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.
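
For illustration, usage from a program written in restricted C could look
like the sketch below (map name, section name and counter semantics are
made up; the macros are the ones used in samples/bpf):

	struct bpf_map_def SEC("maps") cgrp_storage = {
		.type = BPF_MAP_TYPE_CGROUP_STORAGE,
		.key_size = sizeof(struct bpf_cgroup_storage_key),
		.value_size = sizeof(__u64),
	};

	SEC("cgroup/skb")
	int bytes_seen(struct __sk_buff *skb)
	{
		__u64 *state = bpf_get_local_storage(&cgrp_storage, 0);

		__sync_fetch_and_add(state, skb->len);
		return 1;	/* allow the packet */
	}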

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h | 13 -
 kernel/bpf/cgroup.c  |  2 ++
 kernel/bpf/helpers.c | 20 
 kernel/bpf/verifier.c| 18 ++
 net/core/filter.c| 23 ++-
 6 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6d7e0dfc..1fdcf9d21b74 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -771,6 +771,8 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
 
+extern const struct bpf_func_proto bpf_get_local_storage_proto;
+
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7aa135e4c2f3..baf74db6c06e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2081,6 +2081,16 @@ union bpf_attr {
  * Return
  * A 64-bit integer containing the current cgroup id based
  * on the cgroup within which the current task is running.
+ *
+ * void* get_local_storage(void *map, u64 flags)
+ * Description
+ * Get the pointer to the local storage area.
+ * The type and the size of the local storage is defined
+ * by the *map* argument.
+ * The *flags* meaning is specific for each map type,
+ * and has to be 0 for cgroup local storage.
+ * Return
+ * Pointer to the local storage area.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2163,7 +2173,8 @@ union bpf_attr {
FN(rc_repeat),  \
FN(rc_keydown), \
FN(skb_cgroup_id),  \
-   FN(get_current_cgroup_id),
+   FN(get_current_cgroup_id),  \
+   FN(get_local_storage),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 14a1f6c94592..47d4519a6847 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -629,6 +629,8 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_map_delete_elem_proto;
 	case BPF_FUNC_get_current_uid_gid:
 		return &bpf_get_current_uid_gid_proto;
+	case BPF_FUNC_get_local_storage:
+		return &bpf_get_local_storage_proto;
case BPF_FUNC_trace_printk:
if (capable(CAP_SYS_ADMIN))
return bpf_get_trace_printk_proto();
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 73065e2d23c2..ca17b4ed3ac9 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -193,4 +193,24 @@ const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
.gpl_only   = false,
.ret_type   = RET_INTEGER,
 };
+
+DECLARE_PER_CPU(void*, bpf_cgroup_storage);
+
+BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
+{
+   /* map and flags arguments are not used now,
+* but provide an ability to extend the API
+* for other types of local storages.
+* verifier checks that their values are correct.
+*/
+   return (u64)this_cpu_read(bpf_cgroup_storage);
+}
+
+const struct bpf_func_proto bpf_get_local_storage_proto = {
+   .func   = bpf_get_local_storage,
+   .gpl_only   = false,
+   .ret_type   = RET_PTR_TO_MAP_VALUE,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+   .arg2_type  = ARG_ANYTHING,
+};
 #endif
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cc0c7990f849..a0f5c26fffc1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2127,6 +2127,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
func_id != BPF_FUNC_current_task_under_cgroup)
goto error;
break;
+   case BPF_MAP_TYPE_CGROUP_STORAGE:
+   if (func_id != BPF_FUNC_get_local_storage)
+   goto error;
+   break;
/* devmap returns a pointer to a live net_device ifindex that we cannot
 * allow to be modified from bpf side. So do not allow lookup elements
 * for now.
@@ -2209,6 +2213,10 @@ 

[PATCH bpf-net 07/14] bpf: don't allow create maps of cgroup local storages

2018-06-28 Thread Roman Gushchin
As there is a one-to-one relation between a bpf program
and its cgroup local storage map, there is no sense in
creating a map of cgroup local storage maps.

Forbid it explicitly to avoid possible side effects.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 kernel/bpf/map_in_map.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 1da574612bea..3bfbf4464416 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -23,7 +23,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 * is a runtime binding.  Doing static check alone
 * in the verifier is not enough.
 */
-   if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY) {
+   if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
+   inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE) {
fdput(f);
return ERR_PTR(-ENOTSUPP);
}
-- 
2.14.4



[PATCH bpf-net 11/14] bpf/test_run: support cgroup local storage

2018-06-28 Thread Roman Gushchin
Allocate a temporary cgroup storage to use for bpf program test runs.

Because the test program is not actually attached to a cgroup,
the storage is allocated manually just for the execution
of the bpf program.

If the program is executed multiple times, the storage is not zeroed
on each run, emulating multiple runs of a program attached to
a real cgroup.

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 net/bpf/test_run.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 68c3578343b4..74971a9b7cfb 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -11,12 +11,14 @@
 #include 
 #include 
 
-static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx)
+static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
+   struct bpf_cgroup_storage *storage)
 {
u32 ret;
 
preempt_disable();
rcu_read_lock();
+   bpf_cgroup_storage_set(storage);
ret = BPF_PROG_RUN(prog, ctx);
rcu_read_unlock();
preempt_enable();
@@ -26,14 +28,19 @@ static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx)
 
 static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
 {
+   struct bpf_cgroup_storage *storage = NULL;
u64 time_start, time_spent = 0;
u32 ret = 0, i;
 
+   storage = bpf_cgroup_storage_alloc(prog);
+   if (IS_ERR(storage))
+   return PTR_ERR(storage);
+
if (!repeat)
repeat = 1;
time_start = ktime_get_ns();
for (i = 0; i < repeat; i++) {
-   ret = bpf_test_run_one(prog, ctx);
+   ret = bpf_test_run_one(prog, ctx, storage);
if (need_resched()) {
if (signal_pending(current))
break;
@@ -46,6 +53,8 @@ static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
do_div(time_spent, repeat);
*time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
 
+   bpf_cgroup_storage_free(storage);
+
return ret;
 }
 
-- 
2.14.4



[PATCH bpf-net 03/14] bpf: pass a pointer to a cgroup storage using pcpu variable

2018-06-28 Thread Roman Gushchin
This commit introduces the bpf_cgroup_storage_set() helper,
which will be used to pass a pointer to a cgroup storage
to the bpf helper.
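
For context, the intended call sequence around program execution looks
like this (the actual call site is added by the test_run patch later in
this series):

	preempt_disable();		/* pin the CPU: pointer is per-cpu */
	rcu_read_lock();
	bpf_cgroup_storage_set(storage);
	ret = BPF_PROG_RUN(prog, ctx);	/* helper reads the pointer back
					 * via this_cpu_read()
					 */
	rcu_read_unlock();
	preempt_enable();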

Signed-off-by: Roman Gushchin 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
---
 include/linux/bpf-cgroup.h | 14 ++
 kernel/bpf/local_storage.c |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index b4e2e42c1d2a..128fb0e39b4d 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -20,6 +20,8 @@ struct bpf_cgroup_storage;
 extern struct static_key_false cgroup_bpf_enabled_key;
 #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
 
+DECLARE_PER_CPU(void*, bpf_cgroup_storage);
+
 struct bpf_cgroup_storage_map;
 
 struct bpf_storage_buffer {
@@ -96,6 +98,17 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
  short access, enum bpf_attach_type type);
 
+static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage *storage)
+{
+   struct bpf_storage_buffer *buf;
+
+   if (!storage)
+   return;
+
+   buf = rcu_dereference(storage->buf);
+	this_cpu_write(bpf_cgroup_storage, &buf->data[0]);
+}
+
 struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog);
 void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage);
 void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
@@ -223,6 +236,7 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage *storage) {}
 static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog,
struct bpf_map *map) { return 0; }
 static inline void bpf_cgroup_storage_release(struct bpf_prog *prog,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 940889eda2c7..38810a712971 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -7,6 +7,8 @@
 #include 
 #include 
 
+DEFINE_PER_CPU(void*, bpf_cgroup_storage);
+
 #ifdef CONFIG_CGROUP_BPF
 
 struct bpf_cgroup_storage_map {
-- 
2.14.4



[PATCH net-next 00/10] pnetid and SMC-D support

2018-06-28 Thread Ursula Braun
Dave,

SMC requires a configured pnet table to map Ethernet interfaces to
RoCE adapter ports. For s390 there exists hardware support to group
such devices. The first three patches cover the s390 pnetid support,
enabling SMC-R usage on s390 without configuring an extra pnet table.

SMC currently requires RoCE adapters, and uses RDMA-techniques
implemented with IB-verbs. But s390 offers another method for
intra-CEC Shared Memory communication. The following seven patches
implement a solution to run SMC traffic based on intra-CEC DMA,
called SMC-D.

Thanks, Ursula

Hans Wippel (6):
  net/smc: add base infrastructure for SMC-D and ISM
  net/smc: add pnetid support for SMC-D and ISM
  net/smc: add SMC-D support in CLC messages
  net/smc: add SMC-D support in data transfer
  net/smc: add SMC-D support in af_smc
  net/smc: add SMC-D diag support

Sebastian Ott (1):
  s390/ism: add device driver for internal shared memory

Ursula Braun (3):
  net/smc: determine port attributes independent from pnet table
  net/smc: add pnetid support
  net/smc: optimize consumer cursor updates

 drivers/s390/net/Kconfig  |  10 +
 drivers/s390/net/Makefile |   3 +
 drivers/s390/net/ism.h| 221 +++
 drivers/s390/net/ism_drv.c| 623 ++
 include/net/smc.h |  65 +
 include/uapi/linux/smc_diag.h |  10 +
 net/smc/Makefile  |   2 +-
 net/smc/af_smc.c  | 228 ++--
 net/smc/smc.h |   7 +-
 net/smc/smc_cdc.c |  86 +-
 net/smc/smc_cdc.h |  43 ++-
 net/smc/smc_clc.c | 193 +
 net/smc/smc_clc.h |  81 --
 net/smc/smc_core.c| 285 ++-
 net/smc/smc_core.h|  72 +++--
 net/smc/smc_diag.c|  18 +-
 net/smc/smc_ib.c  | 134 +
 net/smc/smc_ib.h  |   4 +-
 net/smc/smc_ism.c | 314 +
 net/smc/smc_ism.h |  48 
 net/smc/smc_pnet.c| 157 +--
 net/smc/smc_pnet.h|  16 ++
 net/smc/smc_rx.c  |   2 +-
 net/smc/smc_tx.c  | 205 +++---
 net/smc/smc_tx.h  |   2 +
 25 files changed, 2505 insertions(+), 324 deletions(-)
 create mode 100644 drivers/s390/net/ism.h
 create mode 100644 drivers/s390/net/ism_drv.c
 create mode 100644 net/smc/smc_ism.c
 create mode 100644 net/smc/smc_ism.h

-- 
2.16.4



[PATCH net-next 10/10] s390/ism: add device driver for internal shared memory

2018-06-28 Thread Ursula Braun
From: Sebastian Ott 

Add support for the Internal Shared Memory vPCI Adapter.
This driver implements the interfaces of the SMC-D protocol.
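
The driver talks to the adapter through fixed-layout request/response
blocks, defined as unions in ism.h below. As an illustration, querying
the local GID follows this pattern (a sketch; the actual submission path
is ism_cmd() in ism_drv.c):

	union ism_read_gid cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.request.hdr.cmd = ISM_READ_GID;
	cmd.request.hdr.len = sizeof(cmd.request);
	/* ... hand the block to the device ... */
	if (cmd.response.hdr.ret)
		return -EIO;			/* command failed */
	smcd->local_gid = cmd.response.gid;	/* struct smcd_dev */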

Signed-off-by: Sebastian Ott 
Signed-off-by: Ursula Braun 
---
 drivers/s390/net/Kconfig   |  10 +
 drivers/s390/net/Makefile  |   3 +
 drivers/s390/net/ism.h | 221 
 drivers/s390/net/ism_drv.c | 623 +
 4 files changed, 857 insertions(+)
 create mode 100644 drivers/s390/net/ism.h
 create mode 100644 drivers/s390/net/ism_drv.c

diff --git a/drivers/s390/net/Kconfig b/drivers/s390/net/Kconfig
index c7e484f70654..7c5a25ddf832 100644
--- a/drivers/s390/net/Kconfig
+++ b/drivers/s390/net/Kconfig
@@ -95,4 +95,14 @@ config CCWGROUP
tristate
default (LCS || CTCM || QETH)
 
+config ISM
+   tristate "Support for ISM vPCI Adapter"
+   depends on PCI && SMC
+   default n
+   help
+ Select this option if you want to use the Internal Shared Memory
+ vPCI Adapter.
+
+ To compile as a module choose M. The module name is ism.
+ If unsure, choose N.
 endmenu
diff --git a/drivers/s390/net/Makefile b/drivers/s390/net/Makefile
index 513b7ae64980..f2d6bbe57a6f 100644
--- a/drivers/s390/net/Makefile
+++ b/drivers/s390/net/Makefile
@@ -15,3 +15,6 @@ qeth_l2-y += qeth_l2_main.o qeth_l2_sys.o
 obj-$(CONFIG_QETH_L2) += qeth_l2.o
 qeth_l3-y += qeth_l3_main.o qeth_l3_sys.o
 obj-$(CONFIG_QETH_L3) += qeth_l3.o
+
+ism-y := ism_drv.o
+obj-$(CONFIG_ISM) += ism.o
diff --git a/drivers/s390/net/ism.h b/drivers/s390/net/ism.h
new file mode 100644
index ..0aab90817326
--- /dev/null
+++ b/drivers/s390/net/ism.h
@@ -0,0 +1,221 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef S390_ISM_H
+#define S390_ISM_H
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <linux/pci.h>
+#include <net/smc.h>
+
+#define UTIL_STR_LEN   16
+
+/*
+ * Do not use the first word of the DMB bits to ensure 8 byte aligned access.
+ */
+#define ISM_DMB_WORD_OFFSET1
+#define ISM_DMB_BIT_OFFSET (ISM_DMB_WORD_OFFSET * 32)
+#define ISM_NR_DMBS1920
+
+#define ISM_REG_SBA0x1
+#define ISM_REG_IEQ0x2
+#define ISM_READ_GID   0x3
+#define ISM_ADD_VLAN_ID0x4
+#define ISM_DEL_VLAN_ID0x5
+#define ISM_SET_VLAN   0x6
+#define ISM_RESET_VLAN 0x7
+#define ISM_QUERY_INFO 0x8
+#define ISM_QUERY_RGID 0x9
+#define ISM_REG_DMB0xA
+#define ISM_UNREG_DMB  0xB
+#define ISM_SIGNAL_IEQ 0xE
+#define ISM_UNREG_SBA  0x11
+#define ISM_UNREG_IEQ  0x12
+
+#define ISM_ERROR  0xFFFF
+
+struct ism_req_hdr {
+   u32 cmd;
+   u16 : 16;
+   u16 len;
+};
+
+struct ism_resp_hdr {
+   u32 cmd;
+   u16 ret;
+   u16 len;
+};
+
+union ism_reg_sba {
+   struct {
+   struct ism_req_hdr hdr;
+   u64 sba;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   } response;
+} __aligned(16);
+
+union ism_reg_ieq {
+   struct {
+   struct ism_req_hdr hdr;
+   u64 ieq;
+   u64 len;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   } response;
+} __aligned(16);
+
+union ism_read_gid {
+   struct {
+   struct ism_req_hdr hdr;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   u64 gid;
+   } response;
+} __aligned(16);
+
+union ism_qi {
+   struct {
+   struct ism_req_hdr hdr;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   u32 version;
+   u32 max_len;
+   u64 ism_state;
+   u64 my_gid;
+   u64 sba;
+   u64 ieq;
+   u32 ieq_len;
+   u32 : 32;
+   u32 dmbs_owned;
+   u32 dmbs_used;
+   u32 vlan_required;
+   u32 vlan_nr_ids;
+   u16 vlan_id[64];
+   } response;
+} __aligned(64);
+
+union ism_query_rgid {
+   struct {
+   struct ism_req_hdr hdr;
+   u64 rgid;
+   u32 vlan_valid;
+   u32 vlan_id;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   } response;
+} __aligned(16);
+
+union ism_reg_dmb {
+   struct {
+   struct ism_req_hdr hdr;
+   u64 dmb;
+   u32 dmb_len;
+   u32 sba_idx;
+   u32 vlan_valid;
+   u32 vlan_id;
+   u64 rgid;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   u64 dmb_tok;
+   } response;
+} __aligned(32);
+
+union ism_sig_ieq {
+   struct {
+   struct ism_req_hdr hdr;
+   u64 rgid;
+   u32 trigger_irq;
+   u32 event_code;
+   u64 info;
+   } request;
+   struct {
+   struct ism_resp_hdr hdr;
+   } response;
+} __aligned(32);
+
+union ism_unreg_dmb {
+   struct {
+   struct 

[PATCH net-next 07/10] net/smc: add SMC-D support in data transfer

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

The data transfer and CDC message headers differ in SMC-R and SMC-D.
This patch adds support for the SMC-D data transfer to the existing SMC
code. It consists of the following:

* SMC-D CDC support
* SMC-D tx support
* SMC-D rx support

The CDC header is stored at the beginning of the receive buffer. Thus, an
rx_offset variable is added for the CDC header offset within the buffer
(0 for SMC-R).
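
In other words, the payload pointer is derived as sketched below (this
mirrors the change to smc_cdc_handle_urg_data_arrival() further down):

	/* start of user data within the receive buffer */
	base = (char *)conn->rmb_desc->cpu_addr + conn->rx_off;
	/* rx_off is 0 for SMC-R; for SMC-D the CDC header occupies the
	 * first bytes of the buffer, so user data starts behind it
	 */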

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 net/smc/smc.h  |   5 ++
 net/smc/smc_cdc.c  |  86 +++-
 net/smc/smc_cdc.h  |  43 +++-
 net/smc/smc_core.c |  25 +--
 net/smc/smc_ism.c  |   8 +++
 net/smc/smc_rx.c   |   2 +-
 net/smc/smc_tx.c   | 193 +
 net/smc/smc_tx.h   |   2 +
 8 files changed, 308 insertions(+), 56 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 7c86f716a92e..8c6231011779 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -183,6 +183,11 @@ struct smc_connection {
spinlock_t  acurs_lock; /* protect cursors */
 #endif
struct work_struct  close_work; /* peer sent some closing */
+   struct tasklet_struct   rx_tsklet;  /* Receiver tasklet for SMC-D */
+   u8  rx_off; /* receive offset:
+* 0 for SMC-R, 32 for SMC-D
+*/
+   u64 peer_token; /* SMC-D token of peer */
 };
 
 struct smc_sock {  /* smc sock container */
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index a7e8d63fc8ae..621d8cca570b 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -117,7 +117,7 @@ int smc_cdc_msg_send(struct smc_connection *conn,
return rc;
 }
 
-int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
+static int smcr_cdc_get_slot_and_msg_send(struct smc_connection *conn)
 {
struct smc_cdc_tx_pend *pend;
struct smc_wr_buf *wr_buf;
@@ -130,6 +130,21 @@ int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
return smc_cdc_msg_send(conn, wr_buf, pend);
 }
 
+int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
+{
+   int rc;
+
+   if (conn->lgr->is_smcd) {
+		spin_lock_bh(&conn->send_lock);
+		rc = smcd_cdc_msg_send(conn);
+		spin_unlock_bh(&conn->send_lock);
+   } else {
+   rc = smcr_cdc_get_slot_and_msg_send(conn);
+   }
+
+   return rc;
+}
+
 static bool smc_cdc_tx_filter(struct smc_wr_tx_pend_priv *tx_pend,
  unsigned long data)
 {
@@ -157,6 +172,45 @@ void smc_cdc_tx_dismiss_slots(struct smc_connection *conn)
(unsigned long)conn);
 }
 
+/* Send a SMC-D CDC header.
+ * This increments the free space available in our send buffer.
+ * Also update the confirmed receive buffer with what was sent to the peer.
+ */
+int smcd_cdc_msg_send(struct smc_connection *conn)
+{
+   struct smc_sock *smc = container_of(conn, struct smc_sock, conn);
+   struct smcd_cdc_msg cdc;
+   int rc, diff;
+
+	memset(&cdc, 0, sizeof(cdc));
+	cdc.common.type = SMC_CDC_MSG_TYPE;
+	cdc.prod_wrap = conn->local_tx_ctrl.prod.wrap;
+	cdc.prod_count = conn->local_tx_ctrl.prod.count;
+
+	cdc.cons_wrap = conn->local_tx_ctrl.cons.wrap;
+	cdc.cons_count = conn->local_tx_ctrl.cons.count;
+	cdc.prod_flags = conn->local_tx_ctrl.prod_flags;
+	cdc.conn_state_flags = conn->local_tx_ctrl.conn_state_flags;
+	rc = smcd_tx_ism_write(conn, &cdc, sizeof(cdc), 0, 1);
+	if (rc)
+		return rc;
+	smc_curs_write(&conn->rx_curs_confirmed,
+		       smc_curs_read(&conn->local_tx_ctrl.cons, conn), conn);
+	/* Calculate transmitted data and increment free send buffer space */
+	diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+			     &conn->tx_curs_sent);
+	/* increased by confirmed number of bytes */
+	smp_mb__before_atomic();
+	atomic_add(diff, &conn->sndbuf_space);
+	/* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+	smp_mb__after_atomic();
+	smc_curs_write(&conn->tx_curs_fin,
+		       smc_curs_read(&conn->tx_curs_sent, conn), conn);
+
+   smc_tx_sndbuf_nonfull(smc);
+   return rc;
+}
+
 /* receive ***/
 
 static inline bool smc_cdc_before(u16 seq1, u16 seq2)
@@ -178,7 +232,7 @@ static void smc_cdc_handle_urg_data_arrival(struct smc_sock *smc,
 	if (!sock_flag(&smc->sk, SOCK_URGINLINE))
/* we'll skip the urgent byte, so don't account for it */
(*diff_prod)--;
-   base = (char *)conn->rmb_desc->cpu_addr;
+   base = (char *)conn->rmb_desc->cpu_addr + conn->rx_off;
if (conn->urg_curs.count)
conn->urg_rx_byte = *(base 

[PATCH net-next 08/10] net/smc: add SMC-D support in af_smc

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

This patch ties together the previous SMC-D patches. It adds support for
SMC-D to the listen and connect functions and, thus, enables SMC-D
support in the SMC code. If a connection supports both SMC-R and SMC-D,
SMC-D is preferred.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 net/smc/af_smc.c   | 216 -
 net/smc/smc_core.c |   2 +-
 net/smc/smc_core.h |   1 +
 3 files changed, 200 insertions(+), 19 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 20afa94be8bb..cbbb947dbfcf 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -35,6 +36,7 @@
 #include "smc_cdc.h"
 #include "smc_core.h"
 #include "smc_ib.h"
+#include "smc_ism.h"
 #include "smc_pnet.h"
 #include "smc_tx.h"
 #include "smc_rx.h"
@@ -372,8 +374,8 @@ static int smc_clnt_conf_first_link(struct smc_sock *smc)
return 0;
 }
 
-static void smc_conn_save_peer_info(struct smc_sock *smc,
-   struct smc_clc_msg_accept_confirm *clc)
+static void smcr_conn_save_peer_info(struct smc_sock *smc,
+struct smc_clc_msg_accept_confirm *clc)
 {
int bufsize = smc_uncompress_bufsize(clc->rmbe_size);
 
@@ -384,6 +386,28 @@ static void smc_conn_save_peer_info(struct smc_sock *smc,
smc->conn.tx_off = bufsize * (smc->conn.peer_rmbe_idx - 1);
 }
 
+static void smcd_conn_save_peer_info(struct smc_sock *smc,
+struct smc_clc_msg_accept_confirm *clc)
+{
+   int bufsize = smc_uncompress_bufsize(clc->dmbe_size);
+
+   smc->conn.peer_rmbe_idx = clc->dmbe_idx;
+   smc->conn.peer_token = clc->token;
+   /* msg header takes up space in the buffer */
+   smc->conn.peer_rmbe_size = bufsize - sizeof(struct smcd_cdc_msg);
+	atomic_set(&smc->conn.peer_rmbe_space, smc->conn.peer_rmbe_size);
+   smc->conn.tx_off = bufsize * smc->conn.peer_rmbe_idx;
+}
+
+static void smc_conn_save_peer_info(struct smc_sock *smc,
+   struct smc_clc_msg_accept_confirm *clc)
+{
+   if (smc->conn.lgr->is_smcd)
+   smcd_conn_save_peer_info(smc, clc);
+   else
+   smcr_conn_save_peer_info(smc, clc);
+}
+
 static void smc_link_save_peer_info(struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc)
 {
@@ -450,15 +474,51 @@ static int smc_check_rdma(struct smc_sock *smc, struct smc_ib_device **ibdev,
return reason_code;
 }
 
+/* check if there is an ISM device available for this connection. */
+/* called for connect and listen */
+static int smc_check_ism(struct smc_sock *smc, struct smcd_dev **ismdev)
+{
+   /* Find ISM device with same PNETID as connecting interface  */
+   smc_pnet_find_ism_resource(smc->clcsock->sk, ismdev);
+   if (!(*ismdev))
+   return SMC_CLC_DECL_CNFERR; /* configuration error */
+   return 0;
+}
+
+/* Check for VLAN ID and register it on ISM device just for CLC handshake */
+static int smc_connect_ism_vlan_setup(struct smc_sock *smc,
+ struct smcd_dev *ismdev,
+ unsigned short vlan_id)
+{
+   if (vlan_id && smc_ism_get_vlan(ismdev, vlan_id))
+   return SMC_CLC_DECL_CNFERR;
+   return 0;
+}
+
+/* cleanup temporary VLAN ID registration used for CLC handshake. If ISM is
+ * used, the VLAN ID will be registered again during the connection setup.
+ */
+static int smc_connect_ism_vlan_cleanup(struct smc_sock *smc, bool is_smcd,
+   struct smcd_dev *ismdev,
+   unsigned short vlan_id)
+{
+   if (!is_smcd)
+   return 0;
+   if (vlan_id && smc_ism_put_vlan(ismdev, vlan_id))
+   return SMC_CLC_DECL_CNFERR;
+   return 0;
+}
+
 /* CLC handshake during connect */
 static int smc_connect_clc(struct smc_sock *smc, int smc_type,
   struct smc_clc_msg_accept_confirm *aclc,
-  struct smc_ib_device *ibdev, u8 ibport)
+  struct smc_ib_device *ibdev, u8 ibport,
+  struct smcd_dev *ismdev)
 {
int rc = 0;
 
/* do inband token exchange */
-   rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, NULL);
+   rc = smc_clc_send_proposal(smc, smc_type, ibdev, ibport, ismdev);
if (rc)
return rc;
/* receive SMC Accept CLC message */
@@ -538,11 +598,50 @@ static int smc_connect_rdma(struct smc_sock *smc,
return 0;
 }
 
+/* setup for ISM connection of client */
+static int smc_connect_ism(struct smc_sock *smc,
+  struct smc_clc_msg_accept_confirm *aclc,
+  struct smcd_dev *ismdev)
+{
+   int 

Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-28 Thread Jakub Kicinski
On Thu, 28 Jun 2018 19:01:52 +0200, Jiri Benc wrote:
> On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote:
> > Hmm... in practice we could steal top bits of the size parameter for
> > some flags, since it seems to be limited to values < 256 today?  Is it
> > worth it?
> > 
> > It would look something along the lines of:  
> 
> Something like that, yes. I'll leave to Daniel to review how much sense
> it makes from the BPF side.

Can we take this as a follow up through the bpf-next tree or do you
want us to respin as part of this set?
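
For reference, the idea would be along these lines (names hypothetical,
not a proposed uapi; only the low byte carries the real size since option
sizes stay below 256 today):

	#define TUN_OPT_SIZE_MASK	0xffU
	#define TUN_OPT_FLAG_TYPE	(1U << 8)	/* example flag bit */

	static inline unsigned int tun_opt_size(unsigned int size_and_flags)
	{
		return size_and_flags & TUN_OPT_SIZE_MASK;
	}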


[PATCH net-next 01/10] net/smc: determine port attributes independent from pnet table

2018-06-28 Thread Ursula Braun
For SMC it is important to know the current port state of RoCE devices.
So far, monitoring of port states was triggered only when a RoCE device
was added to the pnet table. To support future alternatives to the pnet
table, the monitoring of ports is made independent of the pnet table's existence.
It starts once the smc_ib_device is established.

Due to this change smc_ib_remember_port_attr() is now a local function,
and moving it together with the functions it uses makes any forward
references obsolete.

And the duplicate SMC_MAX_PORTS definition is removed.

Signed-off-by: Ursula Braun 
---
 net/smc/smc.h  |   2 -
 net/smc/smc_ib.c   | 130 -
 net/smc/smc_ib.h   |   1 -
 net/smc/smc_pnet.c |   7 +--
 4 files changed, 72 insertions(+), 68 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 51ae1f10d81a..7c86f716a92e 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -21,8 +21,6 @@
 #define SMCPROTO_SMC   0   /* SMC protocol, IPv4 */
 #define SMCPROTO_SMC6  1   /* SMC protocol, IPv6 */
 
-#define SMC_MAX_PORTS  2   /* Max # of ports */
-
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
 
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0eed7ab9f28b..f8b159ced032 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -143,6 +143,62 @@ int smc_ib_ready_link(struct smc_link *lnk)
return rc;
 }
 
+static int smc_ib_fill_gid_and_mac(struct smc_ib_device *smcibdev, u8 ibport)
+{
+   struct ib_gid_attr gattr;
+   int rc;
+
+   rc = ib_query_gid(smcibdev->ibdev, ibport, 0,
+			  &smcibdev->gid[ibport - 1], &gattr);
+   if (rc || !gattr.ndev)
+   return -ENODEV;
+
+   memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
+   dev_put(gattr.ndev);
+   return 0;
+}
+
+/* Create an identifier unique for this instance of SMC-R.
+ * The MAC-address of the first active registered IB device
+ * plus a random 2-byte number is used to create this identifier.
+ * This name is delivered to the peer during connection initialization.
+ */
+static inline void smc_ib_define_local_systemid(struct smc_ib_device *smcibdev,
+   u8 ibport)
+{
+	memcpy(&local_systemid[2], &smcibdev->mac[ibport - 1],
+	       sizeof(smcibdev->mac[ibport - 1]));
+	get_random_bytes(&local_systemid[0], 2);
+}
+
+bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
+{
+   return smcibdev->pattr[ibport - 1].state == IB_PORT_ACTIVE;
+}
+
+static int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport)
+{
+   int rc;
+
+	memset(&smcibdev->pattr[ibport - 1], 0,
+	       sizeof(smcibdev->pattr[ibport - 1]));
+	rc = ib_query_port(smcibdev->ibdev, ibport,
+			   &smcibdev->pattr[ibport - 1]);
+   if (rc)
+   goto out;
+   /* the SMC protocol requires specification of the RoCE MAC address */
+   rc = smc_ib_fill_gid_and_mac(smcibdev, ibport);
+   if (rc)
+   goto out;
+   if (!strncmp(local_systemid, SMC_LOCAL_SYSTEMID_RESET,
+sizeof(local_systemid)) &&
+   smc_ib_port_active(smcibdev, ibport))
+   /* create unique system identifier */
+   smc_ib_define_local_systemid(smcibdev, ibport);
+out:
+   return rc;
+}
+
 /* process context wrapper for might_sleep smc_ib_remember_port_attr */
 static void smc_ib_port_event_work(struct work_struct *work)
 {
@@ -370,62 +426,6 @@ void smc_ib_buf_unmap_sg(struct smc_ib_device *smcibdev,
buf_slot->sgt[SMC_SINGLE_LINK].sgl->dma_address = 0;
 }
 
-static int smc_ib_fill_gid_and_mac(struct smc_ib_device *smcibdev, u8 ibport)
-{
-   struct ib_gid_attr gattr;
-   int rc;
-
-   rc = ib_query_gid(smcibdev->ibdev, ibport, 0,
-			  &smcibdev->gid[ibport - 1], &gattr);
-   if (rc || !gattr.ndev)
-   return -ENODEV;
-
-   memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
-   dev_put(gattr.ndev);
-   return 0;
-}
-
-/* Create an identifier unique for this instance of SMC-R.
- * The MAC-address of the first active registered IB device
- * plus a random 2-byte number is used to create this identifier.
- * This name is delivered to the peer during connection initialization.
- */
-static inline void smc_ib_define_local_systemid(struct smc_ib_device *smcibdev,
-   u8 ibport)
-{
-	memcpy(&local_systemid[2], &smcibdev->mac[ibport - 1],
-	       sizeof(smcibdev->mac[ibport - 1]));
-	get_random_bytes(&local_systemid[0], 2);
-}
-
-bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
-{
-   return smcibdev->pattr[ibport - 1].state == IB_PORT_ACTIVE;
-}
-
-int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport)
-{
-   int rc;
-
-	memset(&smcibdev->pattr[ibport - 1], 0,
-	       sizeof(smcibdev->pattr[ibport - 1]));
-  

[PATCH net-next 02/10] net/smc: add pnetid support

2018-06-28 Thread Ursula Braun
s390 hardware supports the definition of a so-called Physical NETwork
IDentifier (short PNETID) per network device port. These PNETIDs
can be used to identify network devices that are attached to the same
physical network (broadcast domain).

On s390, try to use the PNETID of the ethernet device port used for
the initial connection, and derive from it the IB device port to use for
SMC RDMA traffic.

On platforms without PNETID support fall back to the existing
solution of a configured pnet table.
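
Conceptually, the matching works as sketched below (illustrative;
smc_pnetid_by_dev_port() is introduced by this patch, the loop is not
literal kernel code):

	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
	int i;

	/* PNETID of the ethernet port used for the initial connection */
	smc_pnetid_by_dev_port(ndev->dev.parent, ndev->dev_port, ndev_pnetid);
	/* pick an IB port attached to the same physical network */
	for (i = 1; i <= SMC_MAX_PORTS; i++)
		if (!memcmp(smcibdev->pnetid[i - 1], ndev_pnetid,
			    SMC_MAX_PNETID_LEN))
			break;	/* use IB port i for SMC-R traffic */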

Signed-off-by: Ursula Braun 
---
 include/net/smc.h  |   2 +
 net/smc/smc_ib.c   |   6 ++-
 net/smc/smc_ib.h   |   3 ++
 net/smc/smc_pnet.c | 109 +++--
 net/smc/smc_pnet.h |  14 +++
 5 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/net/smc.h b/include/net/smc.h
index 8381d163fefa..2173932fab9d 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -11,6 +11,8 @@
 #ifndef _SMC_H
 #define _SMC_H
 
+#define SMC_MAX_PNETID_LEN 16  /* Max. length of PNET id */
+
 struct smc_hashinfo {
rwlock_t lock;
struct hlist_head ht;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index f8b159ced032..36de2fd76170 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -504,8 +504,12 @@ static void smc_ib_add_dev(struct ib_device *ibdev)
port_cnt = smcibdev->ibdev->phys_port_cnt;
for (i = 0;
 i < min_t(size_t, port_cnt, SMC_MAX_PORTS);
-i++)
+i++) {
		set_bit(i, &smcibdev->port_event_mask);
+   /* determine pnetids of the port */
+   smc_pnetid_by_dev_port(ibdev->dev.parent, i,
+  smcibdev->pnetid[i]);
+   }
	schedule_work(&smcibdev->port_event_work);
 }
 
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 2c480b352928..7c1223c91229 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -15,6 +15,7 @@
 #include <linux/interrupt.h>
 #include <linux/if_ether.h>
 #include <rdma/ib_verbs.h>
+#include <net/smc.h>
 
 #define SMC_MAX_PORTS  2   /* Max # of ports */
 #define SMC_GID_SIZE   sizeof(union ib_gid)
@@ -40,6 +41,8 @@ struct smc_ib_device {		/* ib-device infos for smc */
charmac[SMC_MAX_PORTS][ETH_ALEN];
/* mac address per port*/
union ib_gidgid[SMC_MAX_PORTS]; /* gid per port */
+   u8  pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
+   /* pnetid per port */
u8  initialized : 1; /* ib dev CQ, evthdl done */
struct work_struct  port_event_work;
unsigned long   port_event_mask;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index a82a5cad0282..cdc6e23b6ce1 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -23,12 +23,10 @@
 #include "smc_pnet.h"
 #include "smc_ib.h"
 
-#define SMC_MAX_PNET_ID_LEN16  /* Max. length of PNET id */
-
 static struct nla_policy smc_pnet_policy[SMC_PNETID_MAX + 1] = {
[SMC_PNETID_NAME] = {
.type = NLA_NUL_STRING,
-   .len = SMC_MAX_PNET_ID_LEN - 1
+   .len = SMC_MAX_PNETID_LEN - 1
},
[SMC_PNETID_ETHNAME] = {
.type = NLA_NUL_STRING,
@@ -65,7 +63,7 @@ static struct smc_pnettable {
  */
 struct smc_pnetentry {
struct list_head list;
-   char pnet_name[SMC_MAX_PNET_ID_LEN + 1];
+   char pnet_name[SMC_MAX_PNETID_LEN + 1];
struct net_device *ndev;
struct smc_ib_device *smcibdev;
u8 ib_port;
@@ -209,7 +207,7 @@ static bool smc_pnetid_valid(const char *pnet_name, char *pnetid)
return false;
while (--end >= bf && isspace(*end))
;
-   if (end - bf >= SMC_MAX_PNET_ID_LEN)
+   if (end - bf >= SMC_MAX_PNETID_LEN)
return false;
while (bf <= end) {
if (!isalnum(*bf))
@@ -512,26 +510,70 @@ void smc_pnet_exit(void)
	genl_unregister_family(&smc_pnet_nl_family);
 }
 
-/* PNET table analysis for a given sock:
- * determine ib_device and port belonging to used internal TCP socket
- * ethernet interface.
+/* Determine one base device for stacked net devices.
+ * If the lower device level contains more than one device
+ * (for instance with bonding slaves), just the first device
+ * is used to reach a base device.
  */
-void smc_pnet_find_roce_resource(struct sock *sk,
-struct smc_ib_device **smcibdev, u8 *ibport)
+static struct net_device *pnet_find_base_ndev(struct net_device *ndev)
 {
-   struct dst_entry *dst = sk_dst_get(sk);
-   struct smc_pnetentry *pnetelem;
+   int i, nest_lvl;
 
-   *smcibdev = NULL;
-   *ibport = 0;
+   rtnl_lock();
+   nest_lvl = dev_get_nest_level(ndev);
+   for (i = 0; i < nest_lvl; i++) {
+		struct list_head *lower = &ndev->adj_list.lower;
+
+   if 

[PATCH net-next 09/10] net/smc: add SMC-D diag support

2018-06-28 Thread Ursula Braun
From: Hans Wippel 

This patch adds diag support for SMC-D.

Signed-off-by: Hans Wippel 
Signed-off-by: Ursula Braun 
Suggested-by: Thomas Richter 
---
 include/uapi/linux/smc_diag.h | 10 ++
 net/smc/smc_diag.c| 15 +++
 2 files changed, 25 insertions(+)

diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
index 0ae5d4685ba3..92be255e534c 100644
--- a/include/uapi/linux/smc_diag.h
+++ b/include/uapi/linux/smc_diag.h
@@ -35,6 +35,7 @@ enum {
SMC_DIAG_CONNINFO,
SMC_DIAG_LGRINFO,
SMC_DIAG_SHUTDOWN,
+   SMC_DIAG_DMBINFO,
__SMC_DIAG_MAX,
 };
 
@@ -83,4 +84,13 @@ struct smc_diag_lgrinfo {
	struct smc_diag_linkinfo	lnk[1];
	__u8		role;
 };
+
+struct smcd_diag_dmbinfo { /* SMC-D Socket internals */
+   __u32 linkid;   /* Link identifier */
+   __u64 peer_gid; /* Peer GID */
+   __u64 my_gid;   /* My GID */
+   __u64 token;/* Token of DMB */
+   __u64 peer_token;   /* Token of remote DMBE */
+};
+
 #endif /* _UAPI_SMC_DIAG_H_ */
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index 64ce107c24d9..6d83eef1b743 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -156,6 +156,21 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
	if (nla_put(skb, SMC_DIAG_LGRINFO, sizeof(linfo), &linfo) < 0)
goto errout;
}
+   if (smc->conn.lgr && smc->conn.lgr->is_smcd &&
+   (req->diag_ext & (1 << (SMC_DIAG_DMBINFO - 1))) &&
+	    !list_empty(&smc->conn.lgr->list)) {
+		struct smc_connection *conn = &smc->conn;
+   struct smcd_diag_dmbinfo dinfo = {
+   .linkid = *((u32 *)conn->lgr->id),
+   .peer_gid = conn->lgr->peer_gid,
+   .my_gid = conn->lgr->smcd->local_gid,
+   .token = conn->rmb_desc->token,
+   .peer_token = conn->peer_token
+   };
+
+		if (nla_put(skb, SMC_DIAG_DMBINFO, sizeof(dinfo), &dinfo) < 0)
+   goto errout;
+   }
 
nlmsg_end(skb, nlh);
return 0;
-- 
2.16.4



Re: [PATCH] test_bpf: flag tests that cannot be jited on s390

2018-06-28 Thread Song Liu
On Wed, Jun 27, 2018 at 8:19 AM, Kleber Sacilotto de Souza
 wrote:
> Flag with FLAG_EXPECTED_FAIL the BPF_MAXINSNS tests that cannot be jited
> on s390 because they exceed BPF_SIZE_MAX and fail when
> CONFIG_BPF_JIT_ALWAYS_ON is set. Also set .expected_errcode to -ENOTSUPP
> so the tests pass in that case.
>
> Signed-off-by: Kleber Sacilotto de Souza 

Acked-by: Song Liu 

> ---
>  lib/test_bpf.c | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> index 60aedc879361..08d3d59dca17 100644
> --- a/lib/test_bpf.c
> +++ b/lib/test_bpf.c
> @@ -5282,21 +5282,31 @@ static struct bpf_test tests[] = {
> {   /* Mainly checking JIT here. */
> "BPF_MAXINSNS: Ctx heavy transformations",
> { },
> +#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
> +   CLASSIC | FLAG_EXPECTED_FAIL,
> +#else
> CLASSIC,
> +#endif
> { },
> {
> {  1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) },
> { 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }
> },
> .fill_helper = bpf_fill_maxinsns6,
> +   .expected_errcode = -ENOTSUPP,
> },
> {   /* Mainly checking JIT here. */
> "BPF_MAXINSNS: Call heavy transformations",
> { },
> +#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
> +   CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL,
> +#else
> CLASSIC | FLAG_NO_DATA,
> +#endif
> { },
> { { 1, 0 }, { 10, 0 } },
> .fill_helper = bpf_fill_maxinsns7,
> +   .expected_errcode = -ENOTSUPP,
> },
> {   /* Mainly checking JIT here. */
> "BPF_MAXINSNS: Jump heavy test",
> @@ -5347,18 +5357,28 @@ static struct bpf_test tests[] = {
> {
> "BPF_MAXINSNS: exec all MSH",
> { },
> +#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
> +   CLASSIC | FLAG_EXPECTED_FAIL,
> +#else
> CLASSIC,
> +#endif
> { 0xfa, 0xfb, 0xfc, 0xfd, },
> { { 4, 0xababab83 } },
> .fill_helper = bpf_fill_maxinsns13,
> +   .expected_errcode = -ENOTSUPP,
> },
> {
> "BPF_MAXINSNS: ld_abs+get_processor_id",
> { },
> +#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
> +   CLASSIC | FLAG_EXPECTED_FAIL,
> +#else
> CLASSIC,
> +#endif
> { },
> { { 1, 0xbee } },
> .fill_helper = bpf_fill_ld_abs_get_processor_id,
> +   .expected_errcode = -ENOTSUPP,
> },
> /*
>  * LD_IND / LD_ABS on fragmented SKBs
> --
> 2.17.1
>


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 4:37 PM Al Viro  wrote:
>
> You underestimate the nastiness of that thing (and for the record, I'm
> a lot *less* fond of AIO than you are, what with having had to read that
> nest of horrors lately).  It does not "copy the data to userland"; what it
> does instead is copying into an array of pages it keeps, right from
> IO completion callback.  I

Ugh.

Oh well. I'd be perfectly happy to have somebody re-write and
re-architect the aio code entirely.  Much rather that than the
->poll() code. Because I know which one I think is well-designed with a
nice usable interface, and which one is a pile of shit.

In the meantime, if AIO wants to do poll() in some irq callback, may I
suggest just using workqueues.

Linus


Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-28 Thread Lawrence Brakmo


On 6/28/18, 1:48 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo  wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>Host A sends the last packets of the request
>
>    Host B receives them, and the last packet is marked with congestion (CE)
>
>Host B sends ACKs for packets not marked with congestion
>
>    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
>
>Host A receives ACKs with no ECE flag
>
>    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
>
>Host A sends 1st data packet of the next request with TCP flag CWR
>
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>Host B sends a dup ACK that also has the TCP ECE flag
>
>Host A RTO timer fires!
>
>Host A to send the next packet
>
>    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
>
>    Host A sends more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.
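
From memory, the CE-state transition logic I mean looks roughly like
this (a sketch of net/ipv4/tcp_dctcp.c as of this writing, not a
verbatim copy):

/* CE=0 -> CE=1 transition: if a delayed ACK is outstanding, first ACK
 * the pre-transition data at the old rcv_nxt, then restore rcv_nxt.
 * The "synthetic" tcp_send_ack() below is the call in question.
 */
static void dctcp_ce_state_0_to_1(struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	if (!ca->ce_state && ca->delayed_ack_reserved) {
		u32 tmp_rcv_nxt = tp->rcv_nxt;

		/* Generate an ACK carrying the old rcv_nxt, then restore it. */
		tp->rcv_nxt = ca->prior_rcv_nxt;
		tcp_send_ack(sk);
		tp->rcv_nxt = tmp_rcv_nxt;
	}

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->ce_state = 1;
	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}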

neal

Here is a packetdrill script that reproduces the problem:

// Repro bug that does not ack data, not even with delayed-ack

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect0] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect0] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect0] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
+0  > [ect0] E. 4:4(0) ack 4501   // dup ack sent

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // Long RTO
+0 > [ect0] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257




Re: [PATCH bpf-net 08/14] bpf: introduce the bpf_get_local_storage() helper function

2018-06-28 Thread kbuild test robot
Hi Roman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[also build test ERROR on v4.18-rc2]
[cannot apply to next-20180628]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Roman-Gushchin/bpf-cgroup-local-storage/20180629-035104
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: um-x86_64_defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=um SUBARCH=x86_64

All errors (new ones prefixed by >>):

   net/core/filter.o: In function `cg_skb_func_proto':
   filter.c:(.text+0x58a7): undefined reference to `bpf_get_local_storage_proto'
   net/core/filter.o: In function `sock_filter_func_proto':
   filter.c:(.text+0x5b3d): undefined reference to `bpf_get_local_storage_proto'
   net/core/filter.o: In function `sock_ops_func_proto':
   filter.c:(.text+0x5b9d): undefined reference to `bpf_get_local_storage_proto'
   net/core/filter.o: In function `sk_skb_func_proto':
   filter.c:(.text+0x5c60): undefined reference to `bpf_get_local_storage_proto'
   net/core/filter.o: In function `sk_msg_func_proto':
   filter.c:(.text+0x5cc3): undefined reference to `bpf_get_local_storage_proto'
   net/core/filter.o:filter.c:(.text+0x5ee6): more undefined references to 
`bpf_get_local_storage_proto' follow
>> collect2: error: ld returned 1 exit status

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 03:55:35PM -0700, Linus Torvalds wrote:
> > You are misreading that mess.  What he's trying to do (other than surviving
> > the awful clusterfuck around cancels) is to handle the decision what to
> > report to userland right in the wakeup callback.  *That* is what really
> > drives the "make the second-pass ->poll() or something similar to it
> > non-blocking" (in addition to the fact that it is such in considerable
> > majority of instances).
> 
> That's just crazy BS.
> 
> Just call poll() again when you copy the data to userland (which by
> definition can block, again).
> 
> Stop the idiotic "let's break poll for stupid AIO reasons, because the
> AIO people are morons".

You underestimate the nastiness of that thing (and for the record, I'm
a lot *less* fond of AIO than you are, what with having had to read that
nest of horrors lately).  It does not "copy the data to userland"; what it
does instead is copying into an array of pages it keeps, right from
IO completion callback.  In read/write case.  This
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

event->obj = (u64)(unsigned long)iocb->ki_user_iocb;
event->data = iocb->ki_user_data;
event->res = res;
event->res2 = res2;

kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
is what does the copying.  And that might be done from IRQ context.
Yes, really.

They do have a slightly saner syscall that does copying from the damn
ring buffer, but its use is optional - userland can (and does) direct
read access to mmapped buffer.

Single-consumer ABIs suck and AIO is one such...

It could do schedule_work() and do blocking stuff from that - does so, in
case if it can't grab ->ctx_lock.  Earlier iteration used to try doing
everything straight from wakeup callback, and *that* was racy as hell;
I'd rather have Christoph explain which races he'd been referring to,
but there had been a whole lot of that.  Solution I suggested in the
last round of that was to offload __aio_poll_complete() via schedule_work()
both for cancel and poll wakeup cases.  Doing the common case right
from poll wakeup callback was argued to avoid noticable overhead in
common situation - that's what "aio: try to complete poll iocbs without
context switch" is about.  I'm more than slightly unhappy about the
lack of performance regression testing in non-AIO case...

At that point I would really like to see replies from Christoph - he's
on CET usually, no idea what his effective timezone is...


Re: [PATCH v2 net-next 0/6] net sched actions: code style cleanup and fixes

2018-06-28 Thread David Miller
From: Roman Mashak 
Date: Wed, 27 Jun 2018 13:33:29 -0400

> The patchset fixes a few code stylistic issues and typos, as well as one
> detected by sparse semantic checker tool.
> 
> No functional changes introduced.
> 
> Patch 1 & 2 fix coding style bits caught by the checkpatch.pl script
> Patch 3 fixes an issue with a shadowed variable
> Patch 4 adds sizeof() operator instead of magic number for buffer length
> Patch 5 fixes typos in diagnostics messages
> Patch 6 explicitly sets unsigned char for bitwise operation
> 
> v2:
>- submit for net-next
>- added Reviewed-by tags
>- use u8* instead of char* as per Davide Caratti suggestion

Series applied.


[patch iproute2/net-next v2] tc: introduce support for chain templates

2018-06-28 Thread Jiri Pirko
From: Jiri Pirko 

Signed-off-by: Jiri Pirko 
---
v1->v2:
- moved the template handling
  from "tc filter template" to "tc chaintemplate"
---
 include/uapi/linux/rtnetlink.h |   7 +++
 man/man8/tc.8  |  26 ++
 tc/tc.c|   5 +-
 tc/tc_common.h |   1 +
 tc/tc_filter.c | 106 ++---
 tc/tc_monitor.c|   5 +-
 6 files changed, 121 insertions(+), 29 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index c3a7d8ecc7b9..dddb05e5cca8 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -150,6 +150,13 @@ enum {
RTM_NEWCACHEREPORT = 96,
 #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
 
+   RTM_NEWCHAINTMPLT = 100,
+#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT
+   RTM_DELCHAINTMPLT,
+#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT
+   RTM_GETCHAINTMPLT,
+#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT
+
__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fbdba6..3eee1aceaae4 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -58,6 +58,22 @@ tc \- show / manipulate traffic control settings
 .B flowid
 \fIflow-id\fR
 
+.B tc
+.RI "[ " OPTIONS " ]"
+.B chaintemplate [ add | delete | get ] dev
+\fIDEV\fR
+.B [ parent
+\fIqdisc-id\fR
+.B | root ]\fR filtertype
+[ filtertype specific parameters ]
+
+.B tc
+.RI "[ " OPTIONS " ]"
+.B chaintemplate [ add | delete | get ] block
+\fIBLOCK_INDEX\fR filtertype
+[ filtertype specific parameters ]
+
+
 .B tc
 .RI "[ " OPTIONS " ]"
 .RI "[ " FORMAT " ]"
@@ -80,6 +96,16 @@ tc \- show / manipulate traffic control settings
 .RI "[ " OPTIONS " ]"
 .B filter show block
 \fIBLOCK_INDEX\fR
+.P
+.B tc
+.RI "[ " OPTIONS " ]"
+.B chaintemplate show dev
+\fIDEV\fR
+.P
+.B tc
+.RI "[ " OPTIONS " ]"
+.B chaintemplate show block
+\fIBLOCK_INDEX\fR
 
 .P
 .B tc
diff --git a/tc/tc.c b/tc/tc.c
index 0d223281ba25..8a0592c45800 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -197,7 +197,8 @@ static void usage(void)
fprintf(stderr,
"Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }\n"
"   tc [-force] -batch filename\n"
-   "where  OBJECT := { qdisc | class | filter | action | monitor | 
exec }\n"
+   "where  OBJECT := { qdisc | class | filter | chaintemplate |\n"
+   "   action | monitor | exec }\n"
"   OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | 
-r[aw] |\n"
"-o[neline] | -j[son] | -p[retty] | 
-c[olor]\n"
"-b[atch] [filename] | -n[etns] name |\n"
@@ -212,6 +213,8 @@ static int do_cmd(int argc, char **argv, void *buf, size_t buflen)
return do_class(argc-1, argv+1);
if (matches(*argv, "filter") == 0)
return do_filter(argc-1, argv+1, buf, buflen);
+   if (matches(*argv, "chaintemplate") == 0)
+   return do_chaintmplt(argc-1, argv+1, buf, buflen);
if (matches(*argv, "actions") == 0)
return do_action(argc-1, argv+1, buf, buflen);
if (matches(*argv, "monitor") == 0)
diff --git a/tc/tc_common.h b/tc/tc_common.h
index 49c24616c2c3..16cefe896109 100644
--- a/tc/tc_common.h
+++ b/tc/tc_common.h
@@ -8,6 +8,7 @@ extern struct rtnl_handle rth;
 extern int do_qdisc(int argc, char **argv);
 extern int do_class(int argc, char **argv);
 extern int do_filter(int argc, char **argv, void *buf, size_t buflen);
+extern int do_chaintmplt(int argc, char **argv, void *buf, size_t buflen);
 extern int do_action(int argc, char **argv, void *buf, size_t buflen);
 extern int do_tcmonitor(int argc, char **argv);
 extern int do_exec(int argc, char **argv);
diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index c5bb0bffe19b..df0ce853fbcc 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -39,12 +39,21 @@ static void usage(void)
"\n"
"   tc filter show [ dev STRING ] [ root | ingress | egress 
| parent CLASSID ]\n"
"   tc filter show [ block BLOCK_INDEX ]\n"
+   "   tc chaintemplate [ add | del | get | show ] [ dev STRING ]\n"
+   "   tc chaintemplate [ add | del | get | show ] [ block BLOCK_INDEX ] ]\n"
"Where:\n"
"FILTER_TYPE := { rsvp | u32 | bpf | fw | route | etc. }\n"
"FILTERID := ... format depends on classifier, see there\n"
"OPTIONS := ... try tc filter add  
help\n");
 }
 
+static void chaintmplt_usage(void)
+{
+   fprintf(stderr,
+   "Usage: tc chaintemplate [ add | del | get | show ] [ dev STRING ]\n"
+   "   tc chaintemplate [ add | del | get | show ] [ block BLOCK_INDEX ] ]\n"
+}
+
 struct tc_filter_req {
struct nlmsghdr n;
	struct tcmsg		t;

Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Jamal Hadi Salim



On 26/06/18 03:59 AM, Jiri Pirko wrote:

From: Jiri Pirko 

For the TC clsact offload these days, some of HW drivers need
to hold a magic ball. The reason is, with the first inserted rule inside
HW they need to guess what fields will be used for the matching. If
later on this guess proves to be wrong and user adds a filter with a
different field to match, there's a problem. Mlxsw resolves it now with
couple of patterns. Those try to cover as many match fields as possible.
This approach is far from optimal, both performance-wise and scale-wise.
Also, there is a combination of filters that in certain order won't
succeed.


>

Most of the time, when user inserts filters in chain, he knows right away
how the filters are going to look like - what type and option will they
have. For example, he knows that he will only insert filters of type
flower matching destination IP address. He can specify a template that
would cover all the filters in the chain.



Is this just restricted to hardware offload? For example, it would make
sense for u32 in s/ware as well (i.e. flexible, TCAM-like
classification). I.e. it is possible that the rules the user enters
end up being, worst case, a linked-list lookup, yes? And allocating
space for a tuple that is not in use is a waste of space.

If yes, then I would reword the above as something like:

For very flexible classifiers such as TCAM-based ones,
one could add arbitrary tuple rules, which tend to be inefficient in
both space and lookup performance. One approach, taken by Mlxsw,
is to assume a multi-filter tuple arrangement, which is inefficient
from a space perspective when the user-specified rules don't make
use of the pre-provisioned tuple space.
Typically users already know which tuples are of interest to them:
for example, for ipv4 route lookup purposes they may just want to
look up the destination IP with a specified mask, etc.
This feature allows the user to provide good hints to the classifier to
optimize.



This patchset is providing the possibility to user to provide such
template  to kernel and propagate it all the way down to device
drivers.

See the examples below.

Create dummy device with clsact first:
# ip link add type dummy
# tc qdisc add dev dummy0 clsact

There is no template assigned by default:
# tc filter template show dev dummy0 ingress

Add a template of type flower allowing to insert rules matching on last
2 bytes of destination mac address:
# tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

The template is now showed in the list:
# tc filter template show dev dummy0 ingress
filter flower chain 0
   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
   eth_type ipv4

Add another template, this time for chain number 22:
# tc filter template add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
# tc filter template show dev dummy0 ingress
filter flower chain 0
   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
   eth_type ipv4
filter flower chain 22
   eth_type ipv4
   dst_ip 0.0.0.0/16

Add a filter that fits the template:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop

Addition of filters that does not fit the template would fail:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Additions of filters to chain 22:
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Removal of a template from non-empty chain would fail:
# tc filter template del dev dummy0 ingress
Error: The chain is not empty, unable to delete template.
We have an error talking to the kernel, -1

Once the chain is flushed, the template could be removed:
# tc filter del dev dummy0 ingress
# tc filter template del dev dummy0 ingress



BTW: unlike the other comments on this - I think the syntax above
is fine ;-> Chains are already specified either explicitly or implicitly
(case of chain 0).

Assuming that one can't add a new template to a chain if it already
has at least one filter (even if no template has been added).

I like it - it may help make u32 more friendly to humans in some
cases.

cheers,
jamal


[PATCH v3 net-next 0/5] Fixes coding style in xilinx_emaclite.c

2018-06-28 Thread Radhey Shyam Pandey
This patchset fixes checkpatch and kernel-doc warnings in
xilinx emaclite driver. No functional change.

Changes from v2:
-In 2/5 patch refactor if-else to make failure path return early.
-In 2/5 patch coalesce the format onto a single line and add the
missing space after the comma.

Radhey Shyam Pandey (5):
  net: emaclite: Use __func__ instead of hardcoded name
  net: emaclite: Simplify if-else statements
  net: emaclite: update kernel-doc comments
  net: emaclite: Fix block comments style
  net: emaclite: Remove unnecessary spaces

 drivers/net/ethernet/xilinx/xilinx_emaclite.c |  112 ++---
 1 files changed, 64 insertions(+), 48 deletions(-)



Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Jamal Hadi Salim

On 28/06/18 09:22 AM, Jiri Pirko wrote:

Thu, Jun 28, 2018 at 03:13:30PM CEST, j...@mojatatu.com wrote:


On 26/06/18 03:59 AM, Jiri Pirko wrote:

From: Jiri Pirko 

For the TC clsact offload these days, some of HW drivers need
to hold a magic ball. The reason is, with the first inserted rule inside
HW they need to guess what fields will be used for the matching. If
later on this guess proves to be wrong and user adds a filter with a
different field to match, there's a problem. Mlxsw resolves it now with
couple of patterns. Those try to cover as many match fields as possible.
This approach is far from optimal, both performance-wise and scale-wise.
Also, there is a combination of filters that in certain order won't
succeed.


Most of the time, when user inserts filters in chain, he knows right away
how the filters are going to look like - what type and option will they
have. For example, he knows that he will only insert filters of type
flower matching destination IP address. He can specify a template that
would cover all the filters in the chain.



Is this just restricted to hardware offload? For example, it would make
sense for u32 in s/ware as well (i.e. flexible, TCAM-like
classification). I.e. it is possible that the rules the user enters
end up being, worst case, a linked-list lookup, yes? And allocating
space for a tuple that is not in use is a waste of space.


I'm afraid I don't understand clearly what you say.


Well - I was trying to understand what you said ;->

I think what you are getting at is two issues:
a) space in the tcams - if the user is just going to enter
rules which use one tuple (dst ip for example) the hardware
would be better off told that this is the case so it doesn't
allocate space in anticipation that someone is going to
specify src ip later on.
b) lookup speed in tcams - without the template hint a
selection of rules may end up looking like a linked list
which is not optimal for lookup


This is not
restricted to hw offload. The templates apply to all filters, no matter
if they are offloaded or not.



Do you save anything with flower (in s/w) if you only added a template
with, say, dst ip/mask? I can see it will make sense for u32, which is
more flexible and protocol independent.

cheers,
jamal


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread David Ahern
On 6/28/18 7:08 AM, Jiri Pirko wrote:
> Create dummy device with clsact first:
> # ip link add type dummy
> # tc qdisc add dev dummy0 clsact
> 
> There is no template assigned by default:
> # tc chaintemplate show dev dummy0 ingress
> 
> Add a template of type flower allowing to insert rules matching on last
> 2 bytes of destination mac address:
> # tc chaintemplate add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
> 
> The template is now showed in the list:
> # tc chaintemplate show dev dummy0 ingress
> chaintemplate flower chain 0 
>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>   eth_type ipv4
> 
> Add another template, this time for chain number 22:
> # tc chaintemplate add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
> # tc chaintemplate show dev dummy0 ingress
> chaintemplate flower chain 0 
>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>   eth_type ipv4
> chaintemplate flower chain 22 
>   eth_type ipv4
>   dst_ip 0.0.0.0/16
> 
> Add a filter that fits the template:
> # tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop
> 
> Addition of filters that does not fit the template would fail:
> # tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop
> Error: Mask does not fit the template.
> We have an error talking to the kernel, -1
> # tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop
> Error: Mask does not fit the template.
> We have an error talking to the kernel, -1
> 
> Additions of filters to chain 22:
> # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop
> # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop
> Error: Mask does not fit the template.
> We have an error talking to the kernel, -1
> # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop
> Error: Mask does not fit the template.
> We have an error talking to the kernel, -1
> 
> Removal of a template from non-empty chain would fail:
> # tc chaintemplate del dev dummy0 ingress
> Error: The chain is not empty, unable to delete template.
> We have an error talking to the kernel, -1

Why this restriction? It's a template, so why can't it be removed
regardless of whether there are filters?

> 
> Once the chain is flushed, the template could be removed:
> # tc filter del dev dummy0 ingress
> # tc chaintemplate del dev dummy0 ingress
> 



Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Jiri Pirko
Thu, Jun 28, 2018 at 03:54:17PM CEST, j...@mojatatu.com wrote:
>On 28/06/18 09:22 AM, Jiri Pirko wrote:
>> Thu, Jun 28, 2018 at 03:13:30PM CEST, j...@mojatatu.com wrote:
>> > 
>> > On 26/06/18 03:59 AM, Jiri Pirko wrote:
>> > > From: Jiri Pirko 
>> > > 
>> > > For the TC clsact offload these days, some of HW drivers need
>> > > to hold a magic ball. The reason is, with the first inserted rule inside
>> > > HW they need to guess what fields will be used for the matching. If
>> > > later on this guess proves to be wrong and user adds a filter with a
>> > > different field to match, there's a problem. Mlxsw resolves it now with
>> > > couple of patterns. Those try to cover as many match fields as possible.
>> > > This approach is far from optimal, both performance-wise and scale-wise.
>> > > Also, there is a combination of filters that in certain order won't
>> > > succeed.
>> > > 
>> > > 
>> > > Most of the time, when user inserts filters in chain, he knows right away
>> > > how the filters are going to look like - what type and option will they
>> > > have. For example, he knows that he will only insert filters of type
>> > > flower matching destination IP address. He can specify a template that
>> > > would cover all the filters in the chain.
>> > > 
>> > 
>> > Is this just restricted to hardware offload? For example, it would make
>> > sense for u32 in s/ware as well (i.e. flexible, TCAM-like
>> > classification). I.e. it is possible that the rules the user enters
>> > end up being, worst case, a linked-list lookup, yes? And allocating
>> > space for a tuple that is not in use is a waste of space.
>> 
>> I'm afraid I don't understand clearly what you say.
>
>Well - I was trying to understand what you said ;->
>
>I think what you are getting at is two issues:
>a) space in the tcams - if the user is just going to enter
>rules which use one tuple (dst ip for example) the hardware
>would be better off told that this is the case so it doesn't
>allocate space in anticipation that someone is going to
>specify src ip later on.

Yes.

>b) lookup speed in tcams - without the template hint a
>selection of rules may end up looking like a linked list
>which is not optimal for lookup

Well. Not really, but wider keys have bigger overheads in general. So
the motivation is to have the keys as small as possible for both
performance and capacity reasons.

>
>> This is not
>> restricted to hw offload. The templates apply to all filters, no matter
>> if they are offloaded or not.
>> 
>
>Do you save anything with flower(in s/w) if you only added a template
>with say dst ip/mask? I can see it will make sense for u32 which is more
>flexible and protocol independent.

No benefit for the flower s/w path at this point. Perhaps the hashtables
could be organized in a more optimal way with the hint. I didn't look at
it.

>
>cheers,
>jamal


Re: [PATCH v1 net-next 13/14] net/sched: Enforce usage of CLOCK_TAI for sch_etf

2018-06-28 Thread Willem de Bruijn
On Wed, Jun 27, 2018 at 8:45 PM Jesus Sanchez-Palencia
 wrote:
>
> The qdisc and the SO_TXTIME ABIs allow for a clockid to be configured,
> but it's been decided that usage of CLOCK_TAI should be enforced until
> we decide to allow for other clockids to be used. The rationale here is
> that PTP times are usually in the TAI scale, thus no other clocks should
> be necessary.
>
> For now, the qdisc will return EINVAL if any clocks other than
> CLOCK_TAI are used.
>
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  net/sched/sch_etf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
> index cd6cb5b69228..5514a8aa3bd5 100644
> --- a/net/sched/sch_etf.c
> +++ b/net/sched/sch_etf.c
> @@ -56,8 +56,8 @@ static inline int validate_input_params(struct tc_etf_qopt *qopt,
> return -ENOTSUPP;
> }
>
> -   if (qopt->clockid >= MAX_CLOCKS) {
> -   NL_SET_ERR_MSG(extack, "Invalid clockid");
> +   if (qopt->clockid != CLOCK_TAI) {
> +   NL_SET_ERR_MSG(extack, "Invalid clockid. CLOCK_TAI must be used");

Similar to the comment in patch 12, this should be squashed (into
patch 6) to avoid incorrect behavior in a range of SHA1s.


Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-06-28 Thread Willem de Bruijn
On Wed, Jun 27, 2018 at 6:07 PM Jesus Sanchez-Palencia
 wrote:
>
> Use the socket error queue for reporting dropped packets if the
> socket has enabled that feature through the SO_TXTIME API.
>
> Packets are dropped either on enqueue() if they aren't accepted by the
> qdisc or on dequeue() if the system misses their deadline. Those are
> reported as different errors so applications can react accordingly.
>
> Userspace can retrieve the errors through the socket error queue and the
> corresponding cmsg interfaces. A struct sock_extended_err* is used for
> returning the error data, and the packet's timestamp can be retrieved by
> adding both ee_data and ee_info fields as e.g.:
>
> ((__u64) serr->ee_data << 32) + serr->ee_info
>
> This feature is disabled by default and must be explicitly enabled by
> applications. Enabling it can bring some overhead for the Tx cycles
> of the application.
>
> Signed-off-by: Jesus Sanchez-Palencia 
> ---

>  struct sock_txtime {
> 	clockid_t	clockid;	/* reference clockid */
> -   u16 flags;  /* bit 0: txtime in deadline_mode */
> +   u16 flags;  /* bit 0: txtime in deadline_mode
> +* bit 1: report drops on sk err queue
> +*/
>  };

If this is shared with userspace, it should be defined in a uapi header.
Same on the flag bits below. Self-documenting code is preferable over
comments.

>  /*
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 73f4404e49e4..e681a45cfe7e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -473,6 +473,7 @@ struct sock {
> u16 sk_clockid;
> u16 sk_txtime_flags;
>  #define SK_TXTIME_DEADLINE_MASK	BIT(0)
> +#define SK_TXTIME_RECV_ERR_MASK	BIT(1)

Integer bitfields are (arguably) more readable. There is no requirement
that the user interface be the same as the in-kernel implementation. Indeed
if you can save bits in struct sock, that is preferable (but not so for the ABI,
which cannot easily be extended).

>
> struct socket   *sk_socket;
> void*sk_user_data;
> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index dc64cfaf13da..66fd5e443c94 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -25,6 +25,8 @@ struct sock_extended_err {
>  #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
>
>  #define SO_EE_CODE_ZEROCOPY_COPIED 1
> +#define SO_EE_CODE_TXTIME_INVALID_PARAM	2
> +#define SO_EE_CODE_TXTIME_MISSED   3
>
>  /**
>   * struct scm_timestamping - timestamps exposed through cmsg
> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
> index 5514a8aa3bd5..166f4b72875b 100644
> --- a/net/sched/sch_etf.c
> +++ b/net/sched/sch_etf.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/errqueue.h>
>  #include 
>  #include 
>  #include 
> @@ -124,6 +125,35 @@ static void reset_watchdog(struct Qdisc *sch)
> 	qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
>  }
>
> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
> +{
> +   struct sock_exterr_skb *serr;
> +   ktime_t txtime = skb->tstamp;
> +
> +   if (!skb->sk || !(skb->sk->sk_txtime_flags & SK_TXTIME_RECV_ERR_MASK))
> +   return;
> +
> +   skb = skb_clone_sk(skb);
> +   if (!skb)
> +   return;
> +
> +   sock_hold(skb->sk);

Why take an extra reference? The skb holds a ref on the sk.

> +
> +   serr = SKB_EXT_ERR(skb);
> +   serr->ee.ee_errno = err;
> +   serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;

I suggest adding a new SO_EE_ORIGIN_TXTIME as opposed to overloading
the existing local origin. Then the EE_CODE can start at 1, as ee_code
can be demultiplexed by origin.

> +   serr->ee.ee_type = 0;
> +   serr->ee.ee_code = code;
> +   serr->ee.ee_pad = 0;
> +   serr->ee.ee_data = (txtime >> 32); /* high part of tstamp */
> +   serr->ee.ee_info = txtime; /* low part of tstamp */
> +
> +   if (sock_queue_err_skb(skb->sk, skb))
> +   kfree_skb(skb);
> +
> +   sock_put(skb->sk);
> +}
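
For reference, a minimal userspace sketch of draining these reports
(hypothetical code; it assumes the origin/code values as posted in this
patch, and real code should also check cmsg_level/cmsg_type, e.g.
SOL_IP/IP_RECVERR for IPv4):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/errqueue.h>

static void drain_txtime_errors(int fd)
{
	char data[256], control[256];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cm;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

		if (serr->ee_origin != SO_EE_ORIGIN_LOCAL)
			continue;
		/* ee_code tells invalid-param apart from a missed deadline */
		printf("txtime drop: code %u txtime %llu\n", serr->ee_code,
		       ((unsigned long long)serr->ee_data << 32) + serr->ee_info);
	}
}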


Re: [PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-28 Thread Willem de Bruijn
On Thu, Jun 28, 2018 at 10:26 AM Willem de Bruijn
 wrote:
>
> On Wed, Jun 27, 2018 at 6:08 PM Jesus Sanchez-Palencia
>  wrote:
> >
> > From: Richard Cochran 
> >
> > This patch introduces SO_TXTIME. User space enables this option in
> > order to pass a desired future transmit time in a CMSG when calling
> > sendmsg(2). The argument to this socket option is a 6-bytes long struct
> > defined as:
> >
> > struct sock_txtime {
> > clockid_t   clockid;
> > u16 flags;
> > };
>
> clockid_t is __kernel_clockid_t is int is a variable length field.
> Please use fixed length fields.

Sorry, int is fine, of course, and clockid_t is used between userspace and
kernel already.

> Also, as MAX_CLOCKS is 16, only 4 bits are needed. A single u16 is
> probably sufficient as cmsg argument. To future-proof, a u32 will allow
> for more than 4 flags. But in struct sock, 16 bits should be sufficient
> to encode both clockid and flags.
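
Concretely, something like this would fit (a sketch; the field names
are hypothetical, not what the patch uses):

/* clockid needs only 4 bits (MAX_CLOCKS == 16), so it can share one
 * 16-bit word with the flag bits. */
struct sk_txtime_state {
	u16	clockid : 4;
	u16	deadline_mode : 1;
	u16	report_errors : 1;
	/* 10 bits spare for future flags */
};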


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 11:17 AM Al Viro  wrote:
>
> As for what can be salvaged out of the whole mess,
> * probably the most serious lesson is that INDIRECT CALLS ARE
> COSTLY NOW and shouldn't be used lightly.

Note that this has always been true, it's just _way_ more obvious now.

Indirect calls are costly not just because of the nasty 20+ cycle cost
of the stupid spectre overhead (which hopefully will be a thing of the
past in a few years as people upgrade), but because they are a pretty
basic barrier to optimization, both for compilers but also just for
_people_.

Look at a lot of vfs optimization we've done, like all the name
hashing optimization etc. We basically fall flat on our face if a
filesystem implements its own name hash function, not _just_ because
of the cost of the indirect function call, but because it suddenly
means that the filesystem is doing its own thing and all the clever
work we did to integrate name hashing with copying the name no longer
works.

So I really want to avoid indirect calls. And when they *do* happen, I
want to avoid the model where people think of them as low-level object
accessor functions - the C++ disease. I want indirect function calls
to make sense at a higher level, and do some real operation.

End result: I really despised the new poll() model. Yes, the
performance report was what made me *notice*, but then when I looked
at the code I went "No". Using an indirect call as some low-level
accessor function really is fundamentally wrong. Don't do it. It's
badly designed.

Our VFS operations are _high-level_ operations, where we do one single
call for a whole "read()" operation. "->poll()" used to be the same.
The new "->get_poll_head()" and "->poll_mask()" was just bad, bad,
bad.

> * having an optional pointer to wait_queue_head in struct file
> is probably a good idea; a lot of ->poll() instances do have the same
> form.  Even if sockets do not (and I'm not all that happy about the
> solution in the latest series), the instances that do are common and
> important enough.

Right. I don't hate the poll wait-queue pointer. That said, I do hope
that we can simply write things so as to not even need it.

> * a *LOT* of ->poll() instances only block in __pollwait()
> called (indirectly) on the first pass.

They are *all* supposed to do it.

The whole idea with "->poll()" is that the model of operation is:

 -  add_wait_queue() and return state on the first pass

 - on subsequent passes (or if somebody else already returned a state
that means we already know we're not going to wait), the poll table is
NULL, so you *CANNOT* add_wait_queue again, so you just return state.

Additionally, once _anybody_ has returned an "I already have the
event", we also clear the poll table queue, so subsequent accesses
will purely be for returning the poll state.

So I don't understand why you're asking for annotation. The whole "you
add yourself to the poll table" is *fundamentally* only done on the
first pass. You should never do it for later passes.
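
For the record, the canonical shape (a sketch with made-up names, not
any particular driver):

struct foo_dev {
	wait_queue_head_t waitq;
	/* ... */
};

static __poll_t foo_poll(struct file *file, poll_table *wait)
{
	struct foo_dev *dev = file->private_data;
	__poll_t mask = 0;

	/* arms the wait queue on the first pass only; with a NULL
	 * table (later passes) this is a no-op */
	poll_wait(file, &dev->waitq, wait);

	if (foo_data_ready(dev))		/* made-up state helpers */
		mask |= EPOLLIN | EPOLLRDNORM;
	if (foo_space_free(dev))
		mask |= EPOLLOUT | EPOLLWRNORM;
	return mask;
}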

> How much do you intend to revert?  Untangling just the ->poll()-related
> parts from the rest of changes in fs/aio.c will be unpleasant; we might
> end up reverting the whole tail of the series and redoing the things
> that are not poll-related on top of the revert... ;-/

I pushed out my revert. It was fairly straightforward, it just
reverted all the poll_mask/get_poll_head changes, and the aio code
that depended on them.

Btw, I really don't understand the "aio has a race". The old poll()
interface was fundamentally race-free. There simply *is* no way to
race on it, exactly because of the whole "add yourself to the wait
queue first, then ask for state afterwards" model.  The model may be
_odd_, but it has literally worked well for a quarter century exactly
because it's really simple and fundamentally cannot have races.

So I think it's the aio code that needs fixing, not the polling code.

I do want that explanation for why AIO is _so_ special that it can
introduce a race in poll().

Because I suspect it's not so special, and it's just buggy. Maybe
Christoph didn't understand the two-phase model (how you call ->poll()
_twice_ - first to add yourself to the queue, later to check status).
Or maybe AIO interfaces are just shit (again!) and don't work right.

   Linus


Re: [PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests

2018-06-28 Thread Keara Leibovitz
>> Until I'm able to submit everything, I'd be OK with having Keara add
>> the non-zero exit codes to the teardown on her tests.  In the meantime
>> we'll get the README updated and config file added as well.
>>
>> How does this sound?
>
> it sounds good to me, but at this point we can also leave the code of
> tunnel_key as-is. there are many other items failing in this script:
>
> for act in $ACT; do
> while IFS=':' read -r id _ ; do modprobe -r act_${act} ; sleep 1 ; [ -n "$id" ] && ./tdc.py -p /home/davide/iproute2/tc/tc -e $id ; done < `./tdc.py -l | grep ${act}`
> EOF
> done
>
> So, it's ok for me if they are fixed all together in a series, and I
> volunteer for testing it when they land on netdev list.

Hi Davide,

I will add the non-zero exit codes and resubmit version 2 tomorrow.

Keara


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Jiri Pirko
Oh, this is v3 already. The changelog should be:

---
v2->v3:
- patch 5:
  - rebase on top of the reoffload pathset
- patch 6:
  - rebase on top of the reoffload pathset
- patch 9:
  - adjust to the userspace cmdline changes
v1->v2:
- patch 6:
  - remove leftover extack arg in fl_hw_create_tmplt()


Re: [PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-28 Thread Willem de Bruijn
On Wed, Jun 27, 2018 at 6:08 PM Jesus Sanchez-Palencia
 wrote:
>
> From: Richard Cochran 
>
> This patch introduces SO_TXTIME. User space enables this option in
> order to pass a desired future transmit time in a CMSG when calling
> sendmsg(2). The argument to this socket option is a 6-bytes long struct
> defined as:
>
> struct sock_txtime {
> clockid_t   clockid;
> u16 flags;
> };

clockid_t is __kernel_clockid_t is int is a variable length field.
Please use fixed length fields. Also, as MAX_CLOCKS is 16, only 4 bits
are needed. A single u16 is probably sufficient as cmsg argument. To
future-proof, a u32 will allow for more than 4 flags. But in struct
sock, 16 bits should be sufficient to encode both clockid and flags.

> Note that two new fields were added to struct sock by filling a 4-bytes
> hole found in the struct. For that reason, neither the struct size or
> number of cachelines were altered.
>
> Signed-off-by: Richard Cochran 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---

> +#include 
>  #include 
>  #include 
>  #include 
> @@ -697,6 +698,7 @@ EXPORT_SYMBOL(sk_mc_loop);
>  int sock_setsockopt(struct socket *sock, int level, int optname,
> char __user *optval, unsigned int optlen)
>  {
> +   struct sock_txtime sk_txtime;
> struct sock *sk = sock->sk;
> int val;
> int valbool;
> @@ -1070,6 +1072,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
> }
> break;
>
> +   case SO_TXTIME:
> +   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
> +   ret = -EPERM;
> +   } else if (optlen != sizeof(struct sock_txtime)) {
> +   ret = -EINVAL;
> +   } else if (copy_from_user(&sk_txtime, optval,
> +  sizeof(struct sock_txtime))) {
> +   ret = -EFAULT;
> +   sock_valbool_flag(sk, SOCK_TXTIME, false);

Why change sk state on failure? This is not customary.

> +   } else {
> +   sock_valbool_flag(sk, SOCK_TXTIME, true);
> +   sk->sk_clockid = sk_txtime.clockid;
> +   sk->sk_txtime_flags = sk_txtime.flags;

Validate input and fail on undefined flags.

> @@ -2137,6 +2162,13 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
> sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
> sockc->tsflags |= tsflags;
> break;
> +   case SCM_TXTIME:
> +   if (!sock_flag(sk, SOCK_TXTIME))
> +   return -EINVAL;

Note that on lockfree datapaths like udp this test can race with the
setsockopt above.
It seems harmless here.

> +   if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
> +   return -EINVAL;
> +   sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
> +   break;
> /* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
> case SCM_RIGHTS:
> case SCM_CREDENTIALS:
> --
> 2.17.1
>
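
For context, the userspace call sequence this enables would look
roughly like this (a sketch against the uapi quoted above; SO_TXTIME /
SCM_TXTIME and struct sock_txtime come from the patched headers, and
error handling is elided):

#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

static ssize_t send_at(int fd, const void *buf, size_t len, __u64 txtime_ns)
{
	struct sock_txtime sk_txtime = { .clockid = CLOCK_TAI, .flags = 0 };
	char control[CMSG_SPACE(sizeof(__u64))] = { 0 };
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { 0 };
	struct cmsghdr *cm;

	/* one-time setup per socket in real code */
	setsockopt(fd, SOL_SOCKET, SO_TXTIME, &sk_txtime, sizeof(sk_txtime));

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(__u64));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

	return sendmsg(fd, &msg, 0);
}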


Re: [PATCH v1 net-next 03/14] net: ipv4: Hook into time based transmission

2018-06-28 Thread Willem de Bruijn
On Wed, Jun 27, 2018 at 6:07 PM Jesus Sanchez-Palencia
 wrote:
>
> Add a transmit_time field to struct inet_cork, then copy the
> timestamp from the CMSG cookie at ip_setup_cork() so we can
> safely copy it into the skb later during __ip_make_skb().
>
> For the raw fast path, just perform the copy at raw_send_hdrinc().
>
> Signed-off-by: Richard Cochran 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  include/net/inet_sock.h | 1 +
>  net/ipv4/ip_output.c| 3 +++
>  net/ipv4/raw.c  | 2 ++
>  net/ipv4/udp.c  | 1 +

Also support the feature for ipv6

>  4 files changed, 7 insertions(+)
>
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index 83d5b3c2ac42..314be484c696 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -148,6 +148,7 @@ struct inet_cork {
> __s16   tos;
> 	char			priority;
> __u16   gso_size;
> +   u64 transmit_time;
>  };
>
>  struct inet_cork_full {
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index b3308e9d9762..904a54a090e9 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1153,6 +1153,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
> cork->tos = ipc->tos;
> cork->priority = ipc->priority;
> cork->tx_flags = ipc->tx_flags;
> +   cork->transmit_time = ipc->sockc.transmit_time;

Initialize ipc->sockc.transmit_time in all possible paths to avoid bugs like the
one fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp").
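
I.e. something like this at every ipc setup site (a sketch; the tsflags
line is from memory of the current udp_sendmsg):

	ipc.sockc.tsflags = sk->sk_tsflags;
	ipc.sockc.transmit_time = 0;	/* don't leave the new field uninitialized */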

> return 0;
>  }
> @@ -1413,6 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
>
> skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
> skb->mark = sk->sk_mark;
> +   skb->tstamp = cork->transmit_time;
> /*
>  * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
>  * on dst refcount
> @@ -1495,6 +1497,7 @@ struct sk_buff *ip_make_skb(struct sock *sk,
> cork->flags = 0;
> cork->addr = 0;
> cork->opt = NULL;
> +   cork->transmit_time = 0;

Not needed when unconditionally overwriting the field in ip_setup_cork.

> err = ip_setup_cork(sk, cork, ipc, rtp);
> if (err)
> return ERR_PTR(err);


Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-28 Thread Alexei Starovoitov
On Wed, Jun 27, 2018 at 11:17:09PM -0700, Andrew Morton wrote:
> On Wed, 20 Jun 2018 16:04:34 +0200 Matteo Croce  wrote:
> 
> > Use $(OBJDUMP) instead of literal 'objdump' to avoid
> > using host toolchain when cross compiling.
> > 
> 
> I'm still having issues here, with ld.
> 
> x86_64 machine, ARCH=i386:
> 
> y:/usr/src/25> make V=1 M=net/bpfilter
> test -e include/generated/autoconf.h -a -e include/config/auto.conf || (  
>  \
> echo >&2;   \
> echo >&2 "  ERROR: Kernel configuration is invalid.";   \
> echo >&2 " include/generated/autoconf.h or include/config/auto.conf 
> are missing.";\
> echo >&2 " Run 'make oldconfig && make prepare' on kernel src to fix 
> it.";  \
> echo >&2 ;  \
> /bin/false)
> mkdir -p net/bpfilter/.tmp_versions ; rm -f net/bpfilter/.tmp_versions/*
> make -f ./scripts/Makefile.build obj=net/bpfilter
> (cat /dev/null;   echo kernel/net/bpfilter/bpfilter.ko;) > 
> net/bpfilter/modules.order
>   ld -m elf_i386   -r -o net/bpfilter/bpfilter.o net/bpfilter/bpfilter_kern.o 
> net/bpfilter/bpfilter_umh.o ; scripts/mod/modpost net/bpfilter/bpfilter.o
> ld: i386:x86-64 architecture of input file `net/bpfilter/bpfilter_umh.o' is 
> incompatible with i386 output

could you please try with this patch
https://patchwork.ozlabs.org/patch/935246/
that is already in net tree?



Re: [PATCH net] net: fib_rules: add protocol check in rule_find

2018-06-28 Thread David Ahern
On 6/27/18 7:27 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
> delrule msgs into fib_nl2rule"), rule_find is strict about checking
> for an existing rule. rule_find must check against all
> user given attributes, else it may match against a subset
> of attributes and return an existing rule.
> 
> In the below case, without support for protocol match, rule_find
> will match only against 'table main' and return an existing rule.
> 
> $ip -4 rule add table main protocol boot
> RTNETLINK answers: File exists
> 
> This patch adds protocol support to rule_find, forcing it to
> check protocol match if given by the user.
> 
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs 
> into fib_nl2rule")
> Signed-off-by: Roopa Prabhu 
> ---

Reviewed-by: David Ahern 





[PATCH net-next] net: phy: realtek: add support for RTL8211

2018-06-28 Thread Heiner Kallweit
In preparation for adding phylib support to the r8169 driver, we need
PHY drivers for all chip-internal PHY types. Fortunately, almost all
of them are either supported by the Realtek PHY driver already or work
with the genphy driver.
Still missing is support for the PHY of the RTL8169s; it requires a quirk
to properly support 100Mbit-fixed mode. The quirk was copied from the
r8169 driver, which copied it from the vendor driver.
Based on the PHYID, the internal PHY seems to be an RTL8211.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/realtek.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 082fb40c..9757b162 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -128,6 +128,28 @@ static int rtl8211f_config_intr(struct phy_device *phydev)
return phy_write_paged(phydev, 0xa42, RTL821x_INER, val);
 }
 
+static int rtl8211_config_aneg(struct phy_device *phydev)
+{
+   int ret;
+
+   ret = genphy_config_aneg(phydev);
+   if (ret < 0)
+   return ret;
+
+   /* Quirk was copied from vendor driver. Unfortunately it includes no
+* description of the magic numbers.
+*/
+   if (phydev->speed == SPEED_100 && phydev->autoneg == AUTONEG_DISABLE) {
+   phy_write(phydev, 0x17, 0x2138);
+   phy_write(phydev, 0x0e, 0x0260);
+   } else {
+   phy_write(phydev, 0x17, 0x2108);
+		phy_write(phydev, 0x0e, 0x0000);
+   }
+
+   return 0;
+}
+
 static int rtl8211f_config_init(struct phy_device *phydev)
 {
int ret;
@@ -178,6 +200,14 @@ static struct phy_driver realtek_drvs[] = {
.resume = genphy_resume,
.read_page  = rtl821x_read_page,
.write_page = rtl821x_write_page,
+   }, {
+   .phy_id = 0x001cc910,
+   .name   = "RTL8211 Gigabit Ethernet",
+   .phy_id_mask= 0x001f,
+   .features   = PHY_GBIT_FEATURES,
+   .config_aneg= rtl8211_config_aneg,
+   .read_mmd   = _read_mmd_unsupported,
+   .write_mmd  = _write_mmd_unsupported,
}, {
.phy_id = 0x001cc912,
.name   = "RTL8211B Gigabit Ethernet",
-- 
2.18.0



[PATCH net-next] r8169: use standard debug output functions

2018-06-28 Thread Heiner Kallweit
I see no need to define a private debug output symbol; let's use the
standard debug output functions instead. In this context, also remove
the deprecated PFX define.

The one assertion is wrong IMO anyway, as this code path is also used
by chip version 01.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 36 ++--
 1 file changed, 12 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 70c13cc2..21ffaf10 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -34,7 +34,6 @@
 
 #define RTL8169_VERSION "2.3LK-NAPI"
 #define MODULENAME "r8169"
-#define PFX MODULENAME ": "
 
 #define FIRMWARE_8168D_1   "rtl_nic/rtl8168d-1.fw"
 #define FIRMWARE_8168D_2   "rtl_nic/rtl8168d-2.fw"
@@ -56,19 +55,6 @@
 #define FIRMWARE_8107E_1   "rtl_nic/rtl8107e-1.fw"
 #define FIRMWARE_8107E_2   "rtl_nic/rtl8107e-2.fw"
 
-#ifdef RTL8169_DEBUG
-#define assert(expr) \
-   if (!(expr)) {  \
-   printk( "Assertion failed! %s,%s,%s,line=%d\n", \
-   #expr,__FILE__,__func__,__LINE__);  \
-   }
-#define dprintk(fmt, args...) \
-   do { printk(KERN_DEBUG PFX fmt, ## args); } while (0)
-#else
-#define assert(expr) do {} while (0)
-#define dprintk(fmt, args...)  do {} while (0)
-#endif /* RTL8169_DEBUG */
-
 #define R8169_MSG_DEFAULT \
(NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN)
 
@@ -2552,7 +2538,7 @@ static void rtl8169_get_mac_version(struct 
rtl8169_private *tp,
 
 static void rtl8169_print_mac_version(struct rtl8169_private *tp)
 {
-   dprintk("mac_version = 0x%02x\n", tp->mac_version);
+   netif_dbg(tp, drv, tp->dev, "mac_version = 0x%02x\n", tp->mac_version);
 }
 
 struct phy_reg {
@@ -4409,8 +4395,6 @@ static void rtl_phy_work(struct rtl8169_private *tp)
struct timer_list *timer = &tp->timer;
unsigned long timeout = RTL8169_PHY_TIMEOUT;
 
-   assert(tp->mac_version > RTL_GIGA_MAC_VER_01);
-
if (tp->phy_reset_pending(tp)) {
/*
 * A busy loop could burn quite a few cycles on nowadays CPU.
@@ -4467,7 +4451,8 @@ static void rtl8169_init_phy(struct net_device *dev, 
struct rtl8169_private *tp)
rtl_hw_phy_config(dev);
 
if (tp->mac_version <= RTL_GIGA_MAC_VER_06) {
-   dprintk("Set MAC Reg C+CR Offset 0x82h = 0x01h\n");
+   netif_dbg(tp, drv, dev,
+ "Set MAC Reg C+CR Offset 0x82h = 0x01h\n");
RTL_W8(tp, 0x82, 0x01);
}
 
@@ -4477,9 +4462,11 @@ static void rtl8169_init_phy(struct net_device *dev, 
struct rtl8169_private *tp)
pci_write_config_byte(tp->pci_dev, PCI_CACHE_LINE_SIZE, 0x08);
 
if (tp->mac_version == RTL_GIGA_MAC_VER_02) {
-   dprintk("Set MAC Reg C+CR Offset 0x82h = 0x01h\n");
+   netif_dbg(tp, drv, dev,
+ "Set MAC Reg C+CR Offset 0x82h = 0x01h\n");
RTL_W8(tp, 0x82, 0x01);
-   dprintk("Set PHY Reg 0x0bh = 0x00h\n");
+   netif_dbg(tp, drv, dev,
+ "Set PHY Reg 0x0bh = 0x00h\n");
rtl_writephy(tp, 0x0b, 0x0000); //w 0x0b 15 0 0
}
 
@@ -5171,8 +5158,8 @@ static void rtl_hw_start_8169(struct rtl8169_private *tp)
 
if (tp->mac_version == RTL_GIGA_MAC_VER_02 ||
tp->mac_version == RTL_GIGA_MAC_VER_03) {
-   dprintk("Set MAC Reg C+CR Offset 0xe0. "
-   "Bit-3 and bit-14 MUST be 1\n");
+   netif_dbg(tp, drv, tp->dev,
+ "Set MAC Reg C+CR Offset 0xe0. Bit 3 and Bit 14 MUST 
be 1\n");
tp->cp_cmd |= (1 << 14);
}
 
@@ -6017,8 +6004,9 @@ static void rtl_hw_start_8168(struct rtl8169_private *tp)
break;
 
default:
-   printk(KERN_ERR PFX "%s: unknown chipset (mac_version = %d).\n",
-  tp->dev->name, tp->mac_version);
+   netif_err(tp, drv, tp->dev,
+ "unknown chipset (mac_version = %d)\n",
+ tp->mac_version);
break;
}
 }
-- 
2.18.0



Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned

2018-06-28 Thread Jiong Wang
On Tue, Jun 26, 2018 at 7:21 AM, Song Liu  wrote:
> On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
>  wrote:
>> From: Jiong Wang 



>> +
>> +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec)
>> +{
>> +   struct reciprocal_value_adv R;
>> +   u32 l, post_shift;
>> +   u64 mhigh, mlow;
>> +
>> +   l = fls(d - 1);
>> +   post_shift = l;
>> +   /* NOTE: mlow/mhigh could overflow u64 when l == 32 which means d has
>> +* MSB set. This case needs to be handled before calling
>> +* "reciprocal_value_adv", please see the comment at
>> +* include/linux/reciprocal_div.h.
>> +*/
>
> Shall we handle l == 32 case better? I guess the concern here is extra
> handling may
> slow down the fast path?

The implementation of "reciprocal_value_adv" doesn't handle l == 32,
as supporting it would make the code more complex.

As described in the pseudo code showing how to call
"reciprocal_value_adv" in include/linux/reciprocal_div.h, l == 32
means the MSB of the divisor is set, so the result of an unsigned
dividend/divisor division can only be 0 or 1; the quotient can
therefore be obtained with a comparison followed by a conditional
move of 0 or 1 into the result.

> If that's the case, we should at least add a WARNING on the slow path.

OK, I will add a pr_warn inside "reciprocal_value_adv" when l == 32 is
triggered.

Thanks,
Jiong
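
As a plain-C illustration of that special case (a sketch under the stated assumption that a divisor with its MSB set forces a 0-or-1 quotient; the real callers are the BPF JITs):

	#include <stdint.h>

	/* fls(d - 1) == 32 <=> the divisor d has its MSB set. Any u32
	 * dividend is then < 2 * d, so the unsigned quotient can only be
	 * 0 or 1 and no reciprocal is needed for this case. */
	static uint32_t div_by_const(uint32_t dividend, uint32_t d)
	{
		if (d & 0x80000000u)
			return dividend >= d;	/* comparison + conditional 0/1 */

		/* otherwise: the reciprocal-multiply path; plain division
		 * stands in for it in this sketch */
		return dividend / d;
	}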


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Cong Wang
On Wed, Jun 27, 2018 at 9:48 PM David Miller  wrote:
>
> This series doesn't apply cleanly to net-next, and also there seems to still
> be some discussion about how the iproute2 command line should look.
>

I am sure you know this, so just to be clear:

A redesign of "how iproute2 command line should look" usually means a
redesign in the kernel code too. Apparently, 'tc chaintemplate' is a new
subsystem under TC, while a 'tc filter template' is merely a new TC filter
attribute.


Re: [PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-28 Thread Jesus Sanchez-Palencia
Hi Willem,


On 06/28/2018 07:40 AM, Willem de Bruijn wrote:
> On Thu, Jun 28, 2018 at 10:26 AM Willem de Bruijn
>  wrote:
>>
>> On Wed, Jun 27, 2018 at 6:08 PM Jesus Sanchez-Palencia
>>  wrote:
>>>
>>> From: Richard Cochran 
>>>
>>> This patch introduces SO_TXTIME. User space enables this option in
>>> order to pass a desired future transmit time in a CMSG when calling
>>> sendmsg(2). The argument to this socket option is a 6-bytes long struct
>>> defined as:
>>>
>>> struct sock_txtime {
>>> clockid_t   clockid;
>>> u16 flags;
>>> };
>>
>> clockid_t is __kernel_clockid_t is int is a variable length field.
>> Please use fixed length fields.
> 
> Sorry, int is fine, of course, and clockid_t is used between userspace and
> kernel already.


Great. So, in addition to the other feedback in sock.c, what I'm thinking here
for the v2 is:

- move this struct and the flags definition (as enums) to
include/uapi/linux/net_tstamp.h;

- keep clockid as a clockid_t and increase flags to u32 since this already takes
8 bytes in total anyway;

- reduce sk_clockid and sk_txtime_flags in struct sock from a u16 to a u8 each.


Thanks,
Jesus



> 
>> Also, as MAX_CLOCKS is 16, only 4 bits are needed. A single u16
>> is probably sufficient as cmsg argument. To future proof, a u32 will
>> allow for more than 4 flags. But in struct sock, 16 bits should be
>> sufficient to encode both clock id and flags.
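
To make the discussion concrete, a hypothetical user-space sketch of the proposed interface follows. The SO_TXTIME/SCM_TXTIME names, their values, and the u64-nanoseconds cmsg payload are assumptions drawn from this thread, not a merged API:

	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <time.h>

	#ifndef SO_TXTIME
	#define SO_TXTIME  61		/* assumed option number */
	#define SCM_TXTIME SO_TXTIME	/* assumed cmsg type */
	#endif

	struct sock_txtime {
		clockid_t clockid;	/* reference clock, e.g. CLOCK_TAI */
		uint32_t  flags;	/* v2 proposal: widened from u16 to u32 */
	};

	static int send_at(int fd, const void *buf, size_t len, uint64_t txtime_ns)
	{
		static int enabled;
		char control[CMSG_SPACE(sizeof(txtime_ns))] = {0};
		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = control, .msg_controllen = sizeof(control),
		};
		struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

		if (!enabled) {		/* one-time: opt in on the socket */
			struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };

			if (setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg)))
				return -1;
			enabled = 1;
		}

		cm->cmsg_level = SOL_SOCKET;
		cm->cmsg_type = SCM_TXTIME;	/* per-packet transmit time */
		cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
		memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

		return sendmsg(fd, &msg, 0);
	}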


Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Cong Wang
On Wed, Jun 27, 2018 at 11:19 PM Jiri Pirko  wrote:
>
> Wed, Jun 27, 2018 at 07:04:32PM CEST, xiyou.wangc...@gmail.com wrote:
> >On Wed, Jun 27, 2018 at 9:46 AM Samudrala, Sridhar
> > wrote:
> >>
> >> On 6/27/2018 12:50 AM, Jiri Pirko wrote:
> >> > if you don't like "tc filter template add dev dummy0 ingress", how
> >> > about:
> >> > "tc template add dev dummy0 ingress ..."
> >> > "tc template add dev dummy0 ingress chain 22 ..."
> >> > that makes more sense I think.
> >
> >Better than 'tc filter template', but this doesn't reflect 'template'
> >is a template of tc filter, it could be an action etc., since it is in the
>
> It's a template of filter per chain. I don't understand how it could be
> an action...

It's because you have that in your mind from very beginning.

Think about what a new TC user's reaction is to 'tc template'
after he/she learns 'tc qdisc/filter/action'. It could be a template
of any of these 3, literally...


>
>
> >same position with 'tc action/filter/qdisc'.
> >
> >
> >>
> >> Isn't it possible to avoid introducing another keyword 'template',
> >>
> >> Can't we just do
> >>tc chain add dev dummy0 ingress flower chain_index 0
> >> to create a chain that takes any types of flower rules with index 0
> >> and
> >>   tc chain add dev dummy0 ingress flower chain_index 22
> >>  dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
> >>   tc chain add dev dummy0 ingress flower chain_index 23
> >>  dst_ip 192.168.0.0/16
> >> to create 2 chains 22 and 23 that allow rules with specific fields.
> >
> >Sounds good too. Since filter chain can be shared by qdiscs,
> >a 'tc chain' sub-command makes sense, and would probably make
> >it easier to be shared.
>
> We don't have such a specific object. It is implicit. We create it
> whenever someone uses it, either a filter or a chain. I don't like a new "tc
> chain" object in the cmdline. It really isn't one.

I discussed this with you at netconf, it is similar to tc actions,
tc actions can be shared not because they are implicitly created,
but because they could be created alone via `tc action add ...`.

If you don't share the chain, it is perfectly fine to create it
implicitly. If you do share, as in current code base, making it
standalone is reasonable.


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 09:40:21AM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig  wrote:
> >
> > Note that for this removes the possibility of actually returning an
> > error before waiting in poll.  We could still do this with an ERR_PTR
> > in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> > I'd like to defer that until actually required.
> 
> I'm still going to just revert the whole poll mess for now.
>
> It's still completely broken. This helps things, but it doesn't fix
> the fundamental issue: the new interface is strictly worse than the
> old interface ever was.
> 
> So Christoph, if you don't like the traditional ->poll() method, and
> you want something else for aio polling, I seriously will suggest that
> you introduce a new f_op for *that*. Don't mess with the existing
> ->poll() function, don't make select() and poll() use a slower and
> less capable function just because aio wants something else.
> 
> In other words, you need to see AIO as the less important case, not as
> the driver for this.
> 
> I also want to understand what the AIO race was, and what the problem
> with the poll() thing was. You claimed it was racy. I don't see it,
> and it was never ever explained in the whole series. I should never
> have pulled it in the first place if only for that reason, but I tend
> to trust Al when it comes to the VFS layer, so I did. My bad.

... and I should have pushed back harder, rather than getting sidetracked
into fixing the fs/aio.c-side races in this series ;-/

As for what can be salvaged out of the whole mess,
* probably the most serious lesson is that INDIRECT CALLS ARE
COSTLY NOW and shouldn't be used lightly.  That had been slow to sink
in and we'd better all realize how much the things have changed.
That, BTW, has implications going a lot further than poll-related stuff -
e.g. the whole "we'll deal with revoke-like issues in procfs/sysfs/debugfs
by wrapping method calls" needs to be reexamined.  And in poll-related
area, note that we have a lot of indirection levels for socket poll.
* having an optional pointer to wait_queue_head in struct file
is probably a good idea; a lot of ->poll() instances do have the same
form.  Even if sockets do not (and I'm not all that happy about the
solution in the latest series), the instances that do are common and
important enough.
* a *LOT* of ->poll() instances only block in __pollwait()
called (indirectly) on the first pass.  If we annotate those in some
way (flag set in ->open(), presence of a new method, whatever) and
limit aio-poll to just those, we could deal with the aio side without
disrupting select/poll at all; just use (in place of __pollwait)
a different callback that wouldn't try to allocate poll_table_entry
and worked with the stuff embedded into aio-poll iocb.

How much do you intend to revert?  Untangling just the ->poll()-related
parts from the rest of changes in fs/aio.c will be unpleasant; we might
end up reverting the whole tail of the series and redoing the things
that are not poll-related on top of the revert... ;-/


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 02:11:17PM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 1:28 PM Al Viro  wrote:
> >
> >
> > Sure, but...
> >
> > static __poll_t binder_poll(struct file *filp,
> > struct poll_table_struct *wait)
> > {
> > struct binder_proc *proc = filp->private_data;
> > struct binder_thread *thread = NULL;
> > bool wait_for_proc_work;
> >
> > thread = binder_get_thread(proc);
> > if (!thread)
> > return POLLERR;
> 
> That's actually fine.
> 
> In particular, it's ok to *not* add yourself to the wait-queues if you
> already return the value that you will always return.

Sure (and that's one of the problems I mentioned with ->get_poll_head() model).
But in this case I was referring to the GFP_KERNEL allocation down there.

> > And that's hardly unique - we have instances playing with timers,
> > allocations, whatnot.  Even straight mutex_lock(), as in
> 
> So?
> 
> Again, locking is permitted. It's not great, but it's not against the rules.

Me: a *LOT* of ->poll() instances only block in __pollwait() called (indirectly)
on the first pass.
 
You: They are *all* supposed to do it.

Me: 

I'm not saying that blocking on other things is a bug; some of such *are* bogus,
but a lot aren't really broken.  What I said is that in a lot of cases we really
have hard "no blocking other than in callback" (and on subsequent passes there's
no callback at all).  Which is just about perfect for AIO purposes, so *IF* we
go for "new method just for AIO, those who don't have it can take a hike", we
might as well indicate that "can take a hike" in some way (be it opt-in or
opt-out) and use straight unchanged ->poll(), with alternative callback.

Looks like we were talking past each other for the last couple of rounds...

> So none of the things you point to are excuses for changing interfaces
> or adding any flags.

> Anybody who thinks "select cannot block" or "->poll() mustn't block" is
> just confused. It has *never* been about that. It waits asynchronously
> for IO, but it may well wait synchronously for locks or memory or just
> "lazy implementation".

Obviously.  I *do* understand how poll() works, really.

> The fact is, those interface changes were just broken shit. They were
> confused. I don't actually believe that AIO even needed them.
> 
> Christoph, do you have a test program for IOCB_CMD_POLL and what it's
> actually supposed to do?
> 
> Because I think that what it can do is simply to do the ->poll() calls
> outside the iocb locks, and then just attach the poll table to the
> kioctx afterwards.

I'd do a bit more - embed the first poll_table_entry into poll iocb itself,
so that the instances that use only one queue wouldn't need any allocations
at all.


[PATCH bpf-next 3/8] tools: libbpf: allow setting ifindex for programs and maps

2018-06-28 Thread Jakub Kicinski
Users of bpf_object__open()/bpf_object__load() APIs may want to
load the programs and maps onto a device for offload.  Allow
setting ifindex on those sub-objects.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 10 ++
 tools/lib/bpf/libbpf.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a1491e95edd0..7bc02d93e277 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1896,6 +1896,11 @@ void *bpf_program__priv(struct bpf_program *prog)
return prog ? prog->priv : ERR_PTR(-EINVAL);
 }
 
+void bpf_program__set_ifindex(struct bpf_program *prog, __u32 ifindex)
+{
+   prog->prog_ifindex = ifindex;
+}
+
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy)
 {
const char *title;
@@ -2122,6 +2127,11 @@ void *bpf_map__priv(struct bpf_map *map)
return map ? map->priv : ERR_PTR(-EINVAL);
 }
 
+void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex)
+{
+   map->map_ifindex = ifindex;
+}
+
 struct bpf_map *
 bpf_map__next(struct bpf_map *prev, struct bpf_object *obj)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 09976531aa74..564f4be9bae0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -109,6 +109,7 @@ int bpf_program__set_priv(struct bpf_program *prog, void 
*priv,
  bpf_program_clear_priv_t clear_priv);
 
 void *bpf_program__priv(struct bpf_program *prog);
+void bpf_program__set_ifindex(struct bpf_program *prog, __u32 ifindex);
 
 const char *bpf_program__title(struct bpf_program *prog, bool needs_copy);
 
@@ -251,6 +252,7 @@ typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void 
*);
 int bpf_map__set_priv(struct bpf_map *map, void *priv,
  bpf_map_clear_priv_t clear_priv);
 void *bpf_map__priv(struct bpf_map *map);
+void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
 int bpf_map__pin(struct bpf_map *map, const char *path);
 
 long libbpf_get_error(const void *ptr);
-- 
2.17.1
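
A minimal usage sketch of the new setters (error handling simplified; the device name is hypothetical, and the iteration helpers are the libbpf ones of this era):

	#include <net/if.h>
	#include "libbpf.h"

	/* Mark every program and map in an object for offload to one
	 * netdev, then load the object. */
	static int load_offloaded(const char *path, const char *dev)
	{
		struct bpf_object *obj = bpf_object__open(path);
		__u32 ifindex = if_nametoindex(dev);
		struct bpf_program *prog;
		struct bpf_map *map;

		if (libbpf_get_error(obj) || !ifindex)
			return -1;

		bpf_object__for_each_program(prog, obj)
			bpf_program__set_ifindex(prog, ifindex);

		for (map = bpf_map__next(NULL, obj); map;
		     map = bpf_map__next(map, obj))
			bpf_map__set_ifindex(map, ifindex);

		return bpf_object__load(obj);
	}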



[PATCH bpf-next 8/8] tools: bpftool: deal with options upfront

2018-06-28 Thread Jakub Kicinski
Remove options (in getopt() sense, i.e. starting with a dash like
-n or --NAME) while parsing arguments for bash completions.  This
allows us to refer to position-dependent parameters better, and
complete options at any point.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/bash-completion/bpftool | 32 +++
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index b0b8022d3570..fffd76f4998b 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -153,6 +153,13 @@ _bpftool()
 local cur prev words objword
 _init_completion || return
 
+# Deal with options
+if [[ ${words[cword]} == -* ]]; then
+local c='--version --json --pretty --bpffs'
+COMPREPLY=( $( compgen -W "$c" -- "$cur" ) )
+return 0
+fi
+
 # Deal with simplest keywords
 case $prev in
 help|hex|opcodes|visual)
@@ -172,20 +179,23 @@ _bpftool()
 ;;
 esac
 
-# Search for object and command
-local object command cmdword
-for (( cmdword=1; cmdword < ${#words[@]}-1; cmdword++ )); do
-[[ -n $object ]] && command=${words[cmdword]} && break
-[[ ${words[cmdword]} != -* ]] && object=${words[cmdword]}
+# Remove all options so completions don't have to deal with them.
+local i
+for (( i=1; i < ${#words[@]}; )); do
+if [[ ${words[i]::1} == - ]]; then
+words=( "${words[@]:0:i}" "${words[@]:i+1}" )
+[[ $i -le $cword ]] && cword=$(( cword - 1 ))
+else
+i=$(( ++i ))
+fi
 done
+cur=${words[cword]}
+prev=${words[cword - 1]}
 
-if [[ -z $object ]]; then
+local object=${words[1]} command=${words[2]}
+
+if [[ -z $object || $cword -eq 1 ]]; then
 case $cur in
--*)
-local c='--version --json --pretty --bpffs'
-COMPREPLY=( $( compgen -W "$c" -- "$cur" ) )
-return 0
-;;
 *)
 COMPREPLY=( $( compgen -W "$( bpftool help 2>&1 | \
 command sed \
-- 
2.17.1



[PATCH bpf-next 6/8] tools: bpftool: drop unnecessary Author comments

2018-06-28 Thread Jakub Kicinski
Drop my author comments; those are from the early days of
bpftool and make little sense in tree, where we have quite
a few people contributing and git to attribute the work.

While at it bump some copyrights.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/common.c | 2 --
 tools/bpf/bpftool/main.c   | 4 +---
 tools/bpf/bpftool/main.h   | 2 --
 tools/bpf/bpftool/map.c| 2 --
 tools/bpf/bpftool/prog.c   | 4 +---
 5 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 32f9e397a6c0..b432daea4520 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -31,8 +31,6 @@
  * SOFTWARE.
  */
 
-/* Author: Jakub Kicinski  */
-
 #include 
 #include 
 #include 
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index eea7f14355f3..d15a62be6cf0 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2017 Netronome Systems, Inc.
+ * Copyright (C) 2017-2018 Netronome Systems, Inc.
  *
  * This software is dual licensed under the GNU General License Version 2,
  * June 1991 as shown in the file COPYING in the top-level directory of this
@@ -31,8 +31,6 @@
  * SOFTWARE.
  */
 
-/* Author: Jakub Kicinski  */
-
 #include 
 #include 
 #include 
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 63fdb310b9a4..d39f7ef01d23 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -31,8 +31,6 @@
  * SOFTWARE.
  */
 
-/* Author: Jakub Kicinski  */
-
 #ifndef __BPF_TOOL_H
 #define __BPF_TOOL_H
 
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 097b1a5e046b..5989e1575ae4 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -31,8 +31,6 @@
  * SOFTWARE.
  */
 
-/* Author: Jakub Kicinski  */
-
 #include 
 #include 
 #include 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 05f42a46d6ed..fd8cd9b51621 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2017 Netronome Systems, Inc.
+ * Copyright (C) 2017-2018 Netronome Systems, Inc.
  *
  * This software is dual licensed under the GNU General License Version 2,
  * June 1991 as shown in the file COPYING in the top-level directory of this
@@ -31,8 +31,6 @@
  * SOFTWARE.
  */
 
-/* Author: Jakub Kicinski  */
-
 #include 
 #include 
 #include 
-- 
2.17.1



[PATCH bpf-next 7/8] tools: bpftool: add missing --bpffs to completions

2018-06-28 Thread Jakub Kicinski
--bpffs is not suggested by bash completions.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/bash-completion/bpftool | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 1e1083321643..b0b8022d3570 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -182,7 +182,7 @@ _bpftool()
 if [[ -z $object ]]; then
 case $cur in
 -*)
-local c='--version --json --pretty'
+local c='--version --json --pretty --bpffs'
 COMPREPLY=( $( compgen -W "$c" -- "$cur" ) )
 return 0
 ;;
-- 
2.17.1



[net-next 01/12] net/mlx5e: Add UDP GSO support

2018-06-28 Thread Saeed Mahameed
From: Boris Pismenny 

This patch enables UDP GSO support. We enable this by using two WQEs:
the first is a UDP LSO WQE for all segments of equal length, and the
second is for the last segment in case it has a different length.
Due to a HW limitation, we must adjust the packet length fields before
sending.

We measure performance between two Intel(R) Xeon(R) CPU E5-2643 v2 @3.50GHz
machines connected back-to-back with Connectx4-Lx (40Gbps) NICs.
We compare single stream UDP, UDP GSO and UDP GSO with offload.
Performance:
                | MSS (bytes) | Throughput (Gbps) | CPU utilization (%)
UDP GSO offload | 1472        | 35.6              | 8%
UDP GSO         | 1472        | 25.5              | 17%
UDP             | 1472        | 10.2              | 17%
UDP GSO offload | 1024        | 35.6              | 8%
UDP GSO         | 1024        | 19.2              | 17%
UDP             | 1024        | 5.7               | 17%
UDP GSO offload | 512         | 33.8              | 16%
UDP GSO         | 512         | 10.4              | 17%
UDP             | 512         | 3.5               | 17%

Signed-off-by: Boris Pismenny 
Signed-off-by: Yossi Kuperman 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h|  11 +-
 .../mellanox/mlx5/core/en_accel/rxtx.c| 108 ++
 .../mellanox/mlx5/core/en_accel/rxtx.h|  14 +++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   3 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   8 +-
 6 files changed, 139 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9efbf193ad5a..d923f2f58608 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -14,8 +14,8 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o 
fpga/conn.o fpga/sdk.o \
fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
-   en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
-   en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
+   en_tx.o en_rx.o en_dim.o en_txrx.o en_accel/rxtx.o en_stats.o  \
+   vxlan.o en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
 
 mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index f20074dbef32..39a5d13ba459 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -34,12 +34,11 @@
 #ifndef __MLX5E_EN_ACCEL_H__
 #define __MLX5E_EN_ACCEL_H__
 
-#ifdef CONFIG_MLX5_ACCEL
-
 #include 
 #include 
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
+#include "en_accel/rxtx.h"
 #include "en.h"
 
 static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
@@ -64,9 +63,13 @@ static inline struct sk_buff *mlx5e_accel_handle_tx(struct 
sk_buff *skb,
}
 #endif
 
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+   skb = mlx5e_udp_gso_handle_tx_skb(dev, sq, skb, wqe, pi);
+   if (unlikely(!skb))
+   return NULL;
+   }
+
return skb;
 }
 
-#endif /* CONFIG_MLX5_ACCEL */
-
 #endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
new file mode 100644
index ..4bb1f3b12b96
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
@@ -0,0 +1,108 @@
+#include "en_accel/rxtx.h"
+
+static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
+  struct sk_buff *nskb,
+  int remaining)
+{
+   int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
+   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+   int payload_len = remaining + sizeof(struct udphdr);
+   int k = 0, i, j;
+
+   skb_copy_bits(skb, 0, nskb->data, headlen);
+   nskb->dev = skb->dev;
+   skb_reset_mac_header(nskb);
+   skb_set_network_header(nskb, skb_network_offset(skb));
+   skb_set_transport_header(nskb, skb_transport_offset(skb));
+   skb_set_tail_pointer(nskb, headlen);
+
+   /* How many frags do we need? */
+   for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+   bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
+   k++;
+   if (bytes_needed <= 0)
+   break;
+   
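
(The patch body is truncated above.) To sketch the idea in the commit message -- all equal-length segments go out in one UDP LSO WQE while an odd-sized tail goes out separately -- in illustrative pseudo-driver code, with hypothetical helpers alloc_last_segment(), trim_to_full_segments() and xmit_wqe():

	static void udp_gso_xmit(struct sk_buff *skb)
	{
		unsigned int headlen = skb_transport_offset(skb) +
				       sizeof(struct udphdr);
		unsigned int mss = skb_shinfo(skb)->gso_size;
		unsigned int remaining = (skb->len - headlen) % mss;

		if (remaining) {
			/* the last segment has a different length:
			 * send it as its own WQE */
			struct sk_buff *nskb = alloc_last_segment(skb, remaining);

			trim_to_full_segments(skb, remaining);
			xmit_wqe(nskb);
		}
		xmit_wqe(skb);	/* one LSO WQE over the equal-length segments */
	}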

[net-next 05/12] net/mlx5e: Add TX completions statistics

2018-06-28 Thread Saeed Mahameed
From: Tariq Toukan 

Add per-ring and global ethtool counters for TX completions.
This helps us monitor and analyze TX flow performance.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 9 +++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 7e7155b4e0f0..d35361b1b3fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -67,6 +67,7 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_dropped) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_recover) },
+   { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_cqes) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_wake) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_udp_seg_rem) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_cqe_err) },
@@ -172,6 +173,7 @@ void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
s->tx_tls_ooo   += sq_stats->tls_ooo;
s->tx_tls_resync_bytes  += sq_stats->tls_resync_bytes;
 #endif
+   s->tx_cqes  += sq_stats->cqes;
}
}
 
@@ -1142,6 +1144,7 @@ static const struct counter_desc sq_stats_desc[] = {
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, dropped) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, xmit_more) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, recover) },
+   { MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, cqes) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, wake) },
{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, cqe_err) },
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index d416bb86e747..8f2dfe56fdef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -78,6 +78,7 @@ struct mlx5e_sw_stats {
u64 tx_queue_dropped;
u64 tx_xmit_more;
u64 tx_recover;
+   u64 tx_cqes;
u64 tx_queue_wake;
u64 tx_udp_seg_rem;
u64 tx_cqe_err;
@@ -208,7 +209,8 @@ struct mlx5e_sq_stats {
u64 dropped;
u64 recover;
/* dirtied @completion */
-   u64 wake cacheline_aligned_in_smp;
+   u64 cqes cacheline_aligned_in_smp;
+   u64 wake;
u64 cqe_err;
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f450d9ca31fb..f0739dae7b56 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -468,6 +468,7 @@ static void mlx5e_dump_error_cqe(struct mlx5e_txqsq *sq,
 
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 {
+   struct mlx5e_sq_stats *stats;
struct mlx5e_txqsq *sq;
struct mlx5_cqe64 *cqe;
u32 dma_fifo_cc;
@@ -485,6 +486,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
if (!cqe)
return false;
 
+   stats = sq->stats;
+
npkts = 0;
nbytes = 0;
 
@@ -513,7 +516,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
queue_work(cq->channel->priv->wq,
   &sq->recover.recover_work);
}
-   sq->stats->cqe_err++;
+   stats->cqe_err++;
}
 
do {
@@ -558,6 +561,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
} while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq)));
 
+   stats->cqes += i;
+
mlx5_cqwq_update_db_record(&cq->wq);
 
/* ensure cq space is freed before enabling more cqes */
@@ -573,7 +578,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
   MLX5E_SQ_STOP_ROOM) &&
!test_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state)) {
netif_tx_wake_queue(sq->txq);
-   sq->stats->wake++;
+   stats->wake++;
}
 
return (i == MLX5E_TX_CQ_POLL_BUDGET);
-- 
2.17.0



[net-next 03/12] net/mlx5e: Convert large order kzalloc allocations to kvzalloc

2018-06-28 Thread Saeed Mahameed
From: Tariq Toukan 

Replace calls to kzalloc_node with kvzalloc_node, as it falls back
to lower-order pages if the higher-order trials fail.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 44 +--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e2ef68b1daa2..42ef8c818544 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -352,8 +352,8 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
 {
int wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq);
 
-   rq->mpwqe.info = kcalloc_node(wq_sz, sizeof(*rq->mpwqe.info),
- GFP_KERNEL, cpu_to_node(c->cpu));
+   rq->mpwqe.info = kvzalloc_node(wq_sz * sizeof(*rq->mpwqe.info),
+  GFP_KERNEL, cpu_to_node(c->cpu));
if (!rq->mpwqe.info)
return -ENOMEM;
 
@@ -670,7 +670,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 err_free:
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-   kfree(rq->mpwqe.info);
+   kvfree(rq->mpwqe.info);
mlx5_core_destroy_mkey(mdev, &rq->umr_mkey);
break;
default: /* MLX5_WQ_TYPE_CYCLIC */
@@ -702,7 +702,7 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-   kfree(rq->mpwqe.info);
+   kvfree(rq->mpwqe.info);
mlx5_core_destroy_mkey(rq->mdev, &rq->umr_mkey);
break;
default: /* MLX5_WQ_TYPE_CYCLIC */
@@ -965,15 +965,15 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
 
 static void mlx5e_free_xdpsq_db(struct mlx5e_xdpsq *sq)
 {
-   kfree(sq->db.di);
+   kvfree(sq->db.di);
 }
 
 static int mlx5e_alloc_xdpsq_db(struct mlx5e_xdpsq *sq, int numa)
 {
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
 
-   sq->db.di = kcalloc_node(wq_sz, sizeof(*sq->db.di),
-GFP_KERNEL, numa);
+   sq->db.di = kvzalloc_node(sizeof(*sq->db.di) * wq_sz,
+ GFP_KERNEL, numa);
if (!sq->db.di) {
mlx5e_free_xdpsq_db(sq);
return -ENOMEM;
@@ -1024,15 +1024,15 @@ static void mlx5e_free_xdpsq(struct mlx5e_xdpsq *sq)
 
 static void mlx5e_free_icosq_db(struct mlx5e_icosq *sq)
 {
-   kfree(sq->db.ico_wqe);
+   kvfree(sq->db.ico_wqe);
 }
 
 static int mlx5e_alloc_icosq_db(struct mlx5e_icosq *sq, int numa)
 {
u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
 
-   sq->db.ico_wqe = kcalloc_node(wq_sz, sizeof(*sq->db.ico_wqe),
- GFP_KERNEL, numa);
+   sq->db.ico_wqe = kvzalloc_node(sizeof(*sq->db.ico_wqe) * wq_sz,
+  GFP_KERNEL, numa);
if (!sq->db.ico_wqe)
return -ENOMEM;
 
@@ -1077,8 +1077,8 @@ static void mlx5e_free_icosq(struct mlx5e_icosq *sq)
 
 static void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
 {
-   kfree(sq->db.wqe_info);
-   kfree(sq->db.dma_fifo);
+   kvfree(sq->db.wqe_info);
+   kvfree(sq->db.dma_fifo);
 }
 
 static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
@@ -1086,10 +1086,10 @@ static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, 
int numa)
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
 
-   sq->db.dma_fifo = kcalloc_node(df_sz, sizeof(*sq->db.dma_fifo),
-  GFP_KERNEL, numa);
-   sq->db.wqe_info = kcalloc_node(wq_sz, sizeof(*sq->db.wqe_info),
-  GFP_KERNEL, numa);
+   sq->db.dma_fifo = kvzalloc_node(df_sz * sizeof(*sq->db.dma_fifo),
+   GFP_KERNEL, numa);
+   sq->db.wqe_info = kvzalloc_node(wq_sz * sizeof(*sq->db.wqe_info),
+   GFP_KERNEL, numa);
if (!sq->db.dma_fifo || !sq->db.wqe_info) {
mlx5e_free_txqsq_db(sq);
return -ENOMEM;
@@ -1893,7 +1893,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, 
int ix,
int err;
int eqn;
 
-   c = kzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
+   c = kvzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
if (!c)
return -ENOMEM;
 
@@ -1979,7 +1979,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, 
int ix,
 
 err_napi_del:
netif_napi_del(&c->napi);
-   kfree(c);
+   kvfree(c);
 
return err;
 }
@@ -2018,7 +2018,7 @@ static void mlx5e_close_channel(struct mlx5e_channel *c)
mlx5e_close_cq(&c->icosq.cq);
netif_napi_del(&c->napi);
 
-   kfree(c);
+   kvfree(c);
 }
 
 #define 

Re: [bpf-next PATCH 0/2] xdp/bpf: extend XDP samples/bpf xdp_rxq_info

2018-06-28 Thread Daniel Borkmann
On 06/25/2018 04:27 PM, Jesper Dangaard Brouer wrote:
> While writing an article about XDP, the samples/bpf xdp_rxq_info
> program were extended to cover some more use-cases.

Applied to bpf-next, thanks guys!


Re: [PATCH] test_bpf: flag tests that cannot be jited on s390

2018-06-28 Thread Daniel Borkmann
On 06/27/2018 05:19 PM, Kleber Sacilotto de Souza wrote:
> Flag with FLAG_EXPECTED_FAIL the BPF_MAXINSNS tests that cannot be jited
> on s390 because they exceed BPF_SIZE_MAX and fail when
> CONFIG_BPF_JIT_ALWAYS_ON is set. Also set .expected_errcode to -ENOTSUPP
> so the tests pass in that case.
> 
> Signed-off-by: Kleber Sacilotto de Souza 

Applied to bpf, thanks Kleber!


Re: [PATCH v3 bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-28 Thread Daniel Borkmann
On 06/27/2018 01:21 AM, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> For ACLs implemented using either FIB rules or FIB entries, the BPF
> program needs the FIB lookup status to be able to drop the packet.
> Since the bpf_fib_lookup API has not reached a released kernel yet,
> change the return code to contain an encoding of the FIB lookup
> result and return the nexthop device index in the params struct.
> 
> In addition, inform the BPF program of any post FIB lookup reason as
> to why the packet needs to go up the stack.
> 
> The fib result for unicast routes must have an egress device, so remove
> the check that it is non-NULL.
> 
> Signed-off-by: David Ahern 

Applied to bpf, thanks David!
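
As a rough illustration of the new contract -- a sketch, not code from the patch; it uses the BPF_FIB_LKUP_RET_* names this series introduces:

	SEC("xdp")
	int xdp_fwd(struct xdp_md *ctx)
	{
		struct bpf_fib_lookup params = {};
		int rc;

		/* ... parse headers, fill params.family/ifindex/addresses ... */

		rc = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
		if (rc == BPF_FIB_LKUP_RET_SUCCESS)
			/* params.ifindex now holds the nexthop device */
			return bpf_redirect(params.ifindex, 0);
		if (rc == BPF_FIB_LKUP_RET_BLACKHOLE ||
		    rc == BPF_FIB_LKUP_RET_UNREACHABLE ||
		    rc == BPF_FIB_LKUP_RET_PROHIBIT)
			return XDP_DROP;  /* ACL via FIB rules/entries */

		return XDP_PASS;	  /* anything else goes up the stack */
	}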


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-28 Thread Cong Wang
On Thu, Jun 28, 2018 at 6:10 AM Jiri Pirko  wrote:
> Add a template of type flower allowing to insert rules matching on last
> 2 bytes of destination mac address:
> # tc chaintemplate add dev dummy0 ingress proto ip flower dst_mac 
> 00:00:00:00:00:00/00:00:00:00:FF:FF
>
> The template is now showed in the list:
> # tc chaintemplate show dev dummy0 ingress
> chaintemplate flower chain 0
>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>   eth_type ipv4
>
> Add another template, this time for chain number 22:
> # tc chaintemplate add dev dummy0 ingress proto ip chain 22 flower dst_ip 
> 0.0.0.0/16
> # tc chaintemplate show dev dummy0 ingress
> chaintemplate flower chain 0
>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>   eth_type ipv4
> chaintemplate flower chain 22
>   eth_type ipv4
>   dst_ip 0.0.0.0/16

So, if I want to check the template of a chain, I have to use
'tc chaintemplate... chain X'.

If I want to check the filters in a chain, I have to use
'tc filter show  chain X'.

If you introduce 'tc chain', it would just need one command:
`tc chain show ... X` which could list its template first and
followed by filters in this chain, something like:

# tc chain show dev eth0 chain X
template: # could be none

filter1
...
filter2
...

Isn't it more elegant?


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 3:20 PM Al Viro  wrote:
>
> The rules for drivers change only in one respect - if your ->poll() is going 
> to
> need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return 
> EPOLLNVAL
> in such case.

I still don't even understand why you care.

Yes, the AIO poll implementation did it under the spinlock.

But there's no good *reason* for that.  The "aio_poll()" function
itself is called in perfectly fine blocking context.

The only reason it does it under the spinlock is that apparently
Christoph didn't understand how poll() worked.

As far as I can tell, Christoph could have just done the first pass
'->poll()' *without* taking a spinlock, and that adds the table entry
to the table. Then, *under the spinlock*, you associate the table with
the kioctx. And then *after* the spinlock, you can call "->poll()"
again (now with a NULL table pointer), to verify that the state is
still not triggered. That's the whole point of the two-phase poll
thing - the first phase adds the entry to the wait queues, and the
second phase checks for the race of "did the event happen in the
meantime".
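
In code form, that sequence might look roughly like this (a sketch; attach_poll_entry() and complete_poll_iocb() are hypothetical helpers, not the actual fs/aio.c code):

	/* Pass 1: no spinlock held; ->poll() may block, and via the
	 * poll_table callback it adds our waiter to the wait queues. */
	mask = file->f_op->poll(file, &apt.pt);

	/* Take the lock only for the bookkeeping. */
	spin_lock_irq(&ctx->ctx_lock);
	attach_poll_entry(ctx, iocb);
	spin_unlock_irq(&ctx->ctx_lock);

	/* Pass 2: NULL table, pure re-check -- did the event fire while
	 * the entry was being armed? */
	mask = file->f_op->poll(file, NULL);
	if (mask)
		complete_poll_iocb(iocb, mask);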

There is absolutely no excuse for calling '->poll()' itself under the
spinlock. I don't see any reason for it. The whole "AIO needs this to
avoid races" was always complete and utter bullshit, as far as I can
tell.

So stop it with this crazy and pointless "poll() might block".

IT DAMN WELL SHOULD BE ABLE TO BLOCK, AND NOBODY SANE WILL EVER CARE!

If somebody cares, they are doing things wrong. So fix the AIO code,
don't look at the poll() code, for chrissake!

   Linus


Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus

2018-06-28 Thread Carlos Munoz



On 06/28/2018 01:41 AM, Andrew Lunn wrote:
> External Email
>
>> +static char *mix_port;
>> +module_param(mix_port, charp, 0444);
>> +MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX 
>> interfaces.");
>> +
>> +static char *pki_port;
>> +module_param(pki_port, charp, 0444);
>> +MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
> Module parameters are generally not liked. Can you do without them?

These parameters change the kernel port assignment required by user space
applications. We'd rather keep them, as they simplify the process.

>
>> + /* One time request driver module */
>> + if (is_mix) {
>> + if (atomic_cmpxchg(_mgmt_once, 0, 1) == 0)
>> + request_module_nowait("octeon_mgmt");
> Why is this needed? So long as the driver has the needed properties,
> udev should load the module.
>
>  Andrew

The thing is the management module is only loaded when a port is assigned to it 
(determined by the above module parameter "mix_port").

Best regards,
Carlos


[PATCH bpf 3/3] bpf: undo prog rejection on read-only lock failure

2018-06-28 Thread Daniel Borkmann
Partially undo commit 9facc336876f ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artificially triggered
allocation error which is then propagated down to the call site.

Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were in fact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today in fact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.

Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth moving forward with in future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.

Reported-by: syzbot+a4eb8c7766952a1ca...@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328ea...@syzkaller.appspotmail.com
Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott 
Cc: Kees Cook 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h | 56 --
 kernel/bpf/core.c  | 30 +--
 2 files changed, 9 insertions(+), 77 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 20f2659..300baad 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -470,9 +470,7 @@ struct sock_fprog_kern {
 };
 
 struct bpf_binary_header {
-   u16 pages;
-   u16 locked:1;
-
+   u32 pages;
/* Some arches need word alignment for their instructions */
u8 image[] __aligned(4);
 };
@@ -481,7 +479,7 @@ struct bpf_prog {
u16 pages;  /* Number of allocated pages */
u16 jited:1,/* Is our filter JIT'ed? */
jit_requested:1,/* archs need to JIT the prog */
-   locked:1,   /* Program image locked? */
+   undo_set_mem:1, /* Passed set_memory_ro() checkpoint */
gpl_compatible:1, /* Is filter GPL compatible? */
cb_access:1,/* Is control block accessed? */
dst_needed:1,   /* Do we need dst entry? */
@@ -677,46 +675,24 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 
size_default)
 
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
-#ifdef CONFIG_ARCH_HAS_SET_MEMORY
-   fp->locked = 1;
-   if (set_memory_ro((unsigned long)fp, fp->pages))
-   fp->locked = 0;
-#endif
+   fp->undo_set_mem = 1;
+   set_memory_ro((unsigned long)fp, fp->pages);
 }
 
 static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 {
-#ifdef 

[PATCH bpf 1/3] bpf, arm32: fix to use bpf_jit_binary_lock_ro api

2018-06-28 Thread Daniel Borkmann
Any eBPF JIT whose underlying arch supports ARCH_HAS_SET_MEMORY
needs to use the bpf_jit_binary_{un,}lock_ro() pair instead of the
set_memory_{ro,rw}() pair directly as otherwise changes to the former
might break. arm32's eBPF conversion missed to change it, so fix this
up here.

Fixes: 39c13c204bb1 ("arm: eBPF JIT compiler")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/arm/net/bpf_jit_32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 6e8b716..f6a62ae 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1844,7 +1844,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
*prog)
/* there are 2 passes here */
bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
-   set_memory_ro((unsigned long)header, header->pages);
+   bpf_jit_binary_lock_ro(header);
prog->bpf_func = (void *)ctx.target;
prog->jited = 1;
prog->jited_len = image_size;
-- 
2.9.5



[PATCH bpf 0/3] Three BPF fixes

2018-06-28 Thread Daniel Borkmann
This set contains three fixes that are mostly JIT and set_memory_*()
related. The third in the series in particular fixes the syzkaller
bugs that were still pending; aside from local reproduction & testing,
also 'syz test' wasn't able to trigger them anymore. I've tested this
series on x86_64, arm64 and s390x, and kbuild bot wasn't yelling either
for the rest. For details, please see patches as usual, thanks!

Daniel Borkmann (3):
  bpf, arm32: fix to use bpf_jit_binary_lock_ro api
  bpf, s390: fix potential memleak when later bpf_jit_prog fails
  bpf: undo prog rejection on read-only lock failure

 arch/arm/net/bpf_jit_32.c|  2 +-
 arch/s390/net/bpf_jit_comp.c |  1 +
 include/linux/filter.h   | 56 +++-
 kernel/bpf/core.c| 30 +---
 4 files changed, 11 insertions(+), 78 deletions(-)

-- 
2.9.5



[PATCH bpf 2/3] bpf, s390: fix potential memleak when later bpf_jit_prog fails

2018-06-28 Thread Daniel Borkmann
If we would ever fail in the bpf_jit_prog() pass that writes the
actual insns to the image after we got header via bpf_jit_binary_alloc()
then we also need to make sure to free it through bpf_jit_binary_free()
again when bailing out. Given we had prior bpf_jit_prog() passes to
initially probe for clobbered registers, program size and to fill in
addrs array for jump targets, this is more of a theoretical one,
but at least make sure this doesn't break with future changes.

Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend")
Signed-off-by: Daniel Borkmann 
Cc: Martin Schwidefsky 
Acked-by: Alexei Starovoitov 
---
 arch/s390/net/bpf_jit_comp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index d2db8ac..5f0234e 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -1286,6 +1286,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
goto free_addrs;
}
if (bpf_jit_prog(, fp)) {
+   bpf_jit_binary_free(header);
fp = orig_fp;
goto free_addrs;
}
-- 
2.9.5



Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 2:30 PM Al Viro  wrote:
>
> > Again, locking is permitted. It's not great, but it's not against the rules.
>
> Me: a *LOT* of ->poll() instances only block in __pollwait() called 
> (indirectly)
> on the first pass.
>
> You: They are *all* supposed to do it.
>
> Me: 

Oh, I thought you were talking about the whole "first pass" adding to
wait queues, as opposed to doing it on the second pass.

The *blocking* is entirely immaterial. I didn't even react to it,
because it's simply not an issue.

I don't understand why you're even hung up about it.

The only reason "blocking" seems to be an issue is because AIO has
shit-for-brains and wanted to do poll() under the spinlock.

But that's literally just AIO being confused garbage. It has zero
relevance for anything else.

Linus


[PATCH bpf-next 1/8] tools: bpftool: use correct make variable type to improve compilation time

2018-06-28 Thread Jakub Kicinski
Commit 4bfe3bd3cc35 ("tools/bpftool: use version from the kernel
source tree") added version to bpftool.  The version used is
equal to the kernel version and obtained by running make kernelversion
against kernel source tree.  Version is then communicated
to the sources with a command line define set in CFLAGS.

Use a simply expanded variable for the version, otherwise the
recursive make will run every time CFLAGS are used.

This brings the single-job compilation time for me from almost
16 sec down to less than 4 sec.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 892dbf095bff..0911b00b25cc 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -23,7 +23,7 @@ endif
 
 LIBBPF = $(BPF_PATH)libbpf.a
 
-BPFTOOL_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion)
+BPFTOOL_VERSION := $(shell make --no-print-directory -sC ../../.. kernelversion)
 
 $(LIBBPF): FORCE
$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a 
FEATURES_DUMP=$(FEATURE_DUMP_EXPORT)
-- 
2.17.1
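
A two-line illustration of the difference (a sketch, not part of the patch): with recursively expanded '=', the $(shell ...) re-runs on every expansion of the variable; with simply expanded ':=', it runs exactly once, at assignment time.

	# re-invokes the sub-make each time $(SLOW_VERSION) is expanded
	SLOW_VERSION  = $(shell make --no-print-directory -sC ../../.. kernelversion)
	# evaluates once, when this line is parsed
	FAST_VERSION := $(shell make --no-print-directory -sC ../../.. kernelversion)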



[PATCH bpf-next 2/8] tools: libbpf: add section names for missing program types

2018-06-28 Thread Jakub Kicinski
Specify default section names for BPF_PROG_TYPE_LIRC_MODE2
and BPF_PROG_TYPE_LWT_SEG6LOCAL; these are the only two
missing right now.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a1e96b5de5ff..a1491e95edd0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2037,9 +2037,11 @@ static const struct {
BPF_PROG_SEC("lwt_in",  BPF_PROG_TYPE_LWT_IN),
BPF_PROG_SEC("lwt_out", BPF_PROG_TYPE_LWT_OUT),
BPF_PROG_SEC("lwt_xmit",BPF_PROG_TYPE_LWT_XMIT),
+   BPF_PROG_SEC("lwt_seg6local",   BPF_PROG_TYPE_LWT_SEG6LOCAL),
BPF_PROG_SEC("sockops", BPF_PROG_TYPE_SOCK_OPS),
BPF_PROG_SEC("sk_skb",  BPF_PROG_TYPE_SK_SKB),
BPF_PROG_SEC("sk_msg",  BPF_PROG_TYPE_SK_MSG),
+   BPF_PROG_SEC("lirc_mode2",  BPF_PROG_TYPE_LIRC_MODE2),
BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
-- 
2.17.1



[PATCH bpf-next 0/8] tools: bpf: updates to bpftool and libbpf

2018-06-28 Thread Jakub Kicinski
Hi!

Set of random updates to bpftool and libbpf.  I'm preparing for
extending bpftool prog load, but there is a good number of
improvements that can be made before bpf -> bpf-next merge
helping to keep the later patch set to a manageable size as well.

First patch is a bpftool build speed improvement.  Next missing
program types are added to libbpf program type detection by section
name.  The ability to load programs from the '.text' section is restored
when the ELF file doesn't contain any pseudo calls.

In bpftool I remove my Author comments as an unnecessary sign of vanity.
Last but not least missing option is added to bash completions and
processing of options in bash completions is improved.

Jakub Kicinski (8):
  tools: bpftool: use correct make variable type to improve compilation
time
  tools: libbpf: add section names for missing program types
  tools: libbpf: allow setting ifindex for programs and maps
  tools: libbpf: restore the ability to load programs from .text section
  tools: libbpf: don't return '.text' as a program for multi-function
programs
  tools: bpftool: drop unnecessary Author comments
  tools: bpftool: add missing --bpffs to completions
  tools: bpftool: deal with options upfront

 tools/bpf/bpftool/Makefile|  2 +-
 tools/bpf/bpftool/bash-completion/bpftool | 32 ++-
 tools/bpf/bpftool/common.c|  2 -
 tools/bpf/bpftool/main.c  |  4 +-
 tools/bpf/bpftool/main.h  |  2 -
 tools/bpf/bpftool/map.c   |  2 -
 tools/bpf/bpftool/prog.c  |  4 +-
 tools/lib/bpf/libbpf.c| 49 ++-
 tools/lib/bpf/libbpf.h|  2 +
 9 files changed, 66 insertions(+), 33 deletions(-)

-- 
2.17.1



[PATCH bpf-next 4/8] tools: libbpf: restore the ability to load programs from .text section

2018-06-28 Thread Jakub Kicinski
libbpf used to be able to load programs from the default section
called '.text'.  It's not very common to leave sections unnamed,
but if it happens libbpf will fail to load the programs reporting
-EINVAL from the kernel.  The -EINVAL comes from bpf_obj_name_cpy()
because since 48cca7e44f9f ("libbpf: add support for bpf_call")
libbpf does not resolve program names for programs in '.text',
defaulting to '.text'.  '.text', however, does not pass the
(isalnum(*src) || *src == '_') check in bpf_obj_name_cpy().

With a few extra lines of code we can limit the pseudo call
assumptions only to objects which actually contain code relocations.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7bc02d93e277..e2401b95f08d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -234,6 +234,7 @@ struct bpf_object {
size_t nr_maps;
 
bool loaded;
+   bool has_pseudo_calls;
 
/*
 * Information when doing elf related work. Only valid if fd
@@ -400,10 +401,6 @@ bpf_object__init_prog_names(struct bpf_object *obj)
const char *name = NULL;
 
prog = >programs[pi];
-   if (prog->idx == obj->efile.text_shndx) {
-   name = ".text";
-   goto skip_search;
-   }
 
for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name;
 si++) {
@@ -426,12 +423,15 @@ bpf_object__init_prog_names(struct bpf_object *obj)
}
}
 
+   if (!name && prog->idx == obj->efile.text_shndx)
+   name = ".text";
+
if (!name) {
pr_warning("failed to find sym for prog %s\n",
   prog->section_name);
return -EINVAL;
}
-skip_search:
+
prog->name = strdup(name);
if (!prog->name) {
pr_warning("failed to allocate memory for prog sym 
%s\n",
@@ -981,6 +981,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, 
GElf_Shdr *shdr,
prog->reloc_desc[i].type = RELO_CALL;
prog->reloc_desc[i].insn_idx = insn_idx;
prog->reloc_desc[i].text_off = sym.st_value;
+   obj->has_pseudo_calls = true;
continue;
}
 
@@ -1426,6 +1427,12 @@ bpf_program__load(struct bpf_program *prog,
return err;
 }
 
+static bool bpf_program__is_function_storage(struct bpf_program *prog,
+					     struct bpf_object *obj)
+{
+   return prog->idx == obj->efile.text_shndx && obj->has_pseudo_calls;
+}
+
 static int
 bpf_object__load_progs(struct bpf_object *obj)
 {
@@ -1433,7 +1440,7 @@ bpf_object__load_progs(struct bpf_object *obj)
int err;
 
for (i = 0; i < obj->nr_programs; i++) {
-   if (obj->programs[i].idx == obj->efile.text_shndx)
+   if (bpf_program__is_function_storage(&obj->programs[i], obj))
continue;
err = bpf_program__load(&obj->programs[i],
obj->license,
@@ -2247,7 +2254,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
bpf_program__set_expected_attach_type(prog,
  expected_attach_type);
 
-   if (prog->idx != obj->efile.text_shndx && !first_prog)
+   if (!bpf_program__is_function_storage(prog, obj) && !first_prog)
first_prog = prog;
}
 
-- 
2.17.1



[PATCH bpf-next 5/8] tools: libbpf: don't return '.text' as a program for multi-function programs

2018-06-28 Thread Jakub Kicinski
Make bpf_program__next() skip over the '.text' section if the object
file has pseudo calls.  In that case the '.text' section is hardly a
program; it's more of a storage area for the code of functions other
than main.
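
As a hedged sketch of how a caller observes the change (assumes a
hypothetical object file 'prog.o' built from multi-function BPF C
code; error handling is simplified, real callers must handle
ERR_PTR-style errors from bpf_object__open()):

#include <stdio.h>
#include "libbpf.h"

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int n = 0;

	obj = bpf_object__open("prog.o");
	if (!obj)
		return 1;

	/* With this patch the iterator no longer yields the ".text"
	 * function-storage section for objects with pseudo calls.
	 */
	bpf_object__for_each_program(prog, obj)
		n++;

	printf("%d program(s), '.text' excluded\n", n);
	bpf_object__close(obj);
	return 0;
}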

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e2401b95f08d..38ed3e92e393 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1865,8 +1865,8 @@ void *bpf_object__priv(struct bpf_object *obj)
return obj ? obj->priv : ERR_PTR(-EINVAL);
 }
 
-struct bpf_program *
-bpf_program__next(struct bpf_program *prev, struct bpf_object *obj)
+static struct bpf_program *
+__bpf_program__next(struct bpf_program *prev, struct bpf_object *obj)
 {
size_t idx;
 
@@ -1887,6 +1887,18 @@ bpf_program__next(struct bpf_program *prev, struct bpf_object *obj)
return &obj->programs[idx];
 }
 
+struct bpf_program *
+bpf_program__next(struct bpf_program *prev, struct bpf_object *obj)
+{
+   struct bpf_program *prog = prev;
+
+   do {
+   prog = __bpf_program__next(prog, obj);
+   } while (prog && bpf_program__is_function_storage(prog, obj));
+
+   return prog;
+}
+
 int bpf_program__set_priv(struct bpf_program *prog, void *priv,
  bpf_program_clear_priv_t clear_priv)
 {
-- 
2.17.1



[net-next 02/12] net/mlx5e: Add UDP GSO remaining counter

2018-06-28 Thread Saeed Mahameed
From: Boris Pismenny 

This patch adds a counter for TX UDP GSO packets that contain a
segment which is not aligned to the MSS, i.e. a remaining segment.
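
A hedged illustration of when the new tx_udp_seg_rem counter
increments (the numbers are invented for the example): a UDP GSO
send whose payload is not a multiple of gso_size leaves a shorter
trailing segment, which the driver peels off into its own skb:

#include <stdio.h>

int main(void)
{
	unsigned int payload   = 10000; /* bytes in the GSO skb */
	unsigned int gso_size  = 1472;  /* segment (MSS-like) size */
	unsigned int remaining = payload % gso_size;

	/* 10000 % 1472 = 1168: one unaligned trailing segment, so
	 * the driver bumps tx_udp_seg_rem once for this skb.
	 */
	printf("remaining segment: %u bytes\n", remaining);
	return 0;
}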

Signed-off-by: Boris Pismenny 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c  | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h  | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
index 4bb1f3b12b96..7b7ec3998e84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
@@ -92,6 +92,7 @@ struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
if (!remaining)
return skb;
 
+   sq->stats->udp_seg_rem++;
nskb = alloc_skb(max_t(int, headlen, headlen + remaining - skb->data_len), GFP_ATOMIC);
if (unlikely(!nskb)) {
sq->stats->dropped++;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1646859974ce..7e7155b4e0f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -68,6 +68,7 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_recover) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_wake) },
+   { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_udp_seg_rem) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_cqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
@@ -159,6 +160,7 @@ void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
s->tx_added_vlan_packets += sq_stats->added_vlan_packets;
s->tx_queue_stopped += sq_stats->stopped;
s->tx_queue_wake += sq_stats->wake;
+   s->tx_udp_seg_rem   += sq_stats->udp_seg_rem;
s->tx_queue_dropped += sq_stats->dropped;
s->tx_cqe_err   += sq_stats->cqe_err;
s->tx_recover   += sq_stats->recover;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 643153bb3607..d416bb86e747 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -79,6 +79,7 @@ struct mlx5e_sw_stats {
u64 tx_xmit_more;
u64 tx_recover;
u64 tx_queue_wake;
+   u64 tx_udp_seg_rem;
u64 tx_cqe_err;
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
@@ -196,6 +197,7 @@ struct mlx5e_sq_stats {
u64 csum_partial_inner;
u64 added_vlan_packets;
u64 nop;
+   u64 udp_seg_rem;
 #ifdef CONFIG_MLX5_EN_TLS
u64 tls_ooo;
u64 tls_resync_bytes;
-- 
2.17.0


