Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)

2018-04-18 Thread Andrew Lunn
On Wed, Apr 18, 2018 at 05:10:45PM +1200, Michael Schmitz wrote:
> All,
> 
> just noticed belatedly that the Makefile hunk of patch 9 no longer
> applies cleanly in 4.17-rc1, sorry. My series was based on 4.16.
> I'll resend that one, OK?

Hi Michael

You should be based on DaveM's net-next tree:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

Please also have "net-next" in the patch subject. See
Documentation/networking/netdev-FAQ.txt

Andrew


Re: [RFC PATCH] net: bridge: multicast querier per VLAN support

2018-04-18 Thread Joachim Nilsson
On Wed, Apr 18, 2018 at 04:14:26PM +0300, Nikolay Aleksandrov wrote:
> We want to avoid sysfs in general, all of networking config and stats
> are moving to netlink. It is better controlled and structured for such
> changes, and also provides nice interfaces for automatic type checks etc.

Aha, didn't know that. Thanks! :)

> Also (but a minor reason) there is no tree/entity in sysfs for the vlans
> where to add this. It will either have to be a file which does some
> format string hack (like us currently) or will need to add new tree for
> them which I'd really like to avoid for the bridge.

Yup, I did some ugly sysfs patches to read queriers per VLAN like that, just
for some basic feedback.  Really awful, although easy to debug because of it
being a simple file ... (I guess I'll have to make friends with Netlink.)

> [..]
> Also after my vlan rhashtable change, we have per-vlan context even today
> (e.g. per-vlan stats use it) so we'll just extend that.

Interesting, this I'll have to look at in more detail!


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Sowmini Varadhan
On (04/18/18 06:35), Eric Dumazet wrote:
> 
> There is no change at all.
> 
> This will only be used as a mechanism to send X packets of same size.
> 
> So instead of X system calls , one system call.
> 
> One traversal of some expensive part of the host stack.
> 
> The content on the wire should be the same.

I'm sorry, that's not how I interpret Willem's email below
(and maybe I misunderstood).

the following taken from https://www.spinics.net/lists/netdev/msg496150.html

Sowmini> If yes, how will the recvmsg differentiate between the case
Sowmini> (2000 byte message followed by 512 byte message) and
Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
Sowmini> in other words, how are UDP message boundary semantics preserved?

Willem> They aren't. This is purely an optimization to amortize the cost of
Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
Willem> boundaries of the original larger than MTU datagram.

As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
- classic UDP will send 2 IP fragments, the first one with a full UDP
  header, and the IP header indicating that this is the first frag for
  that ipid, with more frags to follow. The second frag will have the
  rest with the same ipid, it will not have a udp header,
  and it will indicate that it is the last frag (no more frags).

  The receiver can thus use the ipid, "more-frags" bit, frag offset etc
  to stitch the 2000 byte udp message together and pass it up on the udp
  socket.

- in the "GSO" proposal my 2000  bytes of data are sent as *two*
  udp packets, each of them with a unique udp header, and uh_len set
  to 1476 (for first) and 526 (for second). The receiver has no clue
  that they are both part of the same UDP datagram. So the wire format
  is not the same, am I mistaken?

--Sowmini




Re: [PATCH net-next] team: account for oper state

2018-04-18 Thread George Wilkie
On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwil...@vyatta.att-mail.com wrote:
> >Account for operational state when determining port linkup state,
> >as per Documentation/networking/operstates.txt.
> 
> Could you please point me to the exact place in the document where this
> is suggested?
> 

Various places cover it I think.

In 1. Introduction:
"interface is not usable just because the admin enabled it"
"userspace must be granted the possibility to
influence operational state"

In 4. Setting from userspace:
"the userspace application can set IFLA_OPERSTATE
to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
netif_carrier_off() or netif_dormant_on()"

We have a use case where we want to set the oper state of the team ports based
on whether they are actually usable or not (as opposed to just admin up).
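
For reference, a minimal userspace sketch of what operstates.txt describes
(rtnetlink RTM_SETLINK carrying IFLA_OPERSTATE; error handling omitted, not
part of the patch):

#include <linux/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Mark one link IF_OPER_UP from userspace. */
static int set_oper_up(int ifindex)
{
	struct {
		struct nlmsghdr nh;
		struct ifinfomsg ifi;
		char attrbuf[RTA_SPACE(sizeof(__u8))];
	} req;
	struct rtattr *rta;
	__u8 state = IF_OPER_UP;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&req, 0, sizeof(req));
	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifi));
	req.nh.nlmsg_type = RTM_SETLINK;
	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req.ifi.ifi_family = AF_UNSPEC;
	req.ifi.ifi_index = ifindex;
	/* append IFLA_OPERSTATE (one u8) after the ifinfomsg */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
	rta->rta_type = IFLA_OPERSTATE;
	rta->rta_len = RTA_LENGTH(sizeof(state));
	memcpy(RTA_DATA(rta), &state, sizeof(state));
	req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_LENGTH(sizeof(state));
	send(fd, &req, req.nh.nlmsg_len, 0);
	close(fd);
	return 0;
}

As operstates.txt notes, the kernel accepts this only as long as the driver
has not set netif_carrier_off() or netif_dormant_on().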

Cheers.

> 
> >
> >Signed-off-by: George Wilkie 
> >---
> > drivers/net/team/team.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> >index a6c6ce19..231264a05e55 100644
> >--- a/drivers/net/team/team.c
> >+++ b/drivers/net/team/team.c
> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> > case NETDEV_CHANGE:
> > if (netif_running(port->dev))
> > team_port_change_check(port,
> >-   !!netif_carrier_ok(port->dev));
> >+   !!(netif_carrier_ok(port->dev) &&
> >+  netif_oper_up(port->dev)));
> > break;
> > case NETDEV_UNREGISTER:
> > team_del_slave(port->team->dev, dev);
> >-- 
> >2.11.0
> >


Re: [PATCH bpf-next v4 07/10] bpf: btf: Add pretty print support to the basic arraymap

2018-04-18 Thread Daniel Borkmann
Hi Martin,

first of all, great work on the set! One issue that puzzled me
while digesting it is discussed further below.

On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> This patch adds pretty print support to the basic arraymap.
> Support for other bpf maps can be added later.
> 
> This patch adds new attrs to the BPF_MAP_CREATE command to allow
> specifying the btf_fd, btf_key_id and btf_value_id.  The
> BPF_MAP_CREATE can then associate the btf to the map if
> the creating map supports BTF.

Feels like this patch is doing two things at once, meaning i)
attaching btf object to map object through bpf syscall at map
creation time, and ...

> A BTF supported map needs to implement two new map ops,
> map_seq_show_elem() and map_check_btf().  This patch has
> implemented these new map ops for the basic arraymap.
> 
> It also adds file_operations to the pinned map
> such that the pinned map can be opened and read.

... ii) pretty print map dump via bpf fs for array map.

> Here is a sample output when reading a pinned arraymap
> with the following map's value:
> 
> struct map_value {
>   int count_a;
>   int count_b;
> };
> 
> cat /sys/fs/bpf/pinned_array_map:
> 
> 0: {1,2}
> 1: {3,4}
> 2: {5,6}
> ...

Rather than adding this to the bpf fs itself, why not add full BTF
support for the main debugging and introspection tool we have and
ship with the kernel for BPF, namely bpftool? You can already dump
the whole map's key/value pairs via the following command from a
pinned file:

  bpftool map dump pinned /sys/fs/bpf/pinned_array_map

And given we already export the BTF info in your earlier patch through
BPF_OBJ_GET_INFO_BY_FD, this would fit perfectly for bpftool integration
instead, where the pretty-print done through the extra map_seq_show_elem
callback (which only does a map lookup and print anyway) in this patch can
simply all be done in user space w/o any additional kernel complexity.

Aside from the fact that this would be very valuable there, it would also
nicely demonstrate usage of it. But more importantly, we could avoid
implementing such a pretty-print callback in the kernel for every other map
type and then having two locations where a user now needs to go for debugging
(bpftool being one, and cat of a pinned file the other; this split seems
confusing from a user perspective, imho, but also single key lookup +
pretty-print cannot be realized with the latter, whereas it's trivial
with bpftool).

The same could be done for bpftool map lookup, updates, deletions, etc
where the key resp. key/value pair can be specified through a struct-like
initializer from the cmdline. (But dump/lookup should be a good enough
starting point initially.) Thoughts?

Thanks again,
Daniel

> Signed-off-by: Martin KaFai Lau 
> Acked-by: Alexei Starovoitov 
> ---
>  include/linux/bpf.h  |  20 ++-
>  include/uapi/linux/bpf.h |   3 +
>  kernel/bpf/arraymap.c|  50 
>  kernel/bpf/inode.c   | 146 ++-
>  kernel/bpf/syscall.c |  32 ++-
>  5 files changed, 244 insertions(+), 7 deletions(-)



Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)

2018-04-18 Thread Jesper Dangaard Brouer
On Wed, 18 Apr 2018 15:09:41 +0100
Quentin Monnet  wrote:

> 2018-04-18 15:34 UTC+0200 ~ Jesper Dangaard Brouer 
> > On Tue, 17 Apr 2018 15:34:38 +0100
> > Quentin Monnet  wrote:
> >   
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 350459c583de..3d329538498f 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -1276,6 +1276,50 @@ union bpf_attr {
> >>   *Return
> >>   *0 on success, or a negative error in case of failure.
> >>   *
> >> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
> >> + *Description
> >> + *Redirect the packet to the endpoint referenced by *map* at
> >> + *index *key*. Depending on its type, his *map* can contain
> > ^^^
> > 
> > "his" -> "this"  
> 
> Thanks!
> 
> >> + *references to net devices (for forwarding packets through other
> >> + *ports), or to CPUs (for redirecting XDP frames to another CPU;
> >> + *but this is only implemented for native XDP (with driver
> >> + *support) as of this writing).
> >> + *
> >> + *All values for *flags* are reserved for future usage, and must
> >> + *be left at zero.
> >> + *Return
> >> + ***XDP_REDIRECT** on success, or **XDP_ABORT** on error.
> >> + *  
> > 
> > "XDP_ABORT" -> "XDP_ABORTED"  
> 
> Ouch. And I did the same for bpf_redirect(). Thanks for the catch.
> 
> > 
> > I don't know if it's worth mentioning in the doc/man-page; that for XDP
> > using bpf_redirect_map() is a HUGE performance advantage, compared to
> > the bpf_redirect() call ?  
> 
> It seems worth it to me. How would you simply explain the reason for this
> difference?

The basic reason is the "bulking effect", as devmap avoids the NIC
tailptr/doorbell update on every packet... how to write that in a doc
format?

I wrote about why XDP_REDIRECT with maps are smart here:
 http://vger.kernel.org/netconf2017_files/XDP_devel_update_NetConf2017_Seoul.pdf

Using maps for redirect hopefully makes XDP_REDIRECT the last driver
XDP action code we need, as new types of redirect can be introduced
without driver changes. Note that AF_XDP also uses a map.

It is more subtle, but maps also function as a sorting step. Imagine
your XDP program needs to redirect out different interfaces (or CPUs in
the cpumap case), and packets arrive intermixed.  Packets get sorted into
the different map indexes, and xdp_do_flush_map() will trigger the
flush operation.
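
To make that concrete, a minimal devmap program in the 2018 samples style
(map name and slot choice are illustrative; bpf_helpers.h as shipped with
the kernel samples/selftests):

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* each slot holds an egress ifindex, populated from userspace */
struct bpf_map_def SEC("maps") tx_port = {
	.type = BPF_MAP_TYPE_DEVMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 64,
};

SEC("xdp")
int xdp_redirect_map_prog(struct xdp_md *ctx)
{
	int key = 0;	/* real programs sort packets into different slots */

	/* packets queued into map slots are flushed as a batch, so the
	 * NIC tailptr/doorbell is not touched for every single packet */
	return bpf_redirect_map(&tx_port, key, 0);
}

char _license[] SEC("license") = "GPL";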


I happened to have an i40e NIC benchmark setup, and ran a single-flow pktgen test.

Results with 'xdp_redirect_map'
 13589297 pps (13,589,297) 

Results with 'xdp_redirect' NOT using devmap:
  7567575 pps (7,567,575)

Just to point out the performance benefit of devmap...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 0/5] virtio-net: Add SCTP checksum offload support

2018-04-18 Thread Michael S. Tsirkin
On Tue, Apr 17, 2018 at 04:35:18PM -0400, Vlad Yasevich wrote:
> On 04/02/2018 10:47 AM, Marcelo Ricardo Leitner wrote:
> > On Mon, Apr 02, 2018 at 09:40:01AM -0400, Vladislav Yasevich wrote:
> >> Now that we have SCTP offload capabilities in the kernel, we can add
> >> them to virtio as well.  First step is SCTP checksum.
> > 
> > Thanks.
> > 
> >> As for GSO, the way sctp GSO is currently implemented buys us nothing
> >> in added support to virtio.  To add true GSO, would require a lot of
> >> re-work inside of SCTP and would require extensions to the virtio
> >> net header to carry extra sctp data.
> > 
> > Can you please elaborate more on this? Is this because SCTP GSO relies
> > on the gso skb format for knowing how to segment it instead of having
> > a list of sizes?
> > 
> 
> it's mainly because all the true segmentation, placing data into chunks,
> has already happened.  All that GSO does is allow for a higher bundling
> rate between VMs. If that is all SCTP GSO is ever going to do, that's fine,
> but the goal is to do real GSO eventually and potentially reduce the
> amount of memory copying we are doing.
> If we do that, any current attempt at GSO in virtio would have to be
> deprecated and we'd need GSO2 or something like that.

Batching helps virtualization *a lot* though.
Are there actual plans for GSO2? Is it just for SCTP?

> 
> This is why, after doing the GSO support, I decided not to include it.
> 
> -vlad
> >   Marcelo
> > 


Re: [PATCH net-next] team: account for oper state

2018-04-18 Thread Jiri Pirko
Wed, Apr 18, 2018 at 03:35:49PM CEST, gwil...@vyatta.att-mail.com wrote:
>On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwil...@vyatta.att-mail.com wrote:
>> >Account for operational state when determining port linkup state,
>> >as per Documentation/networking/operstates.txt.
>> 
>> Could you please point me to the exact place in the document where this
>> is suggested?
>> 
>
>Various places cover it I think.
>
>In 1. Introduction:
>"interface is not usable just because the admin enabled it"
>"userspace must be granted the possibility to
>influence operational state"
>
>In 4. Setting from userspace:
>"the userspace application can set IFLA_OPERSTATE
>to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
>netif_carrier_off() or netif_dormant_on()"
>
>We have a use case where we want to set the oper state of the team ports based
>on whether they are actually usable or not (as opposed to just admin up).

Are you running a supplicant there or what is the use-case?

How is this handled in other drivers like bond, openvswitch, bridge, etc?

>
>Cheers.
>
>> 
>> >
>> >Signed-off-by: George Wilkie 
>> >---
>> > drivers/net/team/team.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> >index a6c6ce19..231264a05e55 100644
>> >--- a/drivers/net/team/team.c
>> >+++ b/drivers/net/team/team.c
>> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
>> >case NETDEV_CHANGE:
>> >if (netif_running(port->dev))
>> >team_port_change_check(port,
>> >-  !!netif_carrier_ok(port->dev));
>> >+  !!(netif_carrier_ok(port->dev) &&
>> >+ netif_oper_up(port->dev)));
>> >break;
>> >case NETDEV_UNREGISTER:
>> >team_del_slave(port->team->dev, dev);
>> >-- 
>> >2.11.0
>> >


Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)

2018-04-18 Thread Quentin Monnet
2018-04-18 15:34 UTC+0200 ~ Jesper Dangaard Brouer 
> On Tue, 17 Apr 2018 15:34:38 +0100
> Quentin Monnet  wrote:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 350459c583de..3d329538498f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1276,6 +1276,50 @@ union bpf_attr {
>>   *  Return
>>   *  0 on success, or a negative error in case of failure.
>>   *
>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>> + *  Description
>> + *  Redirect the packet to the endpoint referenced by *map* at
>> + *  index *key*. Depending on its type, his *map* can contain
> ^^^
> 
> "his" -> "this"

Thanks!

>> + *  references to net devices (for forwarding packets through other
>> + *  ports), or to CPUs (for redirecting XDP frames to another CPU;
>> + *  but this is only implemented for native XDP (with driver
>> + *  support) as of this writing).
>> + *
>> + *  All values for *flags* are reserved for future usage, and must
>> + *  be left at zero.
>> + *  Return
>> + *  **XDP_REDIRECT** on success, or **XDP_ABORT** on error.
>> + *
> 
> "XDP_ABORT" -> "XDP_ABORTED"

Ouch. And I did the same for bpf_redirect(). Thanks for the catch.

> 
> I don't know if it's worth mentioning in the doc/man-page; that for XDP
> using bpf_redirect_map() is a HUGE performance advantage, compared to
> the bpf_redirect() call ?

It seems worth it to me. How would you simply explain the reason for this
difference?

Quentin


Re: SRIOV switchdev mode BoF minutes

2018-04-18 Thread Andy Gospodarek
On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote:
> On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> > There is also a school of thought that the VF reps could be
> > pre-allocated on the SmartNIC so that any application processing that
> > traffic would sit idle when no traffic arrives on the rep, but could
> > process frames that do arrive when the VFs were created on the host.
> > This implementation will depend on how resources are allocated on a
> > given bit of hardware, but can really work well.
> 
> +1. If there are no FW resource allocation issues, IMHO it's okay to
> just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
> controller.  The reprs should just show link down as if the PCIe cable
> was unplugged until the host actually enables them.

Yes we are on the same page on this.

> A similar issue exists on multi-host for PFs, right?  If one of the
> hosts is down do we still show their PF repr?  IMHO yes.

I would agree with that as well.  With today's model the VF reps are
created once a PF is put into switchdev mode, but I'm still working out
how we want to consider whether or not a PF rep for the other domains is
created locally or not and also how one can determine which domain is in
control.

Permanent config options (like NVRAM settings) could easily handle which
domain is in control, but that still does not mean that PF reps must be
created automatically, does it?

> That makes the thing look more like a switch with cables being plugged
> in and out.

Yes, that's exactly how I view it as well.


[PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks

2018-04-18 Thread Christian Brauner
Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the initial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
processes that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of
lib/kobject_uevent.c, so the mutex *can* and *is* only ever accessed in
lib/kobject_uevent.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.

Signed-off-by: Christian Brauner 
---
 include/linux/kobject.h |   3 --
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c |   3 +-
 lib/kobject_uevent.c| 100 
 net/core/net_namespace.c|  13 +
 5 files changed, 98 insertions(+), 24 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..776391aea247 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,9 +36,6 @@
 extern char uevent_helper[];
 #endif
 
-/* counter to tag the uevent, read only except for the kobject core */
-extern u64 uevent_seqnum;
-
 /*
  * The actions here must match the index to the string array
  * in lib/kobject_uevent.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
struct sock *genl_sock;
 
struct uevent_sock  *uevent_sock;   /* uevent socket */
+   /* counter to tag the uevent, read only except for the kobject core */
+   u64 uevent_seqnum;
 
struct list_headdev_base_head;
struct hlist_head   *dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..83264edcecda 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include /* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,7 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf)
 {
-   return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+   return sprintf(buf, "%llu\n", (unsigned long long)get_ns_uevent_seqnum_by_vpid());
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..796fd502c227 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,38 @@
 #include 
 
 
-u64 uevent_seqnum;
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored 

[PATCH net-next 1/2] netns: restrict uevents

2018-04-18 Thread Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discussed on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that this is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So every time a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc.) until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed off to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace, a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents (see the sketch below).
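
A minimal sketch of the recording rule described above (illustrative only,
not the literal patch hunk):

#include <net/net_namespace.h>
#include <linux/user_namespace.h>

/* record a uevent socket in the global list iff its network namespace
 * is owned by the initial user namespace */
static bool uevent_sock_in_global_list(struct net *net)
{
	return net->user_ns == &init_user_ns;
}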

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner 
---
 lib/kobject_uevent.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
net->uevent_sock = ue_sk;
 
-   mutex_lock(&uevent_sock_mutex);
-   list_add_tail(&ue_sk->list, 

[PATCH net-next 0/2] netns: uevent performance tweaks

2018-04-18 Thread Christian Brauner
Hey,

This series deals with a bunch of performance improvements when sending out
uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  **A very detailed rationale is present in the commit message for
[PATCH 1/2] netns: restrict uevents**
- Decouple the locking for network namespaces in the global uevent socket
  list from the locking for network namespaces not in the global uevent
  socket list.
  **A very detailed rationale is present in the commit message
[PATCH 2/2] netns: isolate seqnums to use per-netns locks**

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h |   3 -
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c |   3 +-
 lib/kobject_uevent.c| 118 
 net/core/net_namespace.c|  13 
 5 files changed, 110 insertions(+), 30 deletions(-)

-- 
2.17.0



Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Willem de Bruijn
On Wed, Apr 18, 2018 at 9:59 AM, Willem de Bruijn
 wrote:
>> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
>> just be automatically determined in the stack from the pmtu? Whats
>> the motivation for the socket option for this? also AIUI this can be
>> either a per-socket or a per-packet option?

I forgot to respond to the last point: yes, it is set either via a socket
option or passed as a cmsg for a given send call.

Especially when using unconnected sockets to communicate with many
clients, it is likely that this value will vary per call.
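
For concreteness, a minimal sketch of both variants (UDP_SEGMENT as proposed
in this RFC series; the #define fallback value is an assumption for headers
that do not carry it yet):

#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103		/* assumed option number from the series */
#endif

/* per-socket: every following send is split into gso_size datagrams */
static int set_udp_segment(int fd, uint16_t gso_size)
{
	return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
			  &gso_size, sizeof(gso_size));
}

/* per-call: pass the segment size as a cmsg on one sendmsg() */
static ssize_t send_udp_gso(int fd, void *buf, size_t len, uint16_t gso_size)
{
	union {
		char buf[CMSG_SPACE(sizeof(uint16_t))];
		struct cmsghdr align;
	} control = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control.buf,
		.msg_controllen = sizeof(control.buf),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = IPPROTO_UDP;	/* == SOL_UDP */
	cm->cmsg_type = UDP_SEGMENT;
	cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
	memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));
	return sendmsg(fd, &msg, 0);	/* connected socket assumed */
}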


Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-18 Thread Eric W. Biederman
Rahul Lakkireddy  writes:

> On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> Hi Rahul,
>> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> > On production servers running variety of workloads over time, kernel
>> > panic can happen sporadically after days or even months. It is
>> > important to collect as much debug logs as possible to root cause
>> > and fix the problem, that may not be easy to reproduce. Snapshot of
>> > underlying hardware/firmware state (like register dump, firmware
>> > logs, adapter memory, etc.), at the time of kernel panic will be very
>> > helpful while debugging the culprit device driver.
>> > 
>> > This series of patches add new generic framework that enable device
>> > drivers to collect device specific snapshot of the hardware/firmware
>> > state of the underlying device in the crash recovery kernel. In crash
>> > recovery kernel, the collected logs are added as elf notes to
>> > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> > 
>> > The sequence of actions done by device drivers to append their device
>> > specific hardware/firmware logs to /proc/vmcore are as follows:
>> > 
>> > 1. During probe (before hardware is initialized), device drivers
>> > register to the vmcore module (via vmcore_add_device_dump()), with
>> > callback function, along with buffer size and log name needed for
>> > firmware/hardware log collection.
>> 
>> I assumed the elf notes info should be prepared during the kexec_[file_]load
>> phase. But I did not read the old comment, so I'm not sure if it has been
>> discussed or not.
>> 
>
> We must not collect dumps in the crashing kernel. Adding more things in
> crash dump path risks not collecting vmcore at all. Eric had
> discussed this in more detail at:
>
> https://lkml.org/lkml/2018/3/24/319
>
> We are safe to collect dumps in the second kernel. Each device dump
> will be exported as an elf note in /proc/vmcore.

It just occurred to me there is one variation that is worth
considering.

Is the area you are looking at dumping part of a huge mmio area?
I think someone said 2GB?

If that is the case it could be worth it to simply add the needed
addresses to the range of memory we need to dump, and simply have an
elf note saying that is what happened.

>> If we do this in the 2nd kernel, a question is that the driver can be
>> loaded later than vmcore init.
>
> Yes, drivers will add their device dumps after vmcore init.
>
>> How to guarantee the function works if vmcore reading happens before
>> the driver is loaded?
>> 
>> Also it is possible that kdump initramfs does not contains the driver
>> module.
>> 
>> Am I missing something?
>> 
>
> Yes, the driver must be in the initramfs if it wants to collect and add a
> device dump to /proc/vmcore in the second kernel.

Eric


Re: [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-18 Thread Herbert Xu
On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
> grow_decision and shrink_decision no longer exist, so remove
> the remaining references to them.
> 
> Signed-off-by: NeilBrown 

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-18 Thread Herbert Xu
On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
> Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, so
> remove the comments which suggest that they do.
> 
> Signed-off-by: NeilBrown 
> ---
>  include/linux/rhashtable.h |3 ---
>  lib/rhashtable.c   |3 ---
>  2 files changed, 6 deletions(-)
> 
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 87d443a5b11d..b01d88e196c2 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
>   * For a completely stable walk you should construct your own data
>   * structure outside the hash table.
>   *
> - * This function may sleep so you must not call it from interrupt
> - * context or with spin locks held.

It does a naked spin lock, so even though we removed the memory
allocation you still mustn't call it from interrupt context.

Why do you need to do that anyway?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] net: don't use kvzalloc for DMA memory

2018-04-18 Thread Mikulas Patocka
The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
fails (later patches change it to kvzalloc).

The problem with this is that if the vzalloc function is actually used,
virtio_net doesn't work (because it expects the extra memory to be
accessible with the DMA API, and memory allocated with vzalloc isn't).

This patch changes it back to kzalloc and adds a warning if the allocated
size is too large (the allocation is unreliable in this case).

Signed-off-by: Mikulas Patocka 
Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")

---
 net/core/dev.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/net/core/dev.c
===
--- linux-2.6.orig/net/core/dev.c   2018-04-16 21:08:36.0 +0200
+++ linux-2.6/net/core/dev.c2018-04-18 16:24:43.0 +0200
@@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
/* ensure 32-byte alignment of whole construct */
alloc_size += NETDEV_ALIGN - 1;
 
-   p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+   WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
+   p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!p)
return NULL;
 


Re: [PATCH] net: qmi_wwan: add Wistron Neweb D19Q1

2018-04-18 Thread Bjørn Mork
Pawel Dembicki  writes:

> This modem is embedded on dlink dwr-960 router.
> The oem configuration states:
>
> T: Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 2 Spd=480 MxCh= 0
> D: Ver= 2.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
> P: Vendor=1435 ProdID=d191 Rev=ff.ff
> S: Manufacturer=Android
> S: Product=Android
> S: SerialNumber=0123456789ABCDEF
> C:* #Ifs= 6 Cfg#= 1 Atr=80 MxPwr=500mA
> I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
> E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
> E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=(none)
> E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
> E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=(none)
> E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
> E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> I:* If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
> E: Ad=88(I) Atr=03(Int.) MxPS= 8 Ivl=32ms
> E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=05(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> I:* If#= 5 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=(none)
> E: Ad=89(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
> E: Ad=06(O) Atr=02(Bulk) MxPS= 512 Ivl=125us
>
> Tested on openwrt distribution
>
> Signed-off-by: Pawel Dembicki 

Acked-by: Bjørn Mork 


Re: [PATCH net-next 2/2] openvswitch: Support conntrack zone limit

2018-04-18 Thread Gregory Rose

On 4/17/2018 5:30 PM, Yi-Hung Wei wrote:

s/to commit/from committing/
s/entry/entries/

Thanks, will fix that in both patches in v2.



I think this is a great idea but I suggest porting it to the iproute2 package
so everyone can use it.  Then get rid of the OVS specific prefixes.
Presuming of course that the conntrack connection limit backend works there
as well, I guess.  If it doesn't, then I'd suggest extending it.  This is a
nice feature for all users in my opinion and then OVS can take advantage of
it as well.

Thanks for the comment.  And yes, I think currently iptables'
connlimit extension does support limiting the # of connections.  Users
need to configure the zone properly, and the iptables connlimit
extension is using netfilter's nf_conncount backend already.

The main goal for this patch is to utilize netfilter backend
(nf_conncount) to count and limit the number of connections. OVS needs
the proposed OVS_CT_LIMIT netlink API and the corresponding bookkeeping
data structure because the current nf_conncount backend only counts
the # of connections, but it does not keep track of the connection
limit in nf_conncount.

Thanks,

-Yi-Hung


Thanks Yi-hung, I figured I was just missing something there.  I 
appreciate the explanation.


- Greg


Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-18 Thread Rahul Lakkireddy
On Wednesday, April 04/18/18, 2018 at 19:58:01 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy  writes:
> 
> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> Hi Rahul,
> >> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > On production servers running variety of workloads over time, kernel
> >> > panic can happen sporadically after days or even months. It is
> >> > important to collect as much debug logs as possible to root cause
> >> > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> > underlying hardware/firmware state (like register dump, firmware
> >> > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> > helpful while debugging the culprit device driver.
> >> > 
> >> > This series of patches add new generic framework that enable device
> >> > drivers to collect device specific snapshot of the hardware/firmware
> >> > state of the underlying device in the crash recovery kernel. In crash
> >> > recovery kernel, the collected logs are added as elf notes to
> >> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> > 
> >> > The sequence of actions done by device drivers to append their device
> >> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> > 
> >> > 1. During probe (before hardware is initialized), device drivers
> >> > register to the vmcore module (via vmcore_add_device_dump()), with
> >> > callback function, along with buffer size and log name needed for
> >> > firmware/hardware log collection.
> >> 
> >> I assumed the elf notes info should be prepared during the kexec_[file_]load
> >> phase. But I did not read the old comment, so I'm not sure if it has been
> >> discussed or not.
> >> 
> >
> > We must not collect dumps in the crashing kernel. Adding more things in
> > crash dump path risks not collecting vmcore at all. Eric had
> > discussed this in more detail at:
> >
> > https://lkml.org/lkml/2018/3/24/319
> >
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an elf note in /proc/vmcore.
> 
> It just occurred to me there is one variation that is worth
> considering.
> 
> Is the area you are looking at dumping part of a huge mmio area?
> I think someone said 2GB?
> 
> If that is the case it could be worth it to simply add the needed
> addresses to the range of memory we need to dump, and simply have an
> elf note saying that is what happened.
> 

We are _not_ dumping mmio area. However, one part of the dump
collection involves reading 2 GB on-chip memory via PIO access,
which is compressed and stored.
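
For illustration, the registration step in (1) above looks roughly like
this on the driver side (a sketch only; the vmcoredd_data layout and field
names here are assumptions and other details are illustrative):

#include <linux/crash_dump.h>	/* vmcore_add_device_dump() from this series */

/* hypothetical driver callback, run in the crash recovery kernel */
static int mydrv_collect_dump(struct vmcoredd_data *data, void *buf)
{
	/* snapshot registers, firmware logs, on-chip memory ... into
	 * buf, at most data->size bytes */
	return 0;
}

static struct vmcoredd_data mydrv_dump = {
	.dump_name = "mydrv_fw_log",
	.size = 2 * 1024 * 1024,
	.vmcoredd_callback = mydrv_collect_dump,
};

/* called during probe, before the hardware is initialized */
static void mydrv_register_dump(void)
{
	vmcore_add_device_dump(&mydrv_dump);
}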

Thanks,
Rahul


Re: [PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-18 Thread Herbert Xu
On Wed, Apr 18, 2018 at 04:47:02PM +1000, NeilBrown wrote:
> rhashtable_walk_prev() returns the object returned by
> the previous rhashtable_walk_next(), provided it is still in the
> table (or was during this grace period).
> This works even if rhashtable_walk_stop() and rhashtable_walk_start()
> have been called since the last rhashtable_walk_next().
> 
> If there have been no calls to rhashtable_walk_next(), or if the
> object is gone from the table, then NULL is returned.
> 
> This can usefully be used in a seq_file ->start() function.
> If the pos is the same as was returned by the last ->next() call,
> then rhashtable_walk_prev() can be used to re-establish the
> current location in the table.  If it returns NULL, then
> rhashtable_walk_next() should be used.
> 
> Signed-off-by: NeilBrown 

Can you explain the need for this function and its difference
from the existing rhashtable_walk_peek?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
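
For context, the seq_file ->start() usage NeilBrown's commit message
describes might look roughly like this (a sketch; struct my_iter and its
fields are hypothetical):

#include <linux/rhashtable.h>
#include <linux/seq_file.h>

/* hypothetical iterator state kept in seq_file->private */
struct my_iter {
	struct rhashtable_iter hti;
	loff_t last_pos;
};

static void *my_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct my_iter *it = seq->private;

	rhashtable_walk_start(&it->hti);
	if (*pos && *pos == it->last_pos) {
		/* try to re-establish the previous position */
		void *obj = rhashtable_walk_prev(&it->hti);

		if (obj)
			return obj;
	}
	/* previous object gone (or first call): fall forward */
	return rhashtable_walk_next(&it->hti);
}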


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Samudrala, Sridhar

On 4/18/2018 6:51 AM, Willem de Bruijn wrote:

On Wed, Apr 18, 2018 at 9:47 AM, Sowmini Varadhan
 wrote:

On (04/18/18 06:35), Eric Dumazet wrote:

There is no change at all.

This will only be used as a mechanism to send X packets of same size.

So instead of X system calls , one system call.

One traversal of some expensive part of the host stack.

The content on the wire should be the same.

I'm sorry that's not how I interpret Willem's email below
(and maybe I misunderstood)

the following taken from https://www.spinics.net/lists/netdev/msg496150.html

Sowmini> If yes, how will the recvmsg differentiate between the case
Sowmini> (2000 byte message followed by 512 byte message) and
Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
Sowmini> in other words, how are UDP message boundary semantics preserved?

Willem> They aren't. This is purely an optimization to amortize the cost of
Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
Willem> boundaries of the original larger than MTU datagram.

As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
- classic UDP will send 2 IP fragments, the first one with a full UDP
   header, and the IP header indicating that this is the first frag for
   that ipid, with more frags to follow. The second frag will have the
   rest with the same ipid, it will not have a udp header,
   and it will indicate that it is the last frag (no more frags).

   The receiver can thus use the ipid, "more-frags" bit, frag offset etc
   to stitch the 2000 byte udp message together and pass it up on the udp
   socket.

- in the "GSO" proposal my 2000  bytes of data are sent as *two*
   udp packets, each of them with a unique udp header, and uh_len set
   to 1476 (for first) and 526 (for second). The receiver has no clue
   that they are both part of the same UDP datagram. So the wire format
   is not the same, am I mistaken?

Eric is correct. If the application sets a segment size with UDP_SEGMENT
this is an instruction to the kernel to split the payload along that border into
separate discrete datagrams.


OK. So the sender app is passing the message boundary info to the kernel via
the socket option and letting the kernel split the large payload into
multiple UDP segments.




It does not matter what the behavior is without setting this option. If a
process wants to send a larger than MTU datagram and rely on the
kernel to fragment, then it should not set the option.




Re: [PATCH net-next] team: account for oper state

2018-04-18 Thread George Wilkie
On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 03:35:49PM CEST, gwil...@vyatta.att-mail.com wrote:
> >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
> >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwil...@vyatta.att-mail.com wrote:
> >> >Account for operational state when determining port linkup state,
> >> >as per Documentation/networking/operstates.txt.
> >> 
> >> Could you please point me to the exact place in the document where this
> >> is suggested?
> >> 
> >
> >Various places cover it I think.
> >
> >In 1. Introduction:
> >"interface is not usable just because the admin enabled it"
> >"userspace must be granted the possibility to
> >influence operational state"
> >
> >In 4. Setting from userspace:
> >"the userspace application can set IFLA_OPERSTATE
> >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
> >netif_carrier_off() or netif_dormant_on()"
> >
> >We have a use case where we want to set the oper state of the team ports
> >based on whether they are actually usable or not (as opposed to just admin up).
> 
> Are you running a supplicant there or what is the use-case?
> 

We are using tun/tap interfaces for the team ports with the physical interfaces
under the control of a user process.

> How is this handled in other drivers like bond, openvswitch, bridge, etc?

It looks like bridge is using it, looking at br_port_carrier_check() and
br_add_if().

Cheers.

> 
> >
> >Cheers.
> >
> >> 
> >> >
> >> >Signed-off-by: George Wilkie 
> >> >---
> >> > drivers/net/team/team.c | 3 ++-
> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> >> >index a6c6ce19..231264a05e55 100644
> >> >--- a/drivers/net/team/team.c
> >> >+++ b/drivers/net/team/team.c
> >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> >> >  case NETDEV_CHANGE:
> >> >  if (netif_running(port->dev))
> >> >  team_port_change_check(port,
> >> >-!!netif_carrier_ok(port->dev));
> >> >+!!(netif_carrier_ok(port->dev) &&
> >> >+   netif_oper_up(port->dev)));
> >> >  break;
> >> >  case NETDEV_UNREGISTER:
> >> >  team_del_slave(port->team->dev, dev);
> >> >-- 
> >> >2.11.0
> >> >


Re: [PATCH net-next] team: account for oper state

2018-04-18 Thread George Wilkie
On Wed, Apr 18, 2018 at 04:33:12PM +0100, George Wilkie wrote:
> On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
> > Wed, Apr 18, 2018 at 03:35:49PM CEST, gwil...@vyatta.att-mail.com wrote:
> > >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
> > >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwil...@vyatta.att-mail.com wrote:
> > >> >Account for operational state when determining port linkup state,
> > >> >as per Documentation/networking/operstates.txt.
> > >> 
> > >> Could you please point me to the exact place in the document where this
> > >> is suggested?
> > >> 
> > >
> > >Various places cover it I think.
> > >
> > >In 1. Introduction:
> > >"interface is not usable just because the admin enabled it"
> > >"userspace must be granted the possibility to
> > >influence operational state"
> > >
> > >In 4. Setting from userspace:
> > >"the userspace application can set IFLA_OPERSTATE
> > >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
> > >netif_carrier_off() or netif_dormant_on()"
> > >
> > >We have a use case where we want to set the oper state of the team ports
> > >based on whether they are actually usable or not (as opposed to just admin up).
> > 
> > Are you running a supplicant there or what is the use-case?
> > 
> 
> We are using tun/tap interfaces for the team ports with the physical
> interfaces under the control of a user process.
> 
> > How is this handled in other drivers like bond, openvswitch, bridge, etc?
> 
> It looks like bridge is using it, looking at br_port_carrier_check() and
> br_add_if().
> 

commit 576eb62598f10c8c7fd75703fe89010cdcfff596
Author: stephen hemminger 
AuthorDate: Fri Dec 28 18:15:22 2012 +
Commit: David S. Miller 
CommitDate: Sun Dec 30 02:31:43 2012 -0800

bridge: respect RFC2863 operational state

The bridge link detection should follow the operational state
of the lower device, rather than the carrier bit. This allows devices
like tunnels that are controlled by userspace control plane to work
with bridge STP link management.

> Cheers.
> 
> > 
> > >
> > >Cheers.
> > >
> > >> 
> > >> >
> > >> >Signed-off-by: George Wilkie 
> > >> >---
> > >> > drivers/net/team/team.c | 3 ++-
> > >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >> >
> > >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> > >> >index a6c6ce19..231264a05e55 100644
> > >> >--- a/drivers/net/team/team.c
> > >> >+++ b/drivers/net/team/team.c
> > >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> > >> >case NETDEV_CHANGE:
> > >> >if (netif_running(port->dev))
> > >> >team_port_change_check(port,
> > >> >-  !!netif_carrier_ok(port->dev));
> > >> >+  !!(netif_carrier_ok(port->dev) &&
> > >> >+ netif_oper_up(port->dev)));
> > >> >break;
> > >> >case NETDEV_UNREGISTER:
> > >> >team_del_slave(port->team->dev, dev);
> > >> >-- 
> > >> >2.11.0
> > >> >


Re: [PATCH bpf-next v2 03/11] bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Tariq Toukan



On 18/04/2018 7:29 AM, Nikita V. Shirokov wrote:

With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed as
well (only "decreasing" the pointer's location is going to be supported).
Changing this pointer will change the packet's size.
For the mlx4 driver we will just calculate the packet's length unconditionally
(the same way as it's already being done in mlx5)

Acked-by: Alexei Starovoitov 
---
  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da..efc55feddc5c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -775,8 +775,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 
 			act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
+			length = xdp.data_end - xdp.data;
 			if (xdp.data != orig_data) {
-				length = xdp.data_end - xdp.data;
 				frags[0].page_offset = xdp.data -
 					xdp.data_hard_start;
 				va = xdp.data;



Acked-by: Tariq Toukan 

Thanks.


Re: [PATCH bpf-next v2 00/11] introduction of bpf_xdp_adjust_tail

2018-04-18 Thread Daniel Borkmann
On 04/18/2018 06:29 AM, Nikita V. Shirokov wrote:
> In this patch series I'm adding a new bpf helper which allows manipulating
> xdp's data_end pointer. Right now only "shrinking" (reducing the packet's
> size by moving the pointer) is supported (and I see no use case for
> "growing").
> The main use case for such a helper is to be able to generate control (ICMP)
> messages from XDP context. Such messages usually contain the first N bytes
> of the original packet as payload, and this is exactly what this helper
> allows us to do (see patch 3 for a sample program, where we generate an
> ICMP "packet too big" message). This helper could be useful for load
> balancing applications where, after additional encapsulation, the resulting
> packet could be bigger than the interface MTU.
> Aside from the new helper, this patch series contains minor changes in
> device drivers (for the ones which require it), so they recalculate the
> packet's length not only when the head pointer is adjusted, but when the
> tail one is as well.
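
To illustrate the helper the cover letter describes, a minimal shrink sketch
(sizes illustrative; bpf_helpers.h as extended by this series):

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_KEEP 576	/* keep only the ICMP-quotable head of the packet */

SEC("xdp")
int xdp_shrink_tail(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	int len = data_end - data;

	/* a negative offset moves data_end towards data */
	if (len > MAX_KEEP)
		bpf_xdp_adjust_tail(ctx, MAX_KEEP - len);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";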

The whole set doesn't have any SoBs from you which is mandatory before
applying anything. Please add.

Thanks,
Daniel

> v1->v2:
>  * fixed kbuild warning
>  * made offset eq 0 invalid for xdp_bpf_adjust_tail
>  * splitted bpf_prog_test_run fix and selftests in sep commits
>  * added SPDX licence where applicable
>  * some reshuffling in patches order (tests now in the end)
> 
> 
> Nikita V. Shirokov (11):
>   bpf: making bpf_prog_test run aware of possible data_end ptr change
>   bpf: adding tests for bpf_xdp_adjust_tail
>   bpf: adding bpf_xdp_adjust_tail helper
>   bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
>   bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
>   bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
>   bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
>   bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
>   bpf: make tun compatible w/ bpf_xdp_adjust_tail
>   bpf: make virtio compatible w/ bpf_xdp_adjust_tail
>   bpf: add bpf_xdp_adjust_tail sample prog
> 
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |   2 +-
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   2 +-
>  .../net/ethernet/netronome/nfp/nfp_net_common.c|   2 +-
>  drivers/net/tun.c  |   3 +-
>  drivers/net/virtio_net.c   |   7 +-
>  include/uapi/linux/bpf.h   |  10 +-
>  net/bpf/test_run.c |   3 +-
>  net/core/dev.c |  10 +-
>  net/core/filter.c  |  29 +++-
>  samples/bpf/Makefile   |   4 +
>  samples/bpf/xdp_adjust_tail_kern.c | 152 +
>  samples/bpf/xdp_adjust_tail_user.c | 142 +++
>  tools/include/uapi/linux/bpf.h |  10 +-
>  tools/testing/selftests/bpf/Makefile   |   2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h  |   5 +
>  tools/testing/selftests/bpf/test_adjust_tail.c |  30 
>  tools/testing/selftests/bpf/test_progs.c   |  32 +
>  18 files changed, 435 insertions(+), 12 deletions(-)
>  create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
>  create mode 100644 samples/bpf/xdp_adjust_tail_user.c
>  create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
> 



Re: [PATCH] samples/bpf: correct comment in sock_example.c

2018-04-18 Thread Daniel Borkmann
On 04/17/2018 04:25 AM, Wang Sheng-Hui wrote:
> The program runs against the loopback interface "lo", not "eth0".
> Correct the comment.
> 
> Signed-off-by: Wang Sheng-Hui 

Applied to bpf-next, thanks Wang!


Re: [PATCH net-next] team: account for oper state

2018-04-18 Thread Jiri Pirko
Wed, Apr 18, 2018 at 12:29:50PM CEST, gwil...@vyatta.att-mail.com wrote:
>Account for operational state when determining port linkup state,
>as per Documentation/networking/operstates.txt.

Could you please point me to the exact place in the document where this
is suggested?


>
>Signed-off-by: George Wilkie 
>---
> drivers/net/team/team.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index a6c6ce19..231264a05e55 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
>   case NETDEV_CHANGE:
>   if (netif_running(port->dev))
>   team_port_change_check(port,
>- !!netif_carrier_ok(port->dev));
>+ !!(netif_carrier_ok(port->dev) &&
>+netif_oper_up(port->dev)));
>   break;
>   case NETDEV_UNREGISTER:
>   team_del_slave(port->team->dev, dev);
>-- 
>2.11.0
>


Re: [Regression] net/phy/micrel.c v4.9.94

2018-04-18 Thread Andrew Lunn
On Wed, Apr 18, 2018 at 09:34:16AM +0800, Chris Ruehl wrote:
> Hello,
> 
> I'd like to give you a heads-up about a regression introduced in 4.9.94;
> the commit leads to a kernel oops and makes the network unusable on my
> customized MX6DL board.
> 
> Race condition: resume is called on startup while the phy is not yet
> initialized.

Hi Chris

Please could you try

bfe72442578b ("net: phy: micrel: fix crash when statistic requested for KSZ9031 phy")

 Andrew


Re: [PATCH bpf-next v2 02/11] bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
On Wed, Apr 18, 2018 at 02:48:18PM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 17 Apr 2018 21:29:42 -0700
> "Nikita V. Shirokov"  wrote:
> 
> > With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
> > as well (only "decreasing" the pointer's location is going to be
> > supported). Changing this pointer changes the packet's size.
> > For generic XDP we need to reflect this length change by adjusting the
> > skb's tail pointer.
> > 
> > Acked-by: Alexei Starovoitov 
> 
> You are missing your own Signed-off-by: line on all of the patches.
> 
yeah, somehow lost it between v1 and v2 :) thanks !
> BTW, thank you for working on this! It has been on my todo-list for a
> while now!
> 
> _After_ this patchset, I would like to see adding support for
> "increasing" the data_end location to create a larger packet.  For that
> we should likely add a data_hard_end pointer.  This, would also be
> helpful in cpu_map_build_skb() to know the data_hard_end, to determine
> the frame size (as some driver doesn't use PAGE_SIZE frames, ixgbe).
> 
yeah, increasing the size would be nice to have, but will require more
thinking / rework on the drivers side (as you pointed out, it's not as easy
as "every driver has at least PAGE_SIZE of data available for XDP").
will add it to my TODO
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
--
Nikita


[PATCH] net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"

2018-04-18 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake

Signed-off-by: Colin Ian King 
---
 net/caif/chnl_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c
index 53ecda10b790..13e2ae6be620 100644
--- a/net/caif/chnl_net.c
+++ b/net/caif/chnl_net.c
@@ -174,7 +174,7 @@ static void chnl_flowctrl_cb(struct cflayer *layr, enum 
caif_ctrlcmd flow,
flow == CAIF_CTRLCMD_DEINIT_RSP ? "CLOSE/DEINIT" :
flow == CAIF_CTRLCMD_INIT_FAIL_RSP ? "OPEN_FAIL" :
flow == CAIF_CTRLCMD_REMOTE_SHUTDOWN_IND ?
-"REMOTE_SHUTDOWN" : "UKNOWN CTRL COMMAND");
+"REMOTE_SHUTDOWN" : "UNKNOWN CTRL COMMAND");
 
 
 
-- 
2.17.0



Re: tcp hang when socket fills up ?

2018-04-18 Thread Jozsef Kadlecsik
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Thanks for the testing! One more line is required, however: we have to get 
> > the assured bit set for the connection, see the new patch below.
> 
> I think it actually was better before. If I understand things correctly
> at this point (when we get in the case TCP_CONNTRACK_SYN_RECV) we will
> have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.
> 
> Leaving old state as it was will not set the assured bit, but that will 
> be set on the next packet because old_state == new_state == established 
> at that point and the connection will really be setup then.

Yes, you are right: the first patch is better than the second one. 
Overthinking :-)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
  H-1525 Budapest 114, POB. 49, Hungary


[RFC PATCH] net: bridge: multicast querier per VLAN support

2018-04-18 Thread Joachim Nilsson
This RFC patch¹ is an attempt to add multicast querier per VLAN support
to a VLAN aware bridge.  I'm posting it as RFC for now since non-VLAN
aware bridges are not handled, and one of my questions is whether that
is complexity we need to continue supporting.

From what I understand, multicast join/report already supports per-VLAN
operation, and the MDB supports filtering per VLAN as well, but queries
are currently limited to per-port operation on VLAN-aware bridges.

The naive² approach of this patch relocates the query timers from the
bridge to operate per VLAN; on timer expiry we send queries to all bridge
ports in the same VLAN.  Tagged port members get tagged VLAN queries.

Unlike the original patch¹, which uses a sysfs entry to set the querier
address of each VLAN, this uses the IP address of the VLAN interface when
initiating a per VLAN query.  A version of inet_select_addr() is used
for this, called inet_select_dev_addr(), not included in this patch; a
hypothetical sketch follows below.
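
A hypothetical sketch of the shape such a helper could take (the name,
signature and exact semantics are assumptions for illustration, not part
of this patch):

static __be32 inet_select_dev_addr(const struct net_device *dev, int scope)
{
	const struct in_device *in_dev;
	const struct in_ifaddr *ifa;
	__be32 addr = 0;

	rcu_read_lock();
	in_dev = __in_dev_get_rcu(dev);
	if (in_dev) {
		/* pick the first address on this device whose scope is
		 * wide enough, instead of walking all devices the way
		 * inet_select_addr() does
		 */
		for (ifa = in_dev->ifa_list; ifa; ifa = ifa->ifa_next) {
			if (ifa->ifa_scope <= scope) {
				addr = ifa->ifa_local;
				break;
			}
		}
	}
	rcu_read_unlock();

	return addr;
}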

Open questions/TODO:

- First of all, is this patch useful to anyone?
- The current br_multicast.c is very complex.  The support for both IPv4
  and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
  'br->vlan_enabled' ... this has likely been discussed before, but if
  we could remove those code paths I believe what's left would be quite
  a bit easier to read and maintain.
- Many per-bridge specific multicast sysfs settings may need to have a
  corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
  How should we go about that? (For status reporting I have a proposal)
- Ditto per-port specific multicast sysfs settings, e.g. multicast_router
- The MLD support has been kept in sync with the rest but is completely
  untested.  In particular I suspect the wrong source IP will be used.

¹) Initially based on a patch by Cumulus Networks
   
http://repo3.cumulusnetworks.com/repo/pool/cumulus/l/linux/linux-source-4.1_4.1.33-1+cl3u11_all.deb
²) This patch is currently limited to work only on bridges with VLAN
   enabled.  Care has been taken to support MLD snooping, but it is
   completely untested.

Thank you for reading this far!

Signed-off-by: Joachim Nilsson 
---
 net/bridge/br_device.c|   2 +-
 net/bridge/br_input.c |   2 +-
 net/bridge/br_multicast.c | 456 --
 net/bridge/br_private.h   |  38 +++-
 net/bridge/br_stp.c   |   5 +-
 net/bridge/br_vlan.c  |   3 +
 6 files changed, 327 insertions(+), 179 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 02f9f8aab047..ba35485032d8 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -98,7 +98,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
mdst = br_mdb_get(br, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-   br_multicast_querier_exists(br, eth_hdr(skb)))
+   br_multicast_querier_exists(br, vid, eth_hdr(skb)))
br_multicast_flood(mdst, skb, false, true);
else
br_flood(br, skb, BR_PKT_MULTICAST, false, true);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 56bb9189c374..13d48489e0e1 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -137,7 +137,7 @@ int br_handle_frame_finish(struct net *net, struct sock 
*sk, struct sk_buff *skb
mdst = br_mdb_get(br, skb, vid);
if ((mdst && mdst->addr.proto == htons(ETH_P_ALL)) ||
((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-br_multicast_querier_exists(br, eth_hdr(skb)))) {
+br_multicast_querier_exists(br, vid, eth_hdr(skb)))) {
if ((mdst && mdst->host_joined) ||
br_multicast_is_router(br)) {
local_rcv = true;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 277ecd077dc4..72e47d500972 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,7 +38,7 @@
 
 #include "br_private.h"
 
-static void br_multicast_start_querier(struct net_bridge *br,
+static void br_multicast_start_querier(struct net_bridge_vlan *vlan,
   struct bridge_mcast_own_query *query);
 static void br_multicast_add_router(struct net_bridge *br,
struct net_bridge_port *port);
@@ -46,13 +47,14 @@ static void br_ip4_multicast_leave_group(struct net_bridge 
*br,
 __be32 group,
 __u16 vid,
 const unsigned char *src);
-
+static void br_ip4_multicast_query_expired(struct timer_list *t);
 static void __del_port_router(struct net_bridge_port *p);
 #if IS_ENABLED(CONFIG_IPV6)
 

Re: [PATCH net-next 3/3] net: phy: Enable C45 PHYs with vendor specific address space

2018-04-18 Thread Andrew Lunn
On Wed, Apr 18, 2018 at 09:38:47AM +, Vicenţiu Galanopulo wrote:
> 
> 
> > > Having dev-addr stored in devices_addrs, in get_phy_c45_ids(), when
> > > probing the identifiers, dev-addr can be extracted from devices_addrs
> > > and probed if devices_addrs[current_identifier] is not 0.
> > 
> > I must clearly be missing something, but why are you introducing all these
> > conditionals instead of updating the existing code to be able to operate 
> > against
> > an arbitrary dev-addr value, and then just making sure the first thing you 
> > do is
> > fetch that property from Device Tree? There is no way someone is going to be
> > testing with your specific use case in the future (except yourselves) so 
> > unless you
> > make supporting an arbitrary "dev-addr" value become part of how the code
> > works, this is going to be breaking badly.
> >
> 
> Hi Florian,
> 
> My intention was to have this patch act as a "plugin" and modify the
> existing kernel API as little as possible.

Hi Vicenţiu

In Linux, kernel APIs are not sacred. If you need to change them, do
so.

We want a clear, well integrated solution, with minimal
duplication.

Andrew


Re: [Regression] net/phy/micrel.c v4.9.94

2018-04-18 Thread Andrew Lunn
On Wed, Apr 18, 2018 at 02:56:01PM +0200, Andrew Lunn wrote:
> On Wed, Apr 18, 2018 at 09:34:16AM +0800, Chris Ruehl wrote:
> > Hello,
> > 
> > I'd like to give you a heads-up on a regression introduced in 4.9.94:
> > the commit leads to a kernel oops and makes the network unusable on my
> > customized MX6DL board.
> > 
> > Race condition: resume is called on startup while the PHY is not yet
> > initialized.
> 
> Hi Chris
> 
> Please could you try
> 
> bfe72442578b ("net: phy: micrel: fix crash when statistic requested for 
> KSZ9031 phy")

I don't think it is a complete fix. I suspect "Micrel KSZ8795",
"Micrel KSZ886X Switch", "Micrel KSZ8061", and "Micrel KS8737" will
still have problems.

Those four probably need a:

.probe  = kszphy_probe,

Andrew
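
For reference, a hedged sketch of what such an entry could look like in
drivers/net/phy/micrel.c for one of the four; the .probe line is the point
here, the other field values are illustrative assumptions rather than
copies of the driver:

{
	.phy_id		= PHY_ID_KSZ8061,
	.phy_id_mask	= MICREL_PHY_ID_MASK,
	.name		= "Micrel KSZ8061",
	.features	= PHY_BASIC_FEATURES,
	.probe		= kszphy_probe,	/* allocates phydev->priv */
	.config_init	= kszphy_config_init,
	.suspend	= genphy_suspend,
	.resume		= kszphy_resume,
},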


[PATCH bpf-next v3 10/11] bpf: adding tests for bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
Adding selftests for the bpf_xdp_adjust_tail helper. In this synthetic test
we check that 1) the helper returns EINVAL if data_end < data, and
2) in the normal use case the packet's length is reduced.

Signed-off-by: Nikita V. Shirokov 
---
 tools/include/uapi/linux/bpf.h | 10 +++-
 tools/testing/selftests/bpf/Makefile   |  2 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |  3 +++
 tools/testing/selftests/bpf/test_adjust_tail.c | 30 
 tools/testing/selftests/bpf/test_progs.c   | 32 ++
 5 files changed, 75 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465023a2..56bf493ba7ed 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
  * @addr: pointer to struct sockaddr to bind socket to
  * @addr_len: length of sockaddr structure
  * Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ * Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ * size is supported.
+ * @xdp_md: pointer to xdp_md
+ * @delta: A negative integer to be added to xdp_md.data_end
+ * Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -821,7 +828,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(xdp_adjust_tail),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0a315ddabbf4..3e819dc70bee 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -31,7 +31,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
-   sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o
+   sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d99f96d..50c607014b22 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,9 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int 
end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
+static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
+   (void *) BPF_FUNC_xdp_adjust_tail;
+
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_adjust_tail.c 
b/tools/testing/selftests/bpf/test_adjust_tail.c
new file mode 100644
index ..4cd5e860c903
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_adjust_tail.c
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2018 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include "bpf_helpers.h"
+
+int _version SEC("version") = 1;
+
+SEC("xdp_adjust_tail")
+int _xdp_adjust_tail(struct xdp_md *xdp)
+{
+   void *data_end = (void *)(long)xdp->data_end;
+   void *data = (void *)(long)xdp->data;
+   int offset = 0;
+
+   if (data_end - data == 54)
+   offset = 256;
+   else
+   offset = 20;
+   if (bpf_xdp_adjust_tail(xdp, 0 - offset))
+   return XDP_DROP;
+   return XDP_TX;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index faadbe233966..eedda98d7bb1 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -166,6 +166,37 @@ static void test_xdp(void)
bpf_object__close(obj);
 }
 
+static void test_xdp_adjust_tail(void)
+{
+   const char *file = "./test_adjust_tail.o";
+   struct bpf_object *obj;
+   char buf[128];
+   __u32 duration, retval, size;
+   int err, prog_fd;
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+   if (err) {
+   

[PATCH bpf-next v3 08/11] bpf: make virtio compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
as well (only "decreasing" the pointer's location is going to be
supported). Changing this pointer changes the packet's size.
For the virtio driver we need to adjust the XDP_PASS handling by
recalculating the packet's length when it is passed up to the TCP/IP stack.

Reviewed-by: Jason Wang 
Signed-off-by: Nikita V. Shirokov 
---
 drivers/net/virtio_net.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 01694e26f03e..779a4f798522 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -606,6 +606,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
case XDP_PASS:
/* Recalculate length in case bpf program changed it */
delta = orig_data - xdp.data;
+   len = xdp.data_end - xdp.data;
break;
case XDP_TX:
xdpf = convert_to_xdp_frame(&xdp);
@@ -642,7 +643,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
goto err;
}
skb_reserve(skb, headroom - delta);
-   skb_put(skb, len + delta);
+   skb_put(skb, len);
if (!delta) {
buf += header_offset;
memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
@@ -757,6 +758,10 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
offset = xdp.data -
page_address(xdp_page) - vi->hdr_len;
 
+   /* recalculate len if xdp.data or xdp.data_end were
+* adjusted
+*/
+   len = xdp.data_end - xdp.data;
/* We can only create skb based on xdp_page. */
if (unlikely(xdp_page != page)) {
rcu_read_unlock();
-- 
2.15.1



[PATCH bpf-next v3 11/11] bpf: add bpf_xdp_adjust_tail sample prog

2018-04-18 Thread Nikita V. Shirokov
Adding a bpf sample program which uses the bpf_xdp_adjust_tail helper
to generate an ICMPv4 "packet too big" message if the ingress packet's
size is bigger than 600 bytes.

Signed-off-by: Nikita V. Shirokov 
---
 samples/bpf/Makefile  |   4 +
 samples/bpf/xdp_adjust_tail_kern.c| 152 ++
 samples/bpf/xdp_adjust_tail_user.c| 142 
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 4 files changed, 300 insertions(+)
 create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
 create mode 100644 samples/bpf/xdp_adjust_tail_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6edd4bf6..aa8c392e2e52 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
+hostprogs-y += xdp_adjust_tail
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
+always += xdp_adjust_tail_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_xdp_adjust_tail += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_adjust_tail_kern.c 
b/samples/bpf/xdp_adjust_tail_kern.c
new file mode 100644
index ..411fdb21f8bc
--- /dev/null
+++ b/samples/bpf/xdp_adjust_tail_kern.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2018 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program shows how to use bpf_xdp_adjust_tail() by
+ * generating an ICMPv4 "packet too big" message (to be more precise in
+ * the v4 case: destination unreachable / fragmentation needed and DF
+ * set) when receiving packets bigger than 600 bytes.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define DEFAULT_TTL 64
+#define MAX_PCKT_SIZE 600
+#define ICMP_TOOBIG_SIZE 98
+#define ICMP_TOOBIG_PAYLOAD_SIZE 92
+
+struct bpf_map_def SEC("maps") icmpcnt = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u64),
+   .max_entries = 1,
+};
+
+static __always_inline void count_icmp(void)
+{
+   u64 key = 0;
+   u64 *icmp_count;
+
+   icmp_count = bpf_map_lookup_elem(&icmpcnt, &key);
+   if (icmp_count)
+   *icmp_count += 1;
+}
+
+static __always_inline void swap_mac(void *data, struct ethhdr *orig_eth)
+{
+   struct ethhdr *eth;
+
+   eth = data;
+   memcpy(eth->h_source, orig_eth->h_dest, ETH_ALEN);
+   memcpy(eth->h_dest, orig_eth->h_source, ETH_ALEN);
+   eth->h_proto = orig_eth->h_proto;
+}
+
+static __always_inline __u16 csum_fold_helper(__u32 csum)
+{
+   return ~((csum & 0x) + (csum >> 16));
+}
+
+static __always_inline void ipv4_csum(void *data_start, int data_size,
+ __u32 *csum)
+{
+   *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
+   *csum = csum_fold_helper(*csum);
+}
+
+static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
+{
+   int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
+
+   if (bpf_xdp_adjust_head(xdp, 0 - headroom))
+   return XDP_DROP;
+   void *data = (void *)(long)xdp->data;
+   void *data_end = (void *)(long)xdp->data_end;
+
+   if (data + (ICMP_TOOBIG_SIZE + headroom) > data_end)
+   return XDP_DROP;
+
+   struct iphdr *iph, *orig_iph;
+   struct icmphdr *icmp_hdr;
+   struct ethhdr *orig_eth;
+   __u32 csum = 0;
+   __u64 off = 0;
+
+   orig_eth = data + headroom;
+   swap_mac(data, orig_eth);
+   off += sizeof(struct ethhdr);
+   iph = data + off;
+   off += sizeof(struct iphdr);
+   icmp_hdr = data + off;
+   off += sizeof(struct icmphdr);
+   orig_iph = data + off;
+   icmp_hdr->type = ICMP_DEST_UNREACH;
+   icmp_hdr->code = 

[PATCH bpf-next v3 04/11] bpf: make bnxt compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
as well (only "decreasing" the pointer's location is going to be
supported). Changing this pointer changes the packet's size.
For the bnxt driver we just calculate the packet's length unconditionally.

Acked-by: Alexei Starovoitov 
Signed-off-by: Nikita V. Shirokov 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 1389ab5e05df..1f0e872d0667 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -113,10 +113,10 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct 
bnxt_rx_ring_info *rxr, u16 cons,
if (tx_avail != bp->tx_ring_size)
*event &= ~BNXT_RX_EVENT;
 
+   *len = xdp.data_end - xdp.data;
if (orig_data != xdp.data) {
offset = xdp.data - xdp.data_hard_start;
*data_ptr = xdp.data_hard_start + offset;
-   *len = xdp.data_end - xdp.data;
}
switch (act) {
case XDP_PASS:
-- 
2.15.1



[PATCH bpf-next v3 06/11] bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
as well (only "decreasing" the pointer's location is going to be
supported). Changing this pointer changes the packet's size.
For the nfp driver we just calculate the packet's length unconditionally.

Acked-by: Alexei Starovoitov 
Acked-by: Jakub Kicinski 
Signed-off-by: Nikita V. Shirokov 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 1eb6549f2a54..d9111c077699 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, 
int budget)
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
-   pkt_len -= xdp.data - orig_data;
+   pkt_len = xdp.data_end - xdp.data;
pkt_off += xdp.data - orig_data;
 
switch (act) {
-- 
2.15.1



[PATCH bpf-next v3 01/11] bpf: adding bpf_xdp_adjust_tail helper

2018-04-18 Thread Nikita V. Shirokov
Adding a new bpf helper which allows us to manipulate
xdp's data_end pointer and thereby reduce the packet's size.
Intended use case: generating ICMP messages from XDP context,
where such a message contains a truncated copy of the original packet.

Signed-off-by: Nikita V. Shirokov 
---
 include/uapi/linux/bpf.h | 10 +-
 net/core/filter.c| 29 -
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..9a2d1a04eb24 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
  * @addr: pointer to struct sockaddr to bind socket to
  * @addr_len: length of sockaddr structure
  * Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ * Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ * size is supported.
+ * @xdp_md: pointer to xdp_md
+ * @delta: A negative integer to be added to xdp_md.data_end
+ * Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -821,7 +828,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(xdp_adjust_tail),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..29318598fd60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2725,6 +2725,30 @@ static const struct bpf_func_proto 
bpf_xdp_adjust_head_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
+{
+   void *data_end = xdp->data_end + offset;
+
+   /* only shrinking is allowed for now. */
+   if (unlikely(offset >= 0))
+   return -EINVAL;
+
+   if (unlikely(data_end < xdp->data + ETH_HLEN))
+   return -EINVAL;
+
+   xdp->data_end = data_end;
+
+   return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
+   .func   = bpf_xdp_adjust_tail,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
void *meta = xdp->data_meta + offset;
@@ -3074,7 +3098,8 @@ bool bpf_helper_changes_pkt_data(void *func)
func == bpf_l4_csum_replace ||
func == bpf_xdp_adjust_head ||
func == bpf_xdp_adjust_meta ||
-   func == bpf_msg_pull_data)
+   func == bpf_msg_pull_data ||
+   func == bpf_xdp_adjust_tail)
return true;
 
return false;
@@ -3888,6 +3913,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
return &bpf_xdp_redirect_proto;
case BPF_FUNC_redirect_map:
return &bpf_xdp_redirect_map_proto;
+   case BPF_FUNC_xdp_adjust_tail:
+   return &bpf_xdp_adjust_tail_proto;
default:
return bpf_base_func_proto(func_id);
}
-- 
2.15.1



[PATCH bpf-next v3 00/11] introduction of bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
In this patch series I am adding a new bpf helper which allows manipulating
xdp's data_end pointer. Right now only "shrinking" (reducing the packet's
size by moving the pointer) is supported (and I see no use case for
"growing").
The main use case for such a helper is to be able to generate control
(ICMP) messages from XDP context. Such messages usually contain the first
N bytes of the original packet as payload, and this is exactly what this
helper allows us to do (see patch 11 for a sample program, where we
generate an ICMP "packet too big" message). This helper could be useful
for load balancing applications where, after additional encapsulation,
the resulting packet could be bigger than the interface MTU.
Aside from the new helper, this patch series contains minor changes in the
device drivers (for the ones which require it), so they recalculate the
packet's length not only when the head pointer was adjusted but when the
tail one was as well.

v2->v3:
 * added the "Signed-off-by" missing from v2

v1->v2:
 * fixed kbuild warning
 * made offset == 0 invalid for bpf_xdp_adjust_tail
 * split the bpf_prog_test_run fix and the selftests into separate commits
 * added SPDX license where applicable
 * some reshuffling of the patch order (tests now at the end)


Nikita V. Shirokov (11):
  bpf: making bpf_prog_test run aware of possible data_end ptr change
  bpf: adding tests for bpf_xdp_adjust_tail
  bpf: adding bpf_xdp_adjust_tail helper
  bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
  bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
  bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
  bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
  bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
  bpf: make tun compatible w/ bpf_xdp_adjust_tail
  bpf: make virtio compatible w/ bpf_xdp_adjust_tail
  bpf: add bpf_xdp_adjust_tail sample prog

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |   2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   2 +-
 drivers/net/tun.c  |   3 +-
 drivers/net/virtio_net.c   |   7 +-
 include/uapi/linux/bpf.h   |  10 +-
 net/bpf/test_run.c |   3 +-
 net/core/dev.c |  10 +-
 net/core/filter.c  |  29 +++-
 samples/bpf/Makefile   |   4 +
 samples/bpf/xdp_adjust_tail_kern.c | 152 +
 samples/bpf/xdp_adjust_tail_user.c | 142 +++
 tools/include/uapi/linux/bpf.h |  10 +-
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   5 +
 tools/testing/selftests/bpf/test_adjust_tail.c |  30 
 tools/testing/selftests/bpf/test_progs.c   |  32 +
 18 files changed, 435 insertions(+), 12 deletions(-)
 create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
 create mode 100644 samples/bpf/xdp_adjust_tail_user.c
 create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c

-- 
2.15.1



[PATCH bpf-next v3 03/11] bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
as well (only "decreasing" the pointer's location is going to be
supported). Changing this pointer changes the packet's size.
For the mlx4 driver we just calculate the packet's length unconditionally
(the same way as is already done in mlx5).

Acked-by: Alexei Starovoitov 
Acked-by: Tariq Toukan 
Signed-off-by: Nikita V. Shirokov 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da..efc55feddc5c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -775,8 +775,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
+   length = xdp.data_end - xdp.data;
if (xdp.data != orig_data) {
-   length = xdp.data_end - xdp.data;
frags[0].page_offset = xdp.data -
xdp.data_hard_start;
va = xdp.data;
-- 
2.15.1



[PATCH bpf-next v3 07/11] bpf: make tun compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Nikita V. Shirokov
With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
as well (only "decreasing" the pointer's location is going to be
supported). Changing this pointer changes the packet's size.
For the tun driver we need to adjust the XDP_PASS handling by
recalculating the packet's length when it is passed up to the TCP/IP stack
(in case the data_end pointer was adjusted by the xdp program run).

Reviewed-by: Jason Wang 
Signed-off-by: Nikita V. Shirokov 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1e58be152d5c..901351a6ed21 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1696,6 +1696,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
return NULL;
case XDP_PASS:
delta = orig_data - xdp.data;
+   len = xdp.data_end - xdp.data;
break;
default:
bpf_warn_invalid_xdp_action(act);
@@ -1716,7 +1717,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
}
 
skb_reserve(skb, pad - delta);
-   skb_put(skb, len + delta);
+   skb_put(skb, len);
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
 
-- 
2.15.1



Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Willem de Bruijn
> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
> just be automatically determined in the stack from the pmtu? What's
> the motivation for the socket option for this? Also, AIUI, this can be
> either a per-socket or a per-packet option?

I decided to let the application explicitly set segment size, to avoid
bugs from the application assuming a different MTU from the one
used in the kernel for segmentation.

With path MTU, it is too easy for a process to incorrectly assume
link MTU or stale path MTU. With the current interface, if a process
tries to assemble segments larger than relevant path MTU, the
send call will fail.

A process may also explicitly want to send a chain of packets
smaller than MTU.
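
To make the interface concrete, a minimal userspace sketch of both knobs
(a sketch against this series, error handling trimmed; the UDP_SEGMENT
value is the one added by the series' uapi change):

#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103	/* from the series' uapi addition */
#endif

/* per-socket: all subsequent sends are segmented at gso_size bytes */
static int set_gso_size(int fd, uint16_t gso_size)
{
	return setsockopt(fd, SOL_UDP, UDP_SEGMENT,
			  &gso_size, sizeof(gso_size));
}

/* per-call: attach the segment size as a cmsg to a single sendmsg */
static ssize_t send_gso(int fd, void *buf, size_t len, uint16_t gso_size)
{
	char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {0};
	struct cmsghdr *cm;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type = UDP_SEGMENT;
	cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
	memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

	return sendmsg(fd, &msg, 0);
}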


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Paolo Abeni
On Tue, 2018-04-17 at 16:00 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Segmentation offload reduces cycles/byte for large packets by
> amortizing the cost of protocol stack traversal.
> 
> This patchset implements GSO for UDP. A process can concatenate and
> submit multiple datagrams to the same destination in one send call
> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
> or passing an analogous cmsg at send time.
> 
> The stack will send the entire large (up to network layer max size)
> datagram through the protocol layer. At the GSO layer, it is broken
> up in individual segments. All receive the same network layer header
> and UDP src and dst port. All but the last segment have the same UDP
> header, but the last may differ in length and checksum.

This is interesting, thanks for sharing!

I have some local patches somewhere implementing UDP GRO, but I never
tried to upstream them, since I lacked the associated GSO and I thought
that the use-case was not too relevant.

Given that your use-case is a connected socket - no per-packet route
lookup - how does GSO perform compared to plain sendmmsg()? Have you
considered using and/or improving the latter?

When testing with the Spectre/Meltdown mitigations in place, I expect that
the most relevant part of the gain is due to the single syscall per
burst.

Cheers,

Paolo


[PATCH] rt2x00: fix spelling mistake in various macros, UKNOWN -> UNKNOWN

2018-04-18 Thread Colin King
From: Colin Ian King 

Rename several macros that contain misspellings of UNKNOWN

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/ralink/rt2x00/rt2800.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800.h 
b/drivers/net/wireless/ralink/rt2x00/rt2800.h
index 6a8c93fb6a43..8eccfbb5d6f8 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800.h
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800.h
@@ -1193,10 +1193,10 @@
 #define TX_PWR_CFG_3_MCS13 FIELD32(0x00f0)
 #define TX_PWR_CFG_3_MCS14 FIELD32(0x0f00)
 #define TX_PWR_CFG_3_MCS15 FIELD32(0xf000)
-#define TX_PWR_CFG_3_UKNOWN1   FIELD32(0x000f)
-#define TX_PWR_CFG_3_UKNOWN2   FIELD32(0x00f0)
-#define TX_PWR_CFG_3_UKNOWN3   FIELD32(0x0f00)
-#define TX_PWR_CFG_3_UKNOWN4   FIELD32(0xf000)
+#define TX_PWR_CFG_3_UNKNOWN1  FIELD32(0x000f)
+#define TX_PWR_CFG_3_UNKNOWN2  FIELD32(0x00f0)
+#define TX_PWR_CFG_3_UNKNOWN3  FIELD32(0x0f00)
+#define TX_PWR_CFG_3_UNKNOWN4  FIELD32(0xf000)
 /* bits for 3T devices */
 #define TX_PWR_CFG_3_MCS12_CH0 FIELD32(0x000f)
 #define TX_PWR_CFG_3_MCS12_CH1 FIELD32(0x00f0)
@@ -1216,10 +1216,10 @@
  * TX_PWR_CFG_4:
  */
 #define TX_PWR_CFG_4   0x1324
-#define TX_PWR_CFG_4_UKNOWN5   FIELD32(0x000f)
-#define TX_PWR_CFG_4_UKNOWN6   FIELD32(0x00f0)
-#define TX_PWR_CFG_4_UKNOWN7   FIELD32(0x0f00)
-#define TX_PWR_CFG_4_UKNOWN8   FIELD32(0xf000)
+#define TX_PWR_CFG_4_UNKNOWN5  FIELD32(0x000f)
+#define TX_PWR_CFG_4_UNKNOWN6  FIELD32(0x00f0)
+#define TX_PWR_CFG_4_UNKNOWN7  FIELD32(0x0f00)
+#define TX_PWR_CFG_4_UNKNOWN8  FIELD32(0xf000)
 /* bits for 3T devices */
 #define TX_PWR_CFG_4_STBC4_CH0 FIELD32(0x000f)
 #define TX_PWR_CFG_4_STBC4_CH1 FIELD32(0x00f0)
-- 
2.17.0



Re: [PATCH] rt2x00: fix spelling mistake in various macros, UKNOWN -> UNKNOWN

2018-04-18 Thread Stanislaw Gruszka
On Wed, Apr 18, 2018 at 12:47:50PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Rename several macros that contain misspellings of UNKNOWN
> 
> Signed-off-by: Colin Ian King 
Acked-by: Stanislaw Gruszka 


Re: [PATCH v3 1/9] net: phy: new Asix Electronics PHY driver

2018-04-18 Thread Andrew Lunn
> +
> +/**
> + * asix_soft_reset - software reset the PHY via BMCR_RESET bit
> + * @phydev: target phy_device struct
> + *
> + * Description: Perform a software PHY reset using the standard
> + * BMCR_RESET bit and poll for the reset bit to be cleared.
> + * Toggle BMCR_RESET bit off to accommodate broken PHY implementations
> + * such as used on the Individual Computers' X-Surf 100 Zorro card.
> + *
> + * Returns: 0 on success, < 0 on failure
> + */
> +static int asix_soft_reset(struct phy_device *phydev)
> +{
> + int ret;
> +
> + /* Asix PHY won't reset unless reset bit toggles */
> + ret = phy_write(phydev, MII_BMCR, 0);
> + if (ret < 0)
> + return ret;
> +
> + phy_write(phydev, MII_BMCR, BMCR_RESET);
> +
> + return phy_poll_reset(phydev);
> +}

Why not simply:

static int asix_soft_reset(struct phy_device *phydev)
{
int ret;

/* Asix PHY won't reset unless reset bit toggles */
ret = phy_write(phydev, MII_BMCR, 0);
if (ret < 0)
return ret;

return genphy_soft_reset(phydev);
}

Andrew
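
For context, the equivalence rests on genphy_soft_reset() itself writing
BMCR_RESET and polling for completion; roughly (paraphrased from the PHY
core, not quoted verbatim):

int genphy_soft_reset(struct phy_device *phydev)
{
	int ret;

	/* standard soft reset: set BMCR_RESET and poll until it clears */
	ret = phy_write(phydev, MII_BMCR, BMCR_RESET);
	if (ret < 0)
		return ret;

	return phy_poll_reset(phydev);
}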


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Eric Dumazet


On 04/18/2018 05:31 AM, Sowmini Varadhan wrote:
> 
> I went through the patch set and the code looks fine - it extends the
> existing TCP/GSO infra to UDP.
> 
> One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
> just be automatically determined in the stack from the pmtu? What's
> the motivation for the socket option for this? Also, AIUI, this can be
> either a per-socket or a per-packet option?
> 
> However, I share Sridhar's concerns about the very fundamental change
> to UDP message boundary semantics here.  

There is no change at all.

This will only be used as a mechanism to send X packets of same size.

So instead of X system calls , one system call.

One traversal of some expensive part of the host stack.

The content on the wire should be the same.




Re:Re: [PATCH v3] net: davicom: dm9000: Avoid spinlock recursion during dm9000_timeout routine

2018-04-18 Thread liuxiang

Hi,
Because the timeout task takes the main spinlock and disables the current
CPU's IRQs, no other task on the same CPU can run, and tasks on the other
CPUs cannot enter dm9000_timeout() again. So throughout the whole
dm9000_timeout() routine, db->timeout_cpu cannot be changed by other
tasks. Although smp_processor_id() may change after preempt_enable(),
those tasks always get a false result when calling
dm9000_current_in_timeout(); only the timeout task gets a true result.
And if there is no timeout, all the tasks that want to do an asynchronous
phy operation get a false result. So I think this can avoid the race.


At 2018-04-16 23:05:01, "David Miller"  wrote:
>From: Liu Xiang 
>Date: Sat, 14 Apr 2018 16:50:34 +0800
>
>> +static bool dm9000_current_in_timeout(struct board_info *db)
>> +{
>> +bool ret = false;
>> +
>> +preempt_disable();
>> +ret = (db->timeout_cpu == smp_processor_id());
>> +preempt_enable();
>
>This doesn't work.
>
>As soon as you do preempt_enable(), smp_processor_id() can change.


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Willem de Bruijn
On Wed, Apr 18, 2018 at 9:47 AM, Sowmini Varadhan
 wrote:
> On (04/18/18 06:35), Eric Dumazet wrote:
>>
>> There is no change at all.
>>
>> This will only be used as a mechanism to send X packets of same size.
>>
>> So instead of X system calls , one system call.
>>
>> One traversal of some expensive part of the host stack.
>>
>> The content on the wire should be the same.
>
> I'm sorry that's not how I interpret Willem's email below
> (and maybe I misunderstood)
>
> the following taken from https://www.spinics.net/lists/netdev/msg496150.html
>
> Sowmini> If yes, how will the recvmsg differentiate between the case
> Sowmini> (2000 byte message followed by 512 byte message) and
> Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
> Sowmini> in other words, how are UDP message boundary semantics preserved?
>
> Willem> They aren't. This is purely an optimization to amortize the cost of
> Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
> Willem> boundaries of the original larger than MTU datagram.
>
> As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
> - classic UDP will send 2 IP fragments, the first one with a full UDP
>   header, and the IP header indicating that this is the first frag for
>   that ipid, with more frags to follow. The second frag will have the
>   rest with the same ipid, it will not have a udp header,
>   and it will indicate that it is the last frag (no more frags).
>
>   The receiver can thus use the ipid, "more-frags" bit, frag offset etc
>   to stitch the 2000 byte udp message together and pass it up on the udp
>   socket.
>
> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>   udp packets, each of them with a unique udp header, and uh_len set
>   to 1476 (for first) and 526 (for second). The receiver has no clue
>   that they are both part of the same UDP datagram, So wire format
>   is not the same, am I mistaken?

Eric is correct. If the application sets a segment size with UDP_SEGMENT
this is an instruction to the kernel to split the payload along that border into
separate discrete datagrams.

It does not matter what the behavior is without setting this option. If a
process wants to send a larger than MTU datagram and rely on the
kernel to fragment, then it should not set the option.
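
A worked example of that split (the sizes here are illustrative
assumptions, not taken from the patches):

#include <stdio.h>

int main(void)
{
	int len = 2000, gso_size = 1472;

	/* with UDP_SEGMENT = gso_size, the kernel emits one discrete
	 * datagram per chunk; the original 2000-byte boundary is gone
	 */
	while (len > 0) {
		int seg = len < gso_size ? len : gso_size;

		printf("datagram: %d payload bytes, uh_len %d\n",
		       seg, seg + 8);
		len -= seg;
	}
	return 0;
}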


Re: tcp hang when socket fills up ?

2018-04-18 Thread Jozsef Kadlecsik
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Dominique Martinet wrote on Wed, Apr 18, 2018:
> > Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > > Yes, the state transition is wrong for simultaneous open, because the 
> > > tcp_conntracks table is not (cannot be) smart enough. Could you verify 
> > > the 
> > > next untested patch?
> > 
> > Thanks for the patch; I'll give it a try (probably won't make it today
> > so will report tomorrow)
> 
> Actually I had time; I can confirm (with added printks) that we did get
> into the if that the patch pointed at, and we no longer get there now.
> The connection no longer gets into an invalid state, so it looks like
> that nailed it.
>
> I'm now confused about what this has to do with tcp_timestamps though,
> since turning that off also seemed to work around the issue; but if we
> get something like this in, I'll be happy anyway.

Thanks for the testing! One more line is required, however: we have to get 
the assured bit set for the connection, see the new patch below.

The tcp_conntracks state table could be fixed by introducing a new 
state, but that part is exposed to userspace (ctnetlink) and ugly 
compatibility code would be required for backward compatibility.
 
diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h 
b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN0x80
+
 struct nf_ct_tcp_flags {
__u8 flags;
__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c 
b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..2c1fc7e 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,20 @@ static int tcp_packet(struct nf_conn *ct,
return NF_ACCEPT; /* Don't change state */
}
break;
+   case TCP_CONNTRACK_SYN_SENT2:
+   /* tcp_conntracks table is not smart enough to handle
+* simultaneous open.
+*/
+   ct->proto.tcp.last_flags |= IP_CT_TCP_SIMULTANEOUS_OPEN;
+   break;
+   case TCP_CONNTRACK_SYN_RECV:
+   if (dir == IP_CT_DIR_REPLY && index == TCP_ACK_SET &&
+   ct->proto.tcp.last_flags & IP_CT_TCP_SIMULTANEOUS_OPEN) {
+   /* We want to set the assured bit */
+   old_state = TCP_CONNTRACK_SYN_RECV;
+   new_state = TCP_CONNTRACK_ESTABLISHED;
+   }
+   break;
case TCP_CONNTRACK_CLOSE:
if (index == TCP_RST_SET
&& (ct->proto.tcp.seen[!dir].flags & 
IP_CT_TCP_FLAG_MAXACK_SET)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
  H-1525 Budapest 114, POB. 49, Hungary
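
To annotate the case being fixed (my reading of the discussion above,
directions as seen by conntrack; not part of the patch):

/*
 * Simultaneous open, as tracked with this patch:
 *
 *   ORIGINAL  SYN      -> SYN_SENT
 *   REPLY     SYN      -> SYN_SENT2  (sets IP_CT_TCP_SIMULTANEOUS_OPEN)
 *   ORIGINAL  SYN/ACK  -> SYN_RECV
 *   REPLY     ACK      -> treated as SYN_RECV -> ESTABLISHED,
 *                         so the assured bit can now be set
 */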


Re: Fw: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.

2018-04-18 Thread Ursula Braun


On 04/18/2018 04:56 AM, Stephen Hemminger wrote:
> This may already be fixed.
> 
> Begin forwarded message:
> 
> Date: Wed, 18 Apr 2018 01:52:59 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing 
> null pointer vulnerability.
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199429
> 
> Bug ID: 199429
>Summary: smc_shutdown(net/smc/af_smc.c) has a UAF causing null
> pointer vulnerability.
>Product: Networking
>Version: 2.5
> Kernel Version: 4.16.0-rc7
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Other
>   Assignee: step...@networkplumber.org
>   Reporter: 1773876...@qq.com
> Regression: No
> 
> Created attachment 275431
>   --> https://bugzilla.kernel.org/attachment.cgi?id=275431&action=edit
> POC
> 
> Syzkaller hit 'general protection fault in kernel_sock_shutdown' bug.
> 
> NET: Registered protocol family 43

Thanks for reporting. This fix is needed here:

 net/smc/af_smc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1314,7 +1314,7 @@ static int smc_shutdown(struct socket *s
(sk->sk_state != SMC_APPCLOSEWAIT2) &&
(sk->sk_state != SMC_APPFINCLOSEWAIT))
goto out;
-   if (smc->use_fallback) {
+   if (smc->use_fallback || sk->sk_state == SMC_LISTEN) {
rc = kernel_sock_shutdown(smc->clcsock, how);
sk->sk_shutdown = smc->clcsock->sk->sk_shutdown;
if (sk->sk_shutdown == SHUTDOWN_MASK)

Kind regards, Ursula



Re: [Regression] net/phy/micrel.c v4.9.94

2018-04-18 Thread Andrew Lunn
> If I look at the patch I think it should call kszphy_config_init() not 
> _reset()
> in the resume function:
> 
> 
> @@ -715,8 +723,14 @@ static int kszphy_suspend(struct phy_device *phydev)
> 
>  static int kszphy_resume(struct phy_device *phydev)
>  {
> + int ret;
> +
>   genphy_resume(phydev);
> 
> - ret = kszphy_config_reset(phydev);
> +   ret = kszphy_config_init(phydev);
> + if (ret)
> + return ret;
> +
> 

Hi Chris

I think there has been a patch for this posted. If I remember
correctly, the PHY you have does not call probe, hence phydev->priv is
a NULL pointer, so priv->rmii_ref_clk_sel does not work.

It would be good to find the patch and make sure it has been accepted,
and marked for stable.

Andrew


Re: [PATCH v2 bpf-next 0/3] Add missing types to bpftool, libbpf

2018-04-18 Thread Daniel Borkmann
On 04/17/2018 07:28 PM, Andrey Ignatov wrote:
> v1->v2:
> - add new types to bpftool-cgroup man page;
> - add new types to bash completion for bpftool;
> - don't add types that should not be in bpftool cgroup.
> 
> Add support for various BPF prog types and attach types that have been
> added to kernel recently but not to bpftool or libbpf yet.
> 
> Andrey Ignatov (3):
>   bpftool: Support new prog types and attach types
>   libbpf: Support guessing post_bind{4,6} progs
>   libbpf: Type functions for raw tracepoints

Applied to bpf-next, thanks Andrey!


Re: [bpf-next PATCH] samples/bpf: fix xdp_monitor user output for tracepoint exception

2018-04-18 Thread Daniel Borkmann
On 04/17/2018 04:08 PM, Jesper Dangaard Brouer wrote:
> The variable rec_i contains an XDP action code not an error.
> Thus, using err2str() was wrong, it should have been action2str().
> 
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks Jesper!


Re: [PATCH bpf-next v2 02/11] bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail

2018-04-18 Thread Jesper Dangaard Brouer
On Tue, 17 Apr 2018 21:29:42 -0700
"Nikita V. Shirokov"  wrote:

> With the bpf_xdp_adjust_tail helper, xdp's data_end pointer can be changed
> as well (only "decreasing" the pointer's location is going to be
> supported). Changing this pointer changes the packet's size.
> For generic XDP we need to reflect this length change by adjusting the
> skb's tail pointer.
> 
> Acked-by: Alexei Starovoitov 

You are missing your own Signed-off-by: line on all of the patches.

BTW, thank you for working on this! It has been on my todo-list for a
while now!

_After_ this patchset, I would like to see adding support for
"increasing" the data_end location to create a larger packet.  For that
we should likely add a data_hard_end pointer.  This, would also be
helpful in cpu_map_build_skb() to know the data_hard_end, to determine
the frame size (as some driver doesn't use PAGE_SIZE frames, ixgbe).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH] net: bridge: multicast querier per VLAN support

2018-04-18 Thread Joachim Nilsson
On Wed, Apr 18, 2018 at 03:31:57PM +0300, Nikolay Aleksandrov wrote:
> On 18/04/18 15:07, Joachim Nilsson wrote:
> > - First of all, is this patch useful to anyone
> Obviously to us as it's based on our patch. :-)
> We actually recently discussed what will be needed to make it acceptable to 
> upstream.

Great! :)

> > - The current br_multicast.c is very complex.  The support for both IPv4
> >and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
> >'br->vlan_enabled' ... this has likely been discussed before, but if
> >we could remove those code paths I believe what's left would be quite
> >a bit easier to read and maintain.
> br->vlan_enabled has a wrapper that can be used without ifdefs, as does
> br_vlan_find(), so in short - you can remove the ifdefs and use the
> wrappers; they'll degrade to always false/null when vlans are disabled.

Thanks, I'll have a look at that and prepare an RFC v2!

> > - Many per-bridge specific multicast sysfs settings may need to have a
> >corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
> >How should we go about that? (For status reporting I have a proposal)
> We'll have to add more to the per-vlan context, but yes it has to happen.
> It will be a netlink-only interface for config/retrieval, no sysfs.

Some settings are possible to do with sysfs, like multicast_query_interval
and ...

> > - Dito per-port specific multicast sysfs settings, e.g. multicast_router
> I'm not sure I follow this one, there is per-port mcast router config now ?

Sorry no, I meant we may want to add more per-VLAN settings when we get
this base patch merged.  Like router ports, we may want to be able to
set them per VLAN.

> Thanks for the effort, I see that you have done some of the required cleanups
> for this to be upstreamable, but as you've noted above we need to make it
> complete (with the per-vlan contexts and all).

There's definitely more work to be done.  Agreeing on a base set of changes
to start with is maybe the most important, as well as making it complete.

> I will review this patch in detail later and come back if there's anything.

Thank you so much for the quick feedback so far! :)

Cheers
 /Joachim




[PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-18 Thread Zhu Yanjun
When a faulty cable is used or an HCA firmware error occurs, the HCA
device goes offline. When the driver then accesses this offline device,
the following call trace pops out.

"
...
  [] dump_stack+0x63/0x81
  [] panic+0xcc/0x21b
  [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
  [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
  [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
  [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
  [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
  [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while the HCA is offline. mlx4_cmd_post
then returns the error -EIO, and on -EIO the function mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset the HCA, producing the above call trace.

This is not reasonable. Since the HCA device is offline when it is being
accessed, it should not be reset again.

In this patch, since the HCA is offline, the function mlx4_cmd_post
returns the error -EINVAL. On -EINVAL, the function mlx4_cmd_poll returns
directly instead of resetting the HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Suggested-by: Tariq Toukan 
Signed-off-by: Zhu Yanjun 
---
V1->V2: Following Tariq's advice, avoid interference from other returned
errors. The values returned by the function mlx4_cmd_post are -EIO and
-EINVAL: on -EIO the HCA device should be reset; -EINVAL means that
mlx4_cmd_post is accessing an offline device, so it is not necessary to
reset the HCA. Go to label out directly.
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..df735b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 
in_param, u64 out_param,
 * Device is going through error recovery
 * and cannot accept commands.
 */
+   mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+   ret = -EINVAL;
goto out;
}
 
@@ -610,8 +612,11 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
 
err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0);
-   if (err)
+   if (err) {
+   if (err == -EINVAL)
+   goto out;
goto out_reset;
+   }
 
end = msecs_to_jiffies(timeout) + jiffies;
while (cmd_pending(dev) && time_before(jiffies, end)) {
@@ -710,8 +715,11 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
 
err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
in_modifier, op_modifier, op, context->token, 1);
-   if (err)
+   if (err) {
+   if (err == -EINVAL)
+   goto out;
goto out_reset;
+   }
 
if (op == MLX4_CMD_SENSE_PORT) {
ret_wait =
-- 
2.7.4



Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)

2018-04-18 Thread Jesper Dangaard Brouer
On Tue, 17 Apr 2018 15:34:38 +0100
Quentin Monnet  wrote:

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 350459c583de..3d329538498f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1276,6 +1276,50 @@ union bpf_attr {
>   *   Return
>   *   0 on success, or a negative error in case of failure.
>   *
> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
> + *   Description
> + *   Redirect the packet to the endpoint referenced by *map* at
> + *   index *key*. Depending on its type, his *map* can contain
^^^

"his" -> "this"

> + *   references to net devices (for forwarding packets through other
> + *   ports), or to CPUs (for redirecting XDP frames to another CPU;
> + *   but this is only implemented for native XDP (with driver
> + *   support) as of this writing).
> + *
> + *   All values for *flags* are reserved for future usage, and must
> + *   be left at zero.
> + *   Return
> + *   **XDP_REDIRECT** on success, or **XDP_ABORT** on error.
> + *

"XDP_ABORT" -> "XDP_ABORTED"

I don't know if it's worth mentioning in the doc/man-page that for XDP,
using bpf_redirect_map() is a HUGE performance advantage compared to
the bpf_redirect() call?
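
For readers unfamiliar with the difference, here is a minimal XDP sketch
using a devmap with bpf_redirect_map() (the map name, slot layout and the
samples-style helper declarations are illustrative assumptions, not part
of any patch in this thread):

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Hypothetical devmap: user space fills each slot with the ifindex of
 * an egress device.
 */
struct bpf_map_def SEC("maps") tx_ports = {
	.type = BPF_MAP_TYPE_DEVMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 64,
};

SEC("xdp")
int xdp_redirect_fast(struct xdp_md *ctx)
{
	int key = 0;	/* slot chosen for illustration */

	/* The map lookup and device resolution happen here, in the
	 * driver's native XDP path; plain bpf_redirect() resolves the
	 * target device later, per packet, in the generic path.
	 */
	return bpf_redirect_map(&tx_ports, key, 0);
}

char _license[] SEC("license") = "GPL";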

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Willem de Bruijn
On Wed, Apr 18, 2018 at 7:17 AM, Paolo Abeni <pab...@redhat.com> wrote:
> On Tue, 2018-04-17 at 16:00 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <will...@google.com>
>>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP. A process can concatenate and
>> submit multiple datagrams to the same destination in one send call
>> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
>> or passing an analogous cmsg at send time.
>>
>> The stack will send the entire large (up to network layer max size)
>> datagram through the protocol layer. At the GSO layer, it is broken
>> up in individual segments. All receive the same network layer header
>> and UDP src and dst port. All but the last segment have the same UDP
>> header, but the last may differ in length and checksum.
>
> This is interesting, thanks for sharing!
>
> I have some local patches somewhere implementing UDP GRO, but I never
> tried to upstream them, since I lacked the associated GSO and I thought
> that the use-case was not too relevant.
>
> Given that your use-case is a connected socket - no per packet route
> lookup - how does GSO performs compared to plain sendmmsg()? Have you
> considered using and/or improving the latter?
>
> When testing with Spectre/Meltdown mitigation in places, I expect that
> the most relevant part of the gain is due to the single syscall per
> burst.

The main benefit is actually not route lookup avoidance. Somewhat to
my surprise. The benchmark can be run both in connected and
unconnected ('-u') mode. Both saturate the cpu cycles, so only showing
throughput:

[connected] udp tx:825 MB/s   588336 calls/s  14008 msg/s
[unconnected] udp tx:711 MB/s   506646 calls/s  12063 msg/s

This corresponds to results previously seen with other applications
of about 15%.

When looking at a perf report, there is no clear hot spot, which
indicates that the savings accrue across the protocol stack traversal.

I just hacked up a sendmmsg extension to the benchmark to verify.
Indeed that does not have nearly the same benefit as GSO:

udp tx:976 MB/s   695394 calls/s  16557 msg/s

This matches the numbers seen from TCP without TSO and GSO.
That also has few system calls, but observes per-MTU stack traversal.
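
For readers who want to try the branch, here is a minimal sender sketch
of the proposed UDP_SEGMENT socket option (the option value and the
semantics are taken from the RFC patches and may still change before any
merge; verify against the branch headers):

#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SOL_UDP
#define SOL_UDP IPPROTO_UDP	/* both are 17 on Linux */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103		/* value used by the RFC branch */
#endif

/* Send one large buffer; the stack emits it as gso_size-byte UDP
 * datagrams, amortizing the per-send stack traversal.
 */
static ssize_t send_gso_burst(int fd, const struct sockaddr_in *dst,
			      const void *buf, size_t len)
{
	int gso_size = 1400;	/* payload bytes per segment on the wire */

	if (setsockopt(fd, SOL_UDP, UDP_SEGMENT,
		       &gso_size, sizeof(gso_size)) < 0)
		return -1;

	return sendto(fd, buf, len, 0,
		      (const struct sockaddr *)dst, sizeof(*dst));
}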

I pushed the branch to my github at

  https://github.com/wdebruij/linux/tree/udpgso-20180418

and also the version I sent for RFC yesterday at

  https://github.com/wdebruij/linux/tree/udpgso-rfc-v1


Re: [PATCH bpf-next 2/9] bpf: add bpf_get_stack helper

2018-04-18 Thread kbuild test robot
Hi Yonghong,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-add-bpf_get_stack-helper/20180418-210810
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

   kernel/bpf/core.c: In function 'bpf_prog_free_deferred':
>> kernel/bpf/core.c:1714:3: error: implicit declaration of function 
>> 'put_callchain_buffers' [-Werror=implicit-function-declaration]
  put_callchain_buffers();
  ^
   cc1: some warnings being treated as errors

vim +/put_callchain_buffers +1714 kernel/bpf/core.c

  1704  
  1705  static void bpf_prog_free_deferred(struct work_struct *work)
  1706  {
  1707  struct bpf_prog_aux *aux;
  1708  int i;
  1709  
  1710  aux = container_of(work, struct bpf_prog_aux, work);
  1711  if (bpf_prog_is_dev_bound(aux))
  1712  bpf_prog_offload_destroy(aux->prog);
  1713  if (aux->prog->need_callchain_buf)
> 1714  put_callchain_buffers();
  1715  for (i = 0; i < aux->func_cnt; i++)
  1716  bpf_jit_free(aux->func[i]);
  1717  if (aux->func_cnt) {
  1718  kfree(aux->func);
  1719  bpf_prog_unlock_free(aux->prog);
  1720  } else {
  1721  bpf_jit_free(aux->prog);
  1722  }
  1723  }
  1724  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[RFC net-next PATCH 0/2] bpf: followup avoid leaking info stored in frame data on page reuse

2018-04-18 Thread Jesper Dangaard Brouer
This is a followup to fix commit:
 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")

Posting as RFC, as I want Daniel to review this before it goes in;
Daniel usually has smarter/brighter ideas of how to solve this in a
more optimal manner?

---

Jesper Dangaard Brouer (2):
  bpf: avoid clear xdp_frame area again
  bpf: disallow XDP data_meta to overlap with xdp_frame area


 net/core/filter.c |   18 ++
 1 file changed, 18 insertions(+)

--


[RFC net-next PATCH 1/2] bpf: avoid clear xdp_frame area again

2018-04-18 Thread Jesper Dangaard Brouer
Avoid clearing the xdp_frame area if this was already done by previous
invocations of bpf_xdp_adjust_head.

The xdp_adjust_head helper can be called multiple times by the
bpf_prog.  When increasing the packet header size (with a negative
offset), the kernel must assume the bpf_prog stores valuable
information here, and must not clear this information.

In case of extending the header into the xdp_frame area, the kernel
clears this area to avoid any info leaking.

The bug in the current implementation is that if the existing xdp->data
pointer has already been moved into the xdp_frame area, then memory is
cleared between the new data pointer and xdp_frame_end, which covers an
area that might contain information stored by the bpf_prog (as the
current xdp->data lies between those pointers).
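
To make the overlap concrete, here is a small user-space sketch of the
clear-length rule described above (pointer names mirror the helper, but
this is an illustration, not the kernel code):

#include <stddef.h>
#include <string.h>

static void clear_newly_exposed(unsigned char *xdp_frame_end,
				unsigned char *cur_data,
				unsigned char *new_data)
{
	size_t clearlen;

	if (new_data >= xdp_frame_end)
		return;		/* nothing of xdp_frame's area is exposed */

	/* Default: clear up to the end of the xdp_frame scratch area. */
	clearlen = xdp_frame_end - new_data;

	/* If cur_data already sits inside the scratch area, the bytes
	 * from cur_data onward were exposed by an earlier adjust_head
	 * call and may hold program data: clear only [new_data, cur_data).
	 */
	if (cur_data < xdp_frame_end)
		clearlen = new_data < cur_data ?
			   (size_t)(cur_data - new_data) : 0;

	memset(new_data, 0, clearlen);
}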

Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page 
reuse")
Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/filter.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..15e9b5477360 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2705,6 +2705,13 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, 
int, offset)
if (data < xdp_frame_end) {
unsigned long clearlen = xdp_frame_end - data;
 
+   /* Handle if prev call adjusted xdp->data into xdp_frame area */
+   if (unlikely(xdp->data < xdp_frame_end)) {
+   if (data < xdp->data)
+   clearlen = xdp->data - data;
+   else
+   clearlen = 0;
+   }
memset(data, 0, clearlen);
}
 



[RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area

2018-04-18 Thread Jesper Dangaard Brouer
If combining xdp_adjust_head and xdp_adjust_meta, it is possible
to make data_meta overlap with the area used by xdp_frame.  Another
invocation of xdp_adjust_head can then clear that area, due to the
clearing of the xdp_frame area.

The easiest solution I found was to simply not allow
xdp_buff->data_meta to overlap with the area used by xdp_frame.

Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page 
reuse")
Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/filter.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 15e9b5477360..e3623e741181 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, 
int, offset)
 data > xdp->data_end - ETH_HLEN))
return -EINVAL;
 
+   /* Disallow data_meta to use xdp_frame area */
+   if (metalen > 0 &&
+   unlikely((data - metalen) < xdp_frame_end))
+   return -EINVAL;
+
/* Avoid info leak, when reusing area prev used by xdp_frame */
if (data < xdp_frame_end) {
unsigned long clearlen = xdp_frame_end - data;
@@ -2734,6 +2739,7 @@ static const struct bpf_func_proto 
bpf_xdp_adjust_head_proto = {
 
 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
+   void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
void *meta = xdp->data_meta + offset;
unsigned long metalen = xdp->data - meta;
 
@@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, 
int, offset)
if (unlikely(meta < xdp->data_hard_start ||
 meta > xdp->data))
return -EINVAL;
+
+   /* Disallow data_meta to use xdp_frame area */
+   if (unlikely(meta < xdp_frame_end))
+   return -EINVAL;
+
if (unlikely((metalen & (sizeof(__u32) - 1)) ||
 (metalen > 32)))
return -EACCES;



[PATCH] net: qmi_wwan: add Wistron Neweb D19Q1

2018-04-18 Thread Pawel Dembicki
This modem is embedded in the D-Link DWR-960 router.
The OEM configuration states:
T: Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 2 Spd=480 MxCh= 0
D: Ver= 2.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1
P: Vendor=1435 ProdID=d191 Rev=ff.ff
S: Manufacturer=Android
S: Product=Android
S: SerialNumber=0123456789ABCDEF
C:* #Ifs= 6 Cfg#= 1 Atr=80 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=(none)
E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=(none)
E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
E: Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=(none)
E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=32ms
E: Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
E: Ad=88(I) Atr=03(Int.) MxPS= 8 Ivl=32ms
E: Ad=87(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=05(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 5 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=(none)
E: Ad=89(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E: Ad=06(O) Atr=02(Bulk) MxPS= 512 Ivl=125us

Tested on the OpenWrt distribution

Signed-off-by: Pawel Dembicki 
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index ca066b7..c853e74 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1107,6 +1107,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x1435, 0xd181, 3)},/* Wistron NeWeb D18Q1 */
{QMI_FIXED_INTF(0x1435, 0xd181, 4)},/* Wistron NeWeb D18Q1 */
{QMI_FIXED_INTF(0x1435, 0xd181, 5)},/* Wistron NeWeb D18Q1 */
+   {QMI_FIXED_INTF(0x1435, 0xd191, 4)},/* Wistron NeWeb D19Q1 */
{QMI_FIXED_INTF(0x16d8, 0x6003, 0)},/* CMOTech 6003 */
{QMI_FIXED_INTF(0x16d8, 0x6007, 0)},/* CMOTech CHE-628S */
{QMI_FIXED_INTF(0x16d8, 0x6008, 0)},/* CMOTech CMU-301 */
-- 
2.7.4



Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback

2018-04-18 Thread Jassi Brar
Hi Dave,

On Mon, Apr 16, 2018 at 11:16 PM, David Miller  wrote:
> From: jassisinghb...@gmail.com
> Date: Mon, 16 Apr 2018 12:52:16 +0530
>
>> From: Jassi Brar 
>>
>> Enable TX-irq as well during ndo_open() as we can not count upon
>> RX to arrive early enough to trigger the napi. This patch is critical
>> for installation over network.
>>
>> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
>> Signed-off-by: Jassi Brar 
>
> Applied.
>
Just to make sure, let me please mention that c009f413b79de52 and
9a00b697ce31e are very much needed in the stable kernel. Without these we
couldn't install any OS over the network.

Thanks.


[PATCH v2 2/3] lan78xx: Read LED states from Device Tree

2018-04-18 Thread Phil Elwell
Add support for DT property "microchip,led-modes", a vector of zero
to four cells (u32s) in the range 0-15, each of which sets the mode
for one of the LEDs. Some possible values are:

0=link/activity  1=link1000/activity
2=link100/activity   3=link10/activity
4=link100/1000/activity  5=link10/1000/activity
6=link10/100/activity14=off15=on

These values are given symbolic constants in a dt-bindings header.

Also use the presence of the DT property to indicate that the
LEDs should be enabled - necessary in the event that no valid OTP
or EEPROM is available.

Signed-off-by: Phil Elwell 
---
 MAINTAINERS  |  1 +
 drivers/net/usb/lan78xx.c| 35 
 include/dt-bindings/net/microchip-78xx.h | 21 +++
 3 files changed, 57 insertions(+)
 create mode 100644 include/dt-bindings/net/microchip-78xx.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b60179d..9c9bc63 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14573,6 +14573,7 @@ M:  Microchip Linux Driver Support 

 L: netdev@vger.kernel.org
 S: Maintained
 F: drivers/net/usb/lan78xx.*
+F: include/dt-bindings/net/microchip-78xx.h
 
 USB MASS STORAGE DRIVER
 M: Alan Stern 
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a823f01..f47ffea 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <dt-bindings/net/microchip-78xx.h>
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
@@ -74,6 +75,9 @@
 #define LAN78XX_EEPROM_MAGIC   (0x78A5)
 #define LAN78XX_OTP_MAGIC  (0x78F3)
 
+/* This register is specific to the LAN7800 and LAN7850 embedded PHYs */
+#define LAN78XX_PHY_LED_MODE_SELECT 29
+
 #define MII_READ  1
 #define MII_WRITE 0
 
@@ -2005,6 +2009,8 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
int ret;
u32 mii_adv;
+   u32 led_modes[4];
+   int len;
struct phy_device *phydev;
 
phydev = phy_find_first(dev->mdiobus);
@@ -2077,6 +2083,35 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
mii_adv = (u32)mii_advertise_flowctrl(dev->fc_request_control);
phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
 
+   len = of_property_read_variable_u32_array(dev->udev->dev.of_node,
+ "microchip,led-modes",
+ led_modes,
+ 0,
+ ARRAY_SIZE(led_modes));
+   if (len >= 0) {
+   u32 reg = 0;
+   int i;
+
+   for (i = 0; i < len; i++) {
+   if (led_modes[i] > 15) {
+   ret = -EINVAL;
+   goto error;
+   }
+   reg |= led_modes[i] << (i * 4);
+   }
+   for (; i < ARRAY_SIZE(led_modes); i++)
+   reg |= LAN78XX_FORCE_LED_OFF << (i * 4);
+   (void)phy_write(phydev, LAN78XX_PHY_LED_MODE_SELECT, reg);
+
+   /* Ensure the LEDs are enabled */
+   lan78xx_read_reg(dev, HW_CFG, &reg);
+   reg |= HW_CFG_LED0_EN_ | HW_CFG_LED1_EN_;
+   lan78xx_write_reg(dev, HW_CFG, reg);
+   } else if (len == -EOVERFLOW) {
+   ret = -EINVAL;
+   goto error;
+   }
+
genphy_config_aneg(phydev);
 
dev->fc_autoneg = phydev->autoneg;
diff --git a/include/dt-bindings/net/microchip-78xx.h 
b/include/dt-bindings/net/microchip-78xx.h
new file mode 100644
index 000..dcf4a43
--- /dev/null
+++ b/include/dt-bindings/net/microchip-78xx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _DT_BINDINGS_MICROCHIP_LAN78XX_H
+#define _DT_BINDINGS_MICROCHIP_LAN78XX_H
+
+/* LED modes */
+
+#define LAN78XX_LINK_ACTIVITY   0
+#define LAN78XX_LINK_1000_ACTIVITY  1
+#define LAN78XX_LINK_100_ACTIVITY   2
+#define LAN78XX_LINK_10_ACTIVITY3
+#define LAN78XX_LINK_100_1000_ACTIVITY  4
+#define LAN78XX_LINK_10_1000_ACTIVITY   5
+#define LAN78XX_LINK_10_100_ACTIVITY6
+#define LAN78XX_DUPLEX_COLLISION8
+#define LAN78XX_COLLISION   9
+#define LAN78XX_ACTIVITY10
+#define LAN78XX_AUTONEG_FAULT   12
+#define LAN78XX_FORCE_LED_OFF   14
+#define LAN78XX_FORCE_LED_ON15
+
+#endif
-- 
2.7.4



[bpf-next PATCH 3/3] bpf: add sample program to trace map events

2018-04-18 Thread Sebastiano Miano
This patch adds a sample program, called trace_map_events,
that shows how to capture map events and filter them based on
the map id.

The program accepts a list of map IDs, via the -i command line
option, and filters all the map events related to those IDs (i.e.,
map_create/update/lookup/next_key).
If no IDs are specified, all map events are listed and no filtering
is performed.

Sample usage:

 # trace_map_events -i  -i  -i  ...

Signed-off-by: Sebastiano Miano 
---
 samples/bpf/Makefile|4 
 samples/bpf/trace_map_events_kern.c |  225 +
 samples/bpf/trace_map_events_user.c |  314 +++
 3 files changed, 543 insertions(+)
 create mode 100644 samples/bpf/trace_map_events_kern.c
 create mode 100644 samples/bpf/trace_map_events_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..a7d52b6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -15,6 +15,7 @@ hostprogs-y += tracex6
 hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
+hostprogs-y += trace_map_events
 hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
@@ -65,6 +66,7 @@ tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_map_events-objs := bpf_load.o $(LIBBPF) trace_map_events_user.o
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
 offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
 spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
@@ -111,6 +113,7 @@ always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
+always += trace_map_events_kern.o
 always += tcbpf1_kern.o
 always += tcbpf2_kern.o
 always += tc_l2_redirect_kern.o
@@ -171,6 +174,7 @@ HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
+HOSTLOADLIBES_trace_map_events += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
diff --git a/samples/bpf/trace_map_events_kern.c 
b/samples/bpf/trace_map_events_kern.c
new file mode 100644
index 000..f887b5b
--- /dev/null
+++ b/samples/bpf/trace_map_events_kern.c
@@ -0,0 +1,225 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (C) 2018 Politecnico di Torino, Italy
+ *
+ * Author: Sebastiano Miano 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ */
+
+#include 
+#include 
+#include "bpf_helpers.h"
+
+enum map_event_type {
+   MAP_CREATE = 0,
+   MAP_UPDATE = 1,
+   MAP_LOOKUP = 2,
+   MAP_NEXT_KEY = 3
+};
+
+struct map_event_data {
+   u32 map_id;
+   enum map_event_type evnt_type;
+   u32 map_type;
+};
+
+struct bpf_map_def SEC("maps") map_event_trace = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(u32),
+   .max_entries = 64,
+};
+
+struct bpf_map_def SEC("maps") filtered_ids = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(u32),
+   .max_entries = 64,
+};
+
+struct bpf_map_def SEC("maps") filter_events = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(bool),
+   .max_entries = 1,
+};
+
+/*
+ * Tracepoint format: 
/sys/kernel/debug/tracing/events/bpf/bpf_map_create/format
+ * Code in:kernel/include/trace/events/bpf.h
+ */
+struct bpf_map_create_ctx {
+   u64 pad;// First 8 bytes are not accessible by bpf code
+   u32 type;   // offset:8;size:4; signed:0;
+   u32 size_key;   // offset:12;   size:4; signed:0;
+   u32 size_value; // offset:16;   size:4; signed:0;
+   u32 max_entries;// offset:20;   size:4; signed:0;
+   u32 flags;  // offset:24;   size:4; signed:0;
+   int ufd;// offset:28;   size:4; signed:1;
+   u32 id; // offset:32;   size:4; signed:0;
+};
+
+SEC("tracepoint/bpf/bpf_map_create")
+int trace_bpf_map_create(struct bpf_map_create_ctx *ctx)
+{
+   struct map_event_data data;
+   int cpu = bpf_get_smp_processor_id();
+   bool *filter;
+   u32 key = 0, map_id = ctx->id;
+
+   filter = bpf_map_lookup_elem(&filter_events, &key);
+   if (!filter)
+   return 1;
+
+   if (!*filter)
+   goto send_event;
+
+   /*
+* If the map_id is not in the list of filtered
+* ids we 

Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache

2018-04-18 Thread Eric Dumazet


On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> This changeset extends the idea behind commit c8c8b127091b ("udp:
> under rx pressure, try to condense skbs"), trading more BH cpu
> time and memory bandwidth to decrease the load on the user space
> receiver.
> 
> At boot time we allocate a limited amount of skbs with small
> data buffer, storing them in per cpu arrays. Such skbs are never
> freed.
> 
> At run time, under rx pressure, the BH tries to copy the current
> skb contents into the cache - if the current cache skb is available,
> and the ingress skb is small enough and without any head states.
> 
> When using the cache skb, the ingress skb is dropped by the BH
> - while still hot on cache - and the cache skb is inserted into
> the rx queue, after increasing its usage count. Also, the cache
> array index is moved to the next entry.
> 
> The receive side is unmodified: in udp_recvmsg() the skb usage
> count is decreased and the skb is _not_ freed - since the
> cache keeps usage > 0. Since skb->usage is hot in the cache of the
> receiver at consume time - the receiver has just read skb->data,
> which lies in the same cacheline - the whole skb_consume_udp() becomes
> really cheap.
> 
> UDP receive performances under flood improve as follows:
> 
> NR RX queues  Kpps    Kpps    Delta (%)
>               Before  After
> 
> 1             2252    2305     2
> 2             2151    2569    19
> 4             2033    2396    17
> 8             1969    2329    18
> 
> Overall performance of the knotd DNS server under real traffic flood
> improves as follows:
> 
>               Kpps    Kpps    Delta (%)
>               Before  After
> 
>               3777    3981     5


It might be time for the knotd DNS server to finally use SO_REUSEPORT
instead of adding this bloat to the kernel?

Sorry, a 5% improvement when you can easily get a 300% improvement with
no kernel change is not appealing to me :/
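
For context, the SO_REUSEPORT approach Eric refers to lets each receive
thread open its own socket bound to the same port, so the kernel spreads
incoming datagrams across the sockets instead of funneling them through
one receive queue. A rough sketch (function name and error handling are
illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int open_reuseport_socket(unsigned short port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	/* Must be set on every socket before bind(). */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT,
		       &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}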





Re: [PATCH bpf-next v2 01/11] bpf: adding bpf_xdp_adjust_tail helper

2018-04-18 Thread Alexei Starovoitov
On Tue, Apr 17, 2018 at 09:29:41PM -0700, Nikita V. Shirokov wrote:
> Adding a new bpf helper which would allow us to manipulate
> xdp's data_end pointer and reduce the packet's size.
> Intended use case: to generate ICMP messages from XDP context,
> where such a message would contain the truncated original packet.
> ---

your SOB is now missing in all patches.



Re: [PATCH bpf-next v3 01/11] bpf: adding bpf_xdp_adjust_tail helper

2018-04-18 Thread Alexei Starovoitov
On Tue, Apr 17, 2018 at 09:42:13PM -0700, Nikita V. Shirokov wrote:
> Adding a new bpf helper which would allow us to manipulate
> xdp's data_end pointer and reduce the packet's size.
> Intended use case: to generate ICMP messages from XDP context,
> where such a message would contain the truncated original packet.
> 
> Signed-off-by: Nikita V. Shirokov 

Acked-by: Alexei Starovoitov 

patches look great. thanks!



Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback

2018-04-18 Thread David Miller
From: Jassi Brar 
Date: Wed, 18 Apr 2018 18:27:59 +0530

> Just to make sure, let me please mention that c009f413b79de52 and
> 9a00b697ce31e are very much needed in the stable kernel. Without these we
> couldn't install any OS over the network.

Ok, both patches in this series queued up for -stable.


Re: [net 1/1] tipc: fix infinite loop when dumping link monitor summary

2018-04-18 Thread David Miller
From: Jon Maloy 
Date: Tue, 17 Apr 2018 21:58:27 +0200

> From: Tung Nguyen 
> 
> When configuring the number of used bearers to MAX_BEARER and issuing
> command "tipc link monitor summary", the command enters infinite loop
> in user space.
> 
> This issue happens because function tipc_nl_node_dump_monitor() returns
> the wrong 'prev_bearer' value when all potential monitors have been
> scanned.
> 
> The correct behavior is to always try to scan all monitors until either
> the netlink message is full, in which case we return the bearer identity
> of the affected monitor, or we continue through the whole bearer array
> until we can return MAX_BEARERS. This solution also caters for the case
> where there may be gaps in the bearer array.
> 
> Signed-off-by: Tung Nguyen 
> Signed-off-by: Jon Maloy 

Applied.


Re: [net 1/1] tipc: fix use-after-free in tipc_nametbl_stop

2018-04-18 Thread David Miller
From: Jon Maloy 
Date: Tue, 17 Apr 2018 21:25:42 +0200

> When we delete a service item in tipc_nametbl_stop() we loop over
> all service ranges in the service's RB tree, and for each service
> range we loop over its pertaining publications while calling
> tipc_service_remove_publ() for each of them.
> 
> However, tipc_service_remove_publ() has the side effect that it also
> removes the comprising service range item when there are no publications
> left. This leads to a "use-after-free" access when the inner loop
> continues to the next iteration, since the range item holding the list
> we are looping no longer exists.
> 
> We fix this by moving the delete of the service range item outside
> the said function. Instead, we now let the two functions calling it
> test if the list is empty and perform the removal when that is the
> case.
> 
> Reported-by: syzbot+d64b64afc55660106...@syzkaller.appspotmail.com
> Signed-off-by: Jon Maloy 

Applied.


[PATCH bpf] tools/bpf: fix test_sock and test_sock_addr.sh failure

2018-04-18 Thread Yonghong Song
The bpf selftests test_sock and test_sock_addr.sh failed
on my test machine. The failure looks like:
$ ./test_sock
Test case: bind4 load with invalid access: src_ip6 .. [PASS]
Test case: bind4 load with invalid access: mark .. [PASS]
Test case: bind6 load with invalid access: src_ip4 .. [PASS]
Test case: sock_create load with invalid access: src_port .. [PASS]
Test case: sock_create load w/o expected_attach_type (compat mode) .. [FAIL]
Test case: sock_create load w/ expected_attach_type .. [FAIL]
Test case: attach type mismatch bind4 vs bind6 .. [FAIL]
...
Summary: 4 PASSED, 12 FAILED
$ ./test_sock_addr.sh
Wait for testing IPv4/IPv6 to become available .
ERROR: Timeout waiting for test IP to become available.

In test_sock, bpf program loads failed due to hitting memlock limits.
In test_sock_addr.sh, my test machine is an IPv6-only test box, and using
"ping" without specifying the address family for an IPv6 address does
not work.

This patch fixes the issue by including the header bpf_rlimit.h in
test_sock.c and test_sock_addr.c, and by specifying the address family
for the ping command.

Cc: Andrey Ignatov 
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_sock.c   | 1 +
 tools/testing/selftests/bpf/test_sock_addr.c  | 1 +
 tools/testing/selftests/bpf/test_sock_addr.sh | 4 ++--
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sock.c 
b/tools/testing/selftests/bpf/test_sock.c
index 73bb20c..f4d99fa 100644
--- a/tools/testing/selftests/bpf/test_sock.c
+++ b/tools/testing/selftests/bpf/test_sock.c
@@ -13,6 +13,7 @@
 #include 
 
 #include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
 
 #ifndef ARRAY_SIZE
 # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c 
b/tools/testing/selftests/bpf/test_sock_addr.c
index d488f20..2950f80 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -15,6 +15,7 @@
 #include 
 
 #include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
 
 #define CG_PATH"/foo"
 #define CONNECT4_PROG_PATH "./connect4_prog.o"
diff --git a/tools/testing/selftests/bpf/test_sock_addr.sh 
b/tools/testing/selftests/bpf/test_sock_addr.sh
index c6e1dcf..9832a87 100755
--- a/tools/testing/selftests/bpf/test_sock_addr.sh
+++ b/tools/testing/selftests/bpf/test_sock_addr.sh
@@ -4,7 +4,7 @@ set -eu
 
 ping_once()
 {
-   ping -q -c 1 -W 1 ${1%%/*} >/dev/null 2>&1
+   ping -${1} -q -c 1 -W 1 ${2%%/*} >/dev/null 2>&1
 }
 
 wait_for_ip()
@@ -13,7 +13,7 @@ wait_for_ip()
echo -n "Wait for testing IPv4/IPv6 to become available "
for _i in $(seq ${MAX_PING_TRIES}); do
echo -n "."
-   if ping_once ${TEST_IPv4} && ping_once ${TEST_IPv6}; then
+   if ping_once 4 ${TEST_IPv4} && ping_once 6 ${TEST_IPv6}; then
echo " OK"
return
fi
-- 
2.9.5



Re: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.

2018-04-18 Thread Stephen Hemminger
On Wed, 18 Apr 2018 13:46:20 +0200
Ursula Braun  wrote:

> On 04/18/2018 04:56 AM, Stephen Hemminger wrote:
> > This may already be fixed.
> > 
> > Begin forwarded message:
> > 
> > Date: Wed, 18 Apr 2018 01:52:59 +
> > From: bugzilla-dae...@bugzilla.kernel.org
> > To: step...@networkplumber.org
> > Subject: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing 
> > null pointer vulnerability.
> > 
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=199429
> > 
> > Bug ID: 199429
> >Summary: smc_shutdown(net/smc/af_smc.c) has a UAF causing null
> > pointer vulnerability.
> >Product: Networking
> >Version: 2.5
> > Kernel Version: 4.16.0-rc7
> >   Hardware: All
> > OS: Linux
> >   Tree: Mainline
> > Status: NEW
> >   Severity: normal
> >   Priority: P1
> >  Component: Other
> >   Assignee: step...@networkplumber.org
> >   Reporter: 1773876...@qq.com
> > Regression: No
> > 
> > Created attachment 275431  
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=275431&action=edit
> > POC
> > 
> > Syzkaller hit 'general protection fault in kernel_sock_shutdown' bug.
> > 
> > NET: Registered protocol family 43  
> 
> Thanks for reporting. This fix is needed here:
> 
>  net/smc/af_smc.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1314,7 +1314,7 @@ static int smc_shutdown(struct socket *s
>   (sk->sk_state != SMC_APPCLOSEWAIT2) &&
>   (sk->sk_state != SMC_APPFINCLOSEWAIT))
>   goto out;
> - if (smc->use_fallback) {
> + if (smc->use_fallback || sk->sk_state == SMC_LISTEN) {
>   rc = kernel_sock_shutdown(smc->clcsock, how);
>   sk->sk_shutdown = smc->clcsock->sk->sk_shutdown;
>   if (sk->sk_shutdown == SHUTDOWN_MASK)
> 
> Kind regards, Ursula
> 

Please submit patch to linux net with proper signed-off-by and Fixes tags.
The maintainer (davem) will take care of getting this into upstream and stable.


Re: [PATCH] net: don't use kvzalloc for DMA memory

2018-04-18 Thread Mikulas Patocka


On Wed, 18 Apr 2018, Eric Dumazet wrote:

> 
> 
> On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> > The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> > fails (later patches change it to kvzalloc).
> > 
> > The problem with this is that if the vzalloc function is actually used, 
> > virtio_net doesn't work (because it expects that the extra memory should 
> > be accessible with DMA-API and memory allocated with vzalloc isn't).
> > 
> > This patch changes it back to kzalloc and adds a warning if the allocated
> > size is too large (the allocation is unreliable in this case).
> > 
> > Signed-off-by: Mikulas Patocka 
> > Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> > 
> > ---
> >  net/core/dev.c |3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/net/core/dev.c
> > ===
> > --- linux-2.6.orig/net/core/dev.c   2018-04-16 21:08:36.0 +0200
> > +++ linux-2.6/net/core/dev.c2018-04-18 16:24:43.0 +0200
> > @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
> > /* ensure 32-byte alignment of whole construct */
> > alloc_size += NETDEV_ALIGN - 1;
> >  
> > -   p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > +   WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> > +   p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > if (!p)
> > return NULL;
> >  
> > 
> 
> Since when a net_device needs to be in DMA zone ???
> 
> I would rather fix virtio_net, this looks very suspect to me.
> 
> Each virtio_net should probably allocate the exact amount of DMA-memory it 
> wants,
> instead of expecting core networking stack to have a huge chunk of DMA-memory 
> for everything.

The structure net_device is followed by arbitrary driver-specific data 
(accessible with the function netdev_priv). And for virtio-net, this 
driver-specific data must be in DMA memory.
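
For readers following the thread, the layout Mikulas describes is the
standard alloc_netdev()/netdev_priv() contract; below is a sketch with a
made-up driver struct (not virtio_net's actual private data):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Hypothetical driver-private state. virtio_net keeps its equivalent
 * of this struct in the same allocation as struct net_device, which is
 * why the backing memory of that allocation matters to it.
 */
struct foo_priv {
	u32 ring_head;
	u32 ring_tail;
};

static int foo_probe_sketch(void)
{
	struct net_device *dev;
	struct foo_priv *priv;

	/* sizeof(struct foo_priv) bytes are reserved in the same
	 * contiguous block, immediately after struct net_device ...
	 */
	dev = alloc_etherdev(sizeof(struct foo_priv));
	if (!dev)
		return -ENOMEM;

	/* ... and netdev_priv() simply returns a pointer past the
	 * (aligned) struct net_device into that same block.
	 */
	priv = netdev_priv(dev);
	priv->ring_head = 0;
	priv->ring_tail = 0;

	free_netdev(dev);
	return 0;
}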

Mikulas



Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-04-18 Thread John Fastabend
On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
> 
> let me revive this old thread...
> 
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>>  wrote:
 After the qdisc lock was dropped in pfifo_fast we allow multiple
 enqueue threads and dequeue threads to run in parallel. On the
 enqueue side the skb bit ooo_okay is used to ensure all related
 skbs are enqueued in-order. On the dequeue side though there is
 no similar logic. What we observe is with fewer queues than CPUs
 it is possible to re-order packets when two instances of
 __qdisc_run() are running in parallel. Each thread will dequeue
 a skb and then whichever thread calls the ndo op first will
 be sent on the wire. This doesn't typically happen because
 qdisc_run() is usually triggered by the same core that did the
 enqueue. However, drivers will trigger __netif_schedule()
 when queues are transitioning from stopped to awake using the
 netif_tx_wake_* APIs. When this happens netif_schedule() calls
 qdisc_run() on the same CPU that did the netif_tx_wake_* which
 is usually done in the interrupt completion context. This CPU
 is selected with the irq affinity which is unrelated to the
 enqueue operations.
>>>
>>> Interesting. Why this is unique to pfifo_fast? For me it could
>>> happen to other qdisc's too, when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the locked case. Or if it is happening, it's infrequent
>> enough that no one noticed the OOO packets.
> 
> I think the above could not happen due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
> 

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
> 
> I've benchmarked this patch is some different scenario, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
> 
> before: 27674032 pps
> after: 23809052 pps

Yeah more atomic ops :/

> 
> I spend some time searching a way to improve this, without success.
> 
> John, did you had any chance to look at this again?
> 

If we have multiple cores pulling from the same skb list and
feeding the same txq, this happens. One problem is that even if the
normal dev_queue_xmit path is aligned, drivers call netif_schedule
from interrupt context, and that happens on an arbitrary CPU. When
the arbitrary CPU runs the netif_schedule logic, it will dequeue
from the skb list using the CPU it was scheduled on.

The lockless case is not _really_ lockless after this patch we
have managed to pull apart the enqueue and dequeue serialization
though.

Thanks for bringing this up. I'll think about it for a bit; maybe
there is something we can do here. There is a set of conditions
that, if met, let us run without the lock. Possibly ONETXQUEUE and
an aligned cpu_map are sufficient. We could detect this case and drop
the locking. For existing systems and high-Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.

.John

> Thanks,
> 
> Paolo
> 



[PATCH bpf-next v2 9/9] tools/bpf: add a test_progs test case for bpf_get_stack helper

2018-04-18 Thread Yonghong Song
The test_stacktrace_map test is enhanced to call bpf_get_stack
in the bpf program to get the stack trace as well.
The stack traces from bpf_get_stack and bpf_get_stackid
are compared to ensure that for the same stack, as
represented by the same hash, the ip addresses
are the same.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_progs.c  | 41 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c | 20 +--
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index faadbe2..8aa2844 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -865,9 +865,39 @@ static int compare_map_keys(int map1_fd, int map2_fd)
return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd)
+{
+   int max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+   __u32 key, next_key, *cur_key_p, *next_key_p;
+   char val_buf1[max_len], val_buf2[max_len];
+   int i, err;
+
+   cur_key_p = NULL;
+   next_key_p = &next_key;
+   while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+   err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+   if (err)
+   return err;
+   err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+   if (err)
+   return err;
+   for (i = 0; i < max_len; i++) {
+   if (val_buf1[i] != val_buf2[i])
+   return -1;
+   }
+   key = *next_key_p;
+   cur_key_p = &key;
+   next_key_p = &next_key;
+   }
+   if (errno != ENOENT)
+   return -1;
+
+   return 0;
+}
+
 static void test_stacktrace_map()
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_map.o";
int bytes, efd, err, pmu_fd, prog_fd;
struct perf_event_attr attr = {};
@@ -925,6 +955,10 @@ static void test_stacktrace_map()
if (stackmap_fd < 0)
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (stack_amap_fd < 0)
+   goto disable_pmu;
+
/* give some time for bpf program run */
sleep(1);
 
@@ -946,6 +980,11 @@ static void test_stacktrace_map()
  "err %d errno %d\n", err, errno))
goto disable_pmu_noerr;
 
+   err = compare_stack_ips(stackmap_fd, stack_amap_fd);
+   if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu_noerr;
+
goto disable_pmu_noerr;
 disable_pmu:
error_cnt++;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_map.c 
b/tools/testing/selftests/bpf/test_stacktrace_map.c
index 76d85c5d..f83c7b6 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_map.c
@@ -19,14 +19,21 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__u32),
.value_size = sizeof(__u32),
-   .max_entries = 1,
+   .max_entries = 16384,
 };
 
 struct bpf_map_def SEC("maps") stackmap = {
.type = BPF_MAP_TYPE_STACK_TRACE,
.key_size = sizeof(__u32),
.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
-   .max_entries = 1,
+   .max_entries = 16384,
+};
+
+struct bpf_map_def SEC("maps") stack_amap = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
+   .max_entries = 16384,
 };
 
 /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
@@ -44,7 +51,10 @@ struct sched_switch_args {
 SEC("tracepoint/sched/sched_switch")
 int oncpu(struct sched_switch_args *ctx)
 {
+   __u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
__u32 key = 0, val = 0, *value_p;
+   void *stack_p;
+
 
value_p = bpf_map_lookup_elem(&control_map, &key);
if (value_p && *value_p)
@@ -52,8 +62,12 @@ int oncpu(struct sched_switch_args *ctx)
 
/* The size of stackmap and stackid_hmap should be the same */
key = bpf_get_stackid(ctx, &stackmap, 0);
-   if ((int)key >= 0)
+   if ((int)key >= 0) {
bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+   stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+   if (stack_p)
+   bpf_get_stack(ctx, stack_p, max_len, 0);
+   }
 
return 0;
 }
-- 
2.9.5



[PATCH bpf-next v2 7/9] samples/bpf: add a test for bpf_get_stack helper

2018-04-18 Thread Yonghong Song
The test attaches a kprobe program to the kernel function sys_write.
It tests getting the stack for user space, for kernel space, and for
user space with a build_id request. It also tests getting the user
and kernel stacks into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application checks that sys_write/SyS_write
is part of the stack.

Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/trace_get_stack_kern.c |  86 +
 samples/bpf/trace_get_stack_user.c | 150 +
 3 files changed, 240 insertions(+)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..94e7b10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
+hostprogs-y += trace_get_stack
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+trace_get_stack-objs := bpf_load.o $(LIBBPF) trace_get_stack_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
+always += trace_get_stack_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_trace_get_stack += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/trace_get_stack_kern.c 
b/samples/bpf/trace_get_stack_kern.c
new file mode 100644
index 000..665e4ad
--- /dev/null
+++ b/samples/bpf/trace_get_stack_kern.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK 100
+struct stack_trace_t {
+   int pid;
+   int kern_stack_size;
+   int user_stack_size;
+   int user_stack_buildid_size;
+   u64 kern_stack[MAX_STACK];
+   u64 user_stack[MAX_STACK];
+   struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(u32),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(struct stack_trace_t),
+   .max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") rawdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = MAX_STACK * sizeof(u64) * 2,
+   .max_entries = 1,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   int max_len, max_buildid_len, usize, ksize, total_size;
+   struct stack_trace_t *data;
+   void *raw_data;
+   u32 key = 0;
+
+   data = bpf_map_lookup_elem(&stackdata_map, &key);
+   if (!data)
+   return 0;
+
+   max_len = MAX_STACK * sizeof(u64);
+   max_buildid_len = MAX_STACK * sizeof(struct bpf_stack_build_id);
+   data->pid = bpf_get_current_pid_tgid();
+   data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+   data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+   BPF_F_USER_STACK);
+   data->user_stack_buildid_size = bpf_get_stack(
+   ctx, data->user_stack_buildid, max_buildid_len,
+   BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+   bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+   /* write both kernel and user stacks to the same buffer */
+   raw_data = bpf_map_lookup_elem(&rawdata_map, &key);
+   if (!raw_data)
+   return 0;
+
+   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+   if (usize < 0)
+   return 0;
+
+   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+   if (ksize < 0)
+   return 0;
+
+   total_size = usize + ksize;
+   if (total_size > 0 && total_size <= max_len)
+   bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+ 

[PATCH bpf-next v2 5/9] tools/bpf: add bpf_get_stack helper to tools headers

2018-04-18 Thread Yonghong Song
Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 19 +--
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465..63a4529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
  * other bits - reserved
  * Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - number of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  * calculate csum diff
  * @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -855,11 +867,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d9..acaed02 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int 
end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+   (void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



[PATCH bpf-next v2 1/9] bpf: change prototype for stack_map_get_build_id_offset

2018-04-18 Thread Yonghong Song
This patch does not incur any functionality change. The function prototype
is changed so that the same function can be reused later.

Signed-off-by: Yonghong Song 
---
 kernel/bpf/stackmap.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct 
*vma,
return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
- struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
  u64 *ips, u32 trace_nr, bool user)
 {
int i;
struct vm_area_struct *vma;
-   struct bpf_stack_build_id *id_offs;
-
-   bucket->nr = trace_nr;
-   id_offs = (struct bpf_stack_build_id *)bucket->data;
 
/*
 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct 
bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
-   stack_map_get_build_id_offset(map, new_bucket, ips,
- trace_nr, user);
+   new_bucket->nr = trace_nr;
+   stack_map_get_build_id_offset(
+   (struct bpf_stack_build_id *)new_bucket->data,
+   ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5



[PATCH bpf-next v2 3/9] bpf/verifier: refine retval R0 state for bpf_get_stack helper

2018-04-18 Thread Yonghong Song
The special property of the return values of the helpers bpf_get_stack
and bpf_probe_read_str is captured in the verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid conditions such as "retval > bufsize"
in the bpf program. For example, for the code below,
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0 || usize > max_len)
return 0;
The verifier may have the following errors:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (bf) r2 = r1
57: (77) r2 >>= 32
58: (25) if r2 > 0x31f goto pc+33
 R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
 umax_value=18446744069414584320,
 var_off=(0x0; 0xffffffff00000000))
 R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0) R9=inv800 R10=fp0,call_-1
59: (1f) r9 -= r8
60: (c7) r1 s>>= 32
61: (bf) r2 = r7
62: (0f) r2 += r1
math between map_value pointer and register with unbounded
min value is not allowed
The failure is due to an llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for a condition while later on "r1"
is used for a map_ptr operation. The verifier is not able to track such
an instruction sequence effectively.

Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
52: (85) call bpf_get_stack#65
 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
 R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R9_w=inv800 R10=fp0,call_-1
53: (b7) r1 = 0
54: (bf) r8 = r0
55: (67) r8 <<= 32
56: (c7) r8 s>>= 32
57: (6d) if r1 s> r8 goto pc+24
 R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
 R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
 R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
 R10=fp0,call_-1
58: (bf) r2 = r7
59: (0f) r2 += r8
60: (1f) r9 -= r8
61: (bf) r1 = r6

Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aba9425..a8302c3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2333,10 +2333,32 @@ static int prepare_func_exit(struct bpf_verifier_env 
*env, int *insn_idx)
return 0;
 }
 
+static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
+  int func_id,
+  struct bpf_reg_state *retval_state,
+  bool is_check)
+{
+   struct bpf_reg_state *src_reg, *dst_reg;
+
+   if (ret_type != RET_INTEGER ||
+   (func_id != BPF_FUNC_get_stack &&
+func_id != BPF_FUNC_probe_read_str))
+   return;
+
+   dst_reg = is_check ? retval_state : &regs[BPF_REG_0];
+   if (func_id == BPF_FUNC_get_stack)
+   src_reg = is_check ? &regs[BPF_REG_3] : retval_state;
+   else
+   src_reg = is_check ? &regs[BPF_REG_2] : retval_state;
+
+   dst_reg->smax_value = src_reg->smax_value;
+   dst_reg->umax_value = src_reg->umax_value;
+}
+
 static int check_helper_call(struct bpf_verifier_env *env, int func_id, int 
insn_idx)
 {
const struct bpf_func_proto *fn = NULL;
-   struct bpf_reg_state *regs;
+   struct bpf_reg_state *regs, retval_state;
struct bpf_call_arg_meta meta;
bool changes_data;
int i, err;
@@ -2415,6 +2437,10 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
}
 
regs = cur_regs(env);
+
+   /* before reset caller saved regs, check special ret value */
+   do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 1);
+
/* reset caller saved regs */
for (i = 0; i < CALLER_SAVED_REGS; i++) {
mark_reg_not_init(env, regs, caller_saved[i]);
@@ -2456,6 +2482,9 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
return -EINVAL;
}
 
+   /* apply additional constraints to ret value */
+   do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 0);
+
err = check_map_func_compatibility(env, meta.map_ptr, func_id);
if (err)
return err;
-- 
2.9.5



[PATCH bpf-next v2 0/9] bpf: add bpf_get_stack helper

2018-04-18 Thread Yonghong Song
Currently, the stackmap and the bpf_get_stackid helper are provided
for bpf programs to get the stack trace. This approach has
a limitation though: if two stack traces have the same hash,
only one will get stored in the stackmap table, regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from the user's perspective.

This patch set implements a new helper, bpf_get_stack, which sends
stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through
a shared map or bpf_perf_event_output.

Patches #1 and #2 implement the core kernel support.
Patches #3 and #4 are two verifier improvements to make
bpf programming easier. Patch #5 syncs the new helper
to tools headers. Patches #6 and #7 add a test in
samples/bpf by attaching to a kprobe. Patch #8 adds
a verifier test in tools/bpf for the new verifier change,
and patch #9 adds a test by attaching to a tracepoint.

Changelogs:
  v1 -> v2:
. fix compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (9):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose perf_event functions to bpf_load.c
  samples/bpf: add a test for bpf_get_stack helper
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test_progs test case for bpf_get_stack helper

 include/linux/bpf.h   |   1 +
 include/linux/filter.h|   3 +-
 include/uapi/linux/bpf.h  |  19 ++-
 kernel/bpf/core.c |   5 +
 kernel/bpf/stackmap.c |  80 ++--
 kernel/bpf/syscall.c  |  10 ++
 kernel/bpf/verifier.c |  35 -
 kernel/trace/bpf_trace.c  |  50 +++-
 samples/bpf/Makefile  |   4 +
 samples/bpf/bpf_load.c| 104 +++
 samples/bpf/bpf_load.h|   5 +
 samples/bpf/trace_get_stack_kern.c|  86 +
 samples/bpf/trace_get_stack_user.c| 150 ++
 samples/bpf/trace_output_user.c   | 113 ++--
 tools/include/uapi/linux/bpf.h|  19 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 tools/testing/selftests/bpf/test_progs.c  |  41 +-
 tools/testing/selftests/bpf/test_stacktrace_map.c |  20 ++-
 tools/testing/selftests/bpf/test_verifier.c   |  45 +++
 19 files changed, 669 insertions(+), 123 deletions(-)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

-- 
2.9.5



[PATCH bpf-next v2 4/9] bpf/verifier: improve register value range tracking with ARSH

2018-04-18 Thread Yonghong Song
When a helper like bpf_get_stack returns an int value
that is later used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0)
With this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With the better range for "R8", when "R8" is later added to another
register, e.g., a map pointer or scalar-value register, a better range
can be derived for the result and verifier failures may be avoided.

In the example below, taken from a later patch in this series,
..
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
..
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by the verifier.
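
For reference, the LSH/ARSH pair in the log above is simply what LLVM
emits to sign-extend a 32-bit value held in a 64-bit BPF register;
roughly the C equivalent (illustrative only):

	/* mirrors the BPF instruction pair: r <<= 32 (BPF_LSH), then
	 * r s>>= 32 (BPF_ARSH); relies on the usual arithmetic right
	 * shift of signed values
	 */
	static long long sext32(int v)
	{
		long long r = (long long)v << 32;	/* BPF_LSH by 32 */
		return r >> 32;				/* BPF_ARSH by 32 */
	}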

Signed-off-by: Yonghong Song 
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8302c3..6148d31 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2944,6 +2944,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
__update_reg_bounds(dst_reg);
break;
case BPF_RSH:
+   case BPF_ARSH:
if (umax_val >= insn_bitness) {
/* Shifts greater than 31 or 63 are undefined.
 * This includes shifts by a negative number.
-- 
2.9.5



[PATCH bpf-next v2 8/9] tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH

2018-04-18 Thread Yonghong Song
The test_verifier already has a few ARSH test cases.
This patch adds a new test case which takes advantage of the newly
improved verifier behavior for bpf_get_stack and ARSH.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_verifier.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b..cd595ba 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11423,6 +11423,51 @@ static struct bpf_test tests[] = {
.errstr = "BPF_XADD stores into R2 packet",
.prog_type = BPF_PROG_TYPE_XDP,
},
+   {
+   "bpf_get_stack return R0 within range",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 28),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+   BPF_MOV64_IMM(BPF_REG_9, sizeof(struct test_val)),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_3, sizeof(struct test_val)),
+   BPF_MOV64_IMM(BPF_REG_4, 256),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
+   BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_8, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_8, 32),
+   BPF_JMP_REG(BPF_JSLT, BPF_REG_1, BPF_REG_8, 16),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_8),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_9),
+   BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),
+   BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 32),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_1),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+   BPF_MOV64_IMM(BPF_REG_5, sizeof(struct test_val)),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_5),
+   BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_1, 4),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_MOV64_REG(BPF_REG_3, BPF_REG_9),
+   BPF_MOV64_IMM(BPF_REG_4, 0),
+   BPF_EMIT_CALL(BPF_FUNC_get_stack),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map2 = { 4 },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5



Re: [PATCH bpf-next v4 07/10] bpf: btf: Add pretty print support to the basic arraymap

2018-04-18 Thread Jakub Kicinski
On Wed, 18 Apr 2018 17:20:11 +0200, Daniel Borkmann wrote:
> Hi Martin,
> 
> first of all great work on the set! One issue that puzzled me
> while digesting it; more below.
> 
> On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> > This patch adds pretty print support to the basic arraymap.
> > Support for other bpf maps can be added later.
> > 
> > This patch adds new attrs to the BPF_MAP_CREATE command to allow
> > specifying the btf_fd, btf_key_id and btf_value_id.  The
> > BPF_MAP_CREATE can then associate the btf to the map if
> > the creating map supports BTF.  
> 
> Feels like this patch is doing two things at once, meaning i)
> attaching btf object to map object through bpf syscall at map
> creation time, and ...
> 
> > A BTF supported map needs to implement two new map ops,
> > map_seq_show_elem() and map_check_btf().  This patch has
> > implemented these new map ops for the basic arraymap.
> > 
> > It also adds file_operations to the pinned map
> > such that the pinned map can be opened and read.  
> 
> ... ii) pretty print map dump via bpf fs for array map.
> 
> > Here is a sample output when reading a pinned arraymap
> > with the following map's value:
> > 
> > struct map_value {
> > int count_a;
> > int count_b;
> > };
> > 
> > cat /sys/fs/bpf/pinned_array_map:
> > 
> > 0: {1,2}
> > 1: {3,4}
> > 2: {5,6}
> > ...  
> 
> Rather than adding this to the bpf fs itself, why not add full BTF
> support for the main debugging and introspection tool we have and
> ship with the kernel for BPF, namely bpftool? You can already dump
> the whole map's key/value pairs via the following command from a
> pinned file:
> 
>   bpftool map dump pinned /sys/fs/bpf/pinned_array_map
> 
> And given we already export the BTF info in your earlier patch through
> the BPF_OBJ_GET_INFO_BY_FD, this would fit perfectly for bpftool
> integration instead where the pretty-print which is done through the
> extra cb map_seq_show_elem (which only does a map lookup and print
> anyway) in this patch can simply all be done in user space w/o any
> additional kernel complexity.
> 
> Aside that this would be very valuable there it would also nicely
> demonstrate usage of it, but more importantly we could avoid implementing
> such pretty-print callback in the kernel for every other map type and
> then having two locations where a user now needs to go for debugging
> (bpftool being one, and cat of pinned file the other; this split seems
> confusing from a user perspective, imho, but also single key lookup +
> pretty-print cannot be realized with the latter whereas it's trivial
> with bpftool).
> 
> The same could be done for bpftool map lookup, updates, deletions, etc
> where the key resp. key/value pair can be specified through a struct
> like initializer from cmdline. (But dump/lookup should be good enough
> starting point initially.) Thoughts?

I felt the same way but I'm obviously biased so I didn't say anything ;)

I'm concerned about exposing the map data in this arbitrary format.
This will be a kernel ABI.  And the format even though it looks nice
and concise for an array may not be as good for more complex maps where
key is not an int.  Will the key for hashes be specified as fields in {}?
Why the discrepancy between array and hash then (the array's key isn't)?
This print type is also not any standard notation AFAIU.  In bpftool
user can just ask to render JSON, including actual field names.  Or we
can add other formats if we feel like or change the default one, it's
not kernel ABI.  I'd hate to see people deploying regexps to parse the
output of pinned map dump when there is modern tooling available...
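
For example (the -j/-p JSON switches exist in bpftool today; the exact
output shape below is illustrative and version dependent):

  # bpftool -p map dump pinned /sys/fs/bpf/pinned_array_map
  [{
       "key": ["0x00","0x00","0x00","0x00"],
       "value": ["0x01","0x00","0x00","0x00",
                 "0x02","0x00","0x00","0x00"]
   }, ...
  ]

With the BTF exported via BPF_OBJ_GET_INFO_BY_FD, bpftool could render
those value bytes with their actual field names (count_a/count_b)
entirely in user space.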

How useful is it really to provide a dump on a pinned file?  I guess
this is mostly for beginners?  Is this really more handy than bpftool?


Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions

2018-04-18 Thread David Ahern
On 4/18/18 11:06 AM, Eric Dumazet wrote:
> 
> 
> On 04/17/2018 05:33 PM, David Ahern wrote:
>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>> addresses and routes) use GFP_KERNEL.
>>
>> Signed-off-by: David Ahern 
>> ---
> 
> 
> Hmmm
> 
> ipv6_ifa_notify() calls __ipv6_ifa_notify() under 
> rcu_read_lock_bh()/rcu_read_unlock_bh()
> 
> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
> 
> 

Thanks for catching that. Will add fix to my followup set.
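
(As a reminder of the rule at play here; schematic only, not the
actual addrconf code path:)

	rcu_read_lock_bh();
	/* this section must not sleep; GFP_KERNEL may sleep, so only
	 * GFP_ATOMIC is safe here
	 */
	p = kmalloc(size, GFP_ATOMIC);
	/* p = kmalloc(size, GFP_KERNEL);   <- would be a bug */
	rcu_read_unlock_bh();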


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread David Miller
From: Paolo Abeni 
Date: Wed, 18 Apr 2018 13:17:54 +0200

> When testing with Spectre/Meltdown mitigation in places, I expect
> that the most relevant part of the gain is due to the single syscall
> per burst.

I was going to say exactly this.

Batching schemes that were borderline beneficial before spectre/meltdown
are more worthwhile now.
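
(For comparison: the pure syscall-batching half of this saving is
already available today via sendmmsg(); a minimal user-space sketch
under that assumption, with made-up buffer handling. The GSO series
additionally amortizes the per-packet traversal of the stack:)

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send n equal-sized UDP datagrams from buf with a single syscall */
static int send_burst(int fd, char *buf, size_t seg, unsigned int n)
{
	struct mmsghdr msgs[64];
	struct iovec iovs[64];
	unsigned int i;

	if (n > 64)
		n = 64;
	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < n; i++) {
		iovs[i].iov_base = buf + i * seg;
		iovs[i].iov_len = seg;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	/* one syscall, n datagrams, each with its own UDP header */
	return sendmmsg(fd, msgs, n, 0);
}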


Re: SRIOV switchdev mode BoF minutes

2018-04-18 Thread Andy Gospodarek
On Wed, Apr 18, 2018 at 09:26:34AM -0700, Jakub Kicinski wrote:
> On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote:
> > > A similar issue exists on multi-host for PFs, right?  If one of the
> > > hosts is down do we still show their PF repr?  IMHO yes.  
> > 
> > I would agree with that as well.  With today's model the VF reps are
> > created once a PF is put into switchdev mode, but I'm still working out
> > how we want to decide whether a PF rep for the other domains is
> > created locally, and also how one can determine which domain is in
> > control.
> > 
> > Permanent config options (like NVRAM settings) could easily handle which
> > domain is in control, but that still does not mean that PF reps must be
> > created automatically, does it?
> 
> The control domain is tricky.  I'm not sure I understand how you could
> not have a PF rep for remote domains, though.  How do you configure
> switching to the PF netdev if there is no rep?

Yes, for complete control of all traffic using standard Linux APIs a PF
rep is a requirement.


Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions

2018-04-18 Thread Eric Dumazet


On 04/18/2018 10:10 AM, David Ahern wrote:
> On 4/18/18 11:06 AM, Eric Dumazet wrote:
>>
>>
>> On 04/17/2018 05:33 PM, David Ahern wrote:
>>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>>> addresses and routes) use GFP_KERNEL.
>>>
>>> Signed-off-by: David Ahern 
>>> ---
>>
>>
>> Hmmm
>>
>> ipv6_ifa_notify() calls __ipv6_ifa_notify() under 
>> rcu_read_lock_bh()/rcu_read_unlock_bh()
>>
>> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
>>
>>
> 
> Thanks for catching that. Will add fix to my followup set.
> 
BTW, I am not sure why we use rcu_read_lock_bh()/rcu_read_unlock_bh() there :/

Maybe it is no longer needed.



Re: [PATCH iproute2 net-next] vxlan: fix ttl inherit behavior

2018-04-18 Thread Stephen Hemminger
On Wed, 18 Apr 2018 13:10:49 +0800
Hangbin Liu  wrote:

> Hi Stephen,
> 
> The patch's subject contains "fix", but the kernel feature is applied on
> net-next, so I'm not sure if iproute2 net-next is suitable. If you are
> OK with the patch, please feel free to apply it on whichever branch you
> think is suitable.
> 
> Thanks
> Hangbin
> 
> On 18 April 2018 at 13:05, Hangbin Liu  wrote:
> > Like kernel net-next commit 72f6d71e491e6 ("vxlan: add ttl inherit 
> > support"),
> > vxlan ttl inherit should means inherit the inner protocol's ttl value.
> >
> > But currently when we add a vxlan with "ttl inherit", we only set ttl 0,
> > which actually means "use whatever the default value is" rather than
> > inheriting the inner protocol's ttl value.
> >
> > To distinguish ttl inherit from ttl == 0, we add an attribute
> > IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is specified. And we use
> > "ttl auto" to mean "use whatever default value", the same behavior as
> > ttl == 0.
> >
> > Reported-by: Jianlin Shi 
> > Suggested-by: Jiri Benc 
> > Signed-off-by: Hangbin Liu   

When davem merges the feature into net-next, dsa will merge this into
iproute2-next.  We hold off merging into iproute2 because kernel review
feedback often causes API changes.


Re: [PATCH] net: don't use kvzalloc for DMA memory

2018-04-18 Thread Eric Dumazet


On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> fails (later patches change it to kvzalloc).
> 
> The problem with this is that if the vzalloc function is actually used, 
> virtio_net doesn't work (because it expects that the extra memory should 
> be accessible with DMA-API and memory allocated with vzalloc isn't).
> 
> This patch changes it back to kzalloc and adds a warning if the allocated
> size is too large (the allocation is unreliable in this case).
> 
> Signed-off-by: Mikulas Patocka 
> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> 
> ---
>  net/core/dev.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/net/core/dev.c
> ===
> --- linux-2.6.orig/net/core/dev.c 2018-04-16 21:08:36.0 +0200
> +++ linux-2.6/net/core/dev.c  2018-04-18 16:24:43.0 +0200
> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
>   /* ensure 32-byte alignment of whole construct */
>   alloc_size += NETDEV_ALIGN - 1;
>  
> - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> + p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>   if (!p)
>   return NULL;
>  
> 

Since when does a net_device need to be in a DMA zone???

I would rather fix virtio_net, this looks very suspect to me.

Each virtio_net device should probably allocate the exact amount of
DMA-memory it wants, instead of expecting the core networking stack to
set aside a huge chunk of DMA-memory for everything.


Re: [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area

2018-04-18 Thread Daniel Borkmann
On 04/18/2018 02:10 PM, Jesper Dangaard Brouer wrote:
> If combining xdp_adjust_head and xdp_adjust_meta, then it is possible
> to make data_meta overlap with area used by xdp_frame.  And another
> invocation of xdp_adjust_head can then clear that area, due to
> clearing of xdp_frame area.
> 
> The easiest solution I found was to simply not allow
> xdp_buff->data_meta to overlap with area used by xdp_frame.

Thanks Jesper! Trying to answer both emails in one. :) More below.

> Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page 
> reuse")
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  net/core/filter.c |   11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 15e9b5477360..e3623e741181 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2701,6 +2711,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>data > xdp->data_end - ETH_HLEN))
>   return -EINVAL;
>  
> + /* Disallow data_meta to use xdp_frame area */
> + if (metalen > 0 &&
> + unlikely((data - metalen) < xdp_frame_end))
> + return -EINVAL;
> +
>   /* Avoid info leak, when reusing area prev used by xdp_frame */
>   if (data < xdp_frame_end) {

Effectively, when metalen > 0 the data_meta pointer sits below the data
pointer, so the above test on the new data_meta might be better. But it
feels like a bit of a workaround: we handle a moving data pointer while
disallowing a moving data_meta pointer, whereas both could be handled if
we wanted to go that path.

>   unsigned long clearlen = xdp_frame_end - data;
> @@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>  
>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>  {
> + void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>   void *meta = xdp->data_meta + offset;
>   unsigned long metalen = xdp->data - meta;
>  
> @@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>   if (unlikely(meta < xdp->data_hard_start ||
>meta > xdp->data))
>   return -EINVAL;
> +
> + /* Disallow data_meta to use xdp_frame area */
> + if (unlikely(meta < xdp_frame_end))
> + return -EINVAL;

(Ditto.)

>   if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>(metalen > 32)))
>   return -EACCES;

The other, perhaps less invasive/complex option would be to just disallow
moving anything into the area previously used by struct xdp_frame. My original
concern was that not all drivers use 256 bytes of headroom, e.g. afaik the
i40e and ixgbe have around 192 bytes of headroom available, but that should
actually still be plenty of space for encap + meta data, and potentially
with meta data use I would expect that at best custom decap would be
happening when pushing the packet up the stack. So might as well disallow
going into that region and not worry about it. Thus, reverting e9e9545e10d3
("xdp: avoid leaking info stored in frame data on page reuse") and adding
something like the below (uncompiled) should just do it:

diff --git a/net/core/filter.c b/net/core/filter.c
index 3bb0cb9..ad98ddd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2692,8 +2692,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)

 BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 {
+   void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
unsigned long metalen = xdp_get_metalen(xdp);
-   void *data_start = xdp->data_hard_start + metalen;
+   void *data_start = frame_end + metalen;
void *data = xdp->data + offset;

if (unlikely(data < data_start ||
@@ -2719,13 +2720,13 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {

 BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 {
+   void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
void *meta = xdp->data_meta + offset;
unsigned long metalen = xdp->data - meta;

if (xdp_data_meta_unsupported(xdp))
return -ENOTSUPP;
-   if (unlikely(meta < xdp->data_hard_start ||
-meta > xdp->data))
+   if (unlikely(meta < frame_end || meta > xdp->data))
return -EINVAL;
if (unlikely((metalen & (sizeof(__u32) - 1)) ||
 (metalen > 32)))

On top of that, we could even store a bool in struct xdp_rxq_info saying
whether the driver is actually able to participate in, respectively has
support for, XDP_REDIRECT, and if not do something like:

void *frame_end = xdp->data_hard_start + (xdp->rxq->has_redir ?
		  sizeof(struct xdp_frame) : 0);

But the latter is merely a small optimization. Eventually we want all native XDP
drivers to support it. Thoughts?
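
(To visualize the invariant such a check enforces; layout sketch, not
from the patch:)

/*
 *  data_hard_start
 *  |<- sizeof(struct xdp_frame) ->|<------- usable headroom ------->|
 *  |  reserved: may be overwritten|  ... data_meta ... data ... data_end
 *  |  when converting to xdp_frame|
 *
 * neither data_meta nor data may move left of
 * data_hard_start + sizeof(struct xdp_frame)
 */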

Thanks,
Daniel


Re: [RFC PATCH] net: bridge: multicast querier per VLAN support

2018-04-18 Thread Nikolay Aleksandrov
On April 18, 2018 6:54:07 PM GMT+03:00, Stephen Hemminger  wrote:
>On Wed, 18 Apr 2018 16:14:26 +0300
>Nikolay Aleksandrov  wrote:
>
>> On 18/04/18 16:07, Joachim Nilsson wrote:
>> > On Wed, Apr 18, 2018 at 03:31:57PM +0300, Nikolay Aleksandrov wrote:
>> >> On 18/04/18 15:07, Joachim Nilsson wrote:
>> >>> - First of all, is this patch useful to anyone
>> >> Obviously to us as it's based on our patch. :-)
>> >> We actually recently discussed what will be needed to make it
>> >> acceptable to upstream.
>> >
>> > Great! :)
>> >
>> >>> - The current br_multicast.c is very complex.  The support for both
>> >>>   IPv4 and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING
>> >>>   and 'br->vlan_enabled' ... this has likely been discussed before,
>> >>>   but if we could remove those code paths I believe what's left would
>> >>>   be quite a bit easier to read and maintain.
>> >> br->vlan_enabled has a wrapper that can be used without ifdefs, as does
>> >> br_vlan_find(), so in short - you can remove the ifdefs and use the
>> >> wrappers, they'll degrade to always false/null when vlans are disabled.
>> >
>> > Thanks, I'll have a look at that and prepare an RFC v2!
>> >
>> >>> - Many per-bridge specific multicast sysfs settings may need to have a
>> >>>   corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
>> >>>   How should we go about that? (For status reporting I have a proposal)
>> >> We'll have to add more to the per-vlan context, but yes it has to
>> >> happen. It will be only a netlink interface for config/retrieval,
>> >> no sysfs.
>> >
>> > Some settings are possible to do with sysfs, like
>> > multicast_query_interval and ...
>>
>> We want to avoid sysfs in general, all of networking config and stats
>> are moving to netlink. It is better controlled and structured for such
>> changes, also provides nice interfaces for automatic type checks etc.
>>
>> Also (but a minor reason) there is no tree/entity in sysfs for the vlans
>> where to add this. It will either have to be a file which does some
>> format string hack (like us currently) or will need to add a new tree
>> for them which I'd really like to avoid for the bridge.
>
>In general, all bridge attributes need to show in netlink and sysfs.
>Sysfs is easier for scripting from languages.

True, but vlans and per-vlan settings have never been exposed via sysfs,
only through netlink.  I'd like to avoid adding a directory tree with
potentially 4k vlan entries multiplied by the number of per-vlan
attributes.

There is already vlan config infrastructure via netlink.


Re: SRIOV switchdev mode BoF minutes

2018-04-18 Thread Jakub Kicinski
On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote:
> > A similar issue exists on multi-host for PFs, right?  If one of the
> > hosts is down do we still show their PF repr?  IMHO yes.  
> 
> I would agree with that as well.  With today's model the VF reps are
> created once a PF is put into switchdev mode, but I'm still working out
> how we want to decide whether a PF rep for the other domains is
> created locally, and also how one can determine which domain is in
> control.
> 
> Permanent config options (like NVRAM settings) could easily handle which
> domain is in control, but that still does not mean that PF reps must be
> created automatically, does it?

The control domain is tricky.  I'm not sure I understand how you could
not have a PF rep for remote domains, though.  How do you configure
switching to the PF netdev if there is no rep?

