Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-03 Thread Ben Pfaff
On Sat, Aug 04, 2018 at 02:43:24AM +0200, Stefano Brivio wrote:
> On Fri, 3 Aug 2018 16:01:08 -0700
> Ben Pfaff  wrote:
> > I would be very pleased if we could integrate a simple mechanism for
> > fairness, based for now on some simple criteria like the source port,
> > but thinking ahead to how we could later make it gracefully extensible
> > to consider more general and possibly customizable criteria.
> 
> We could change the patch so that instead of just using the vport for
> round-robin queue insertion, we generalise that and use "buckets"
> instead of vports, and have a set of possible functions that are called
> instead of using port_no directly in ovs_dp_upcall_queue_roundrobin(),
> making this configurable via netlink, per datapath.
> 
> We could implement selection based on source port or a hash on the
> source 5-tuple, and the relevant bits of
> ovs_dp_upcall_queue_roundrobin() would look like this:

[...]

> What do you think?

I'd support that.  Thanks.
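
To make the bucket idea concrete, here is a minimal sketch of what such a selector could look like. Everything in it (the enum, the datapath fields upcall_bucket_mode and upcall_nr_buckets, and the use of skb_get_hash()) is invented for the example and is not code from the RFC patch; it only illustrates a per-datapath, netlink-configurable mapping from upcalls to round-robin buckets:

enum dp_upcall_bucket_mode {
	DP_UPCALL_BUCKET_VPORT,		/* today's behaviour: one bucket per vport */
	DP_UPCALL_BUCKET_FLOW_HASH,	/* hash of the (source) 5-tuple */
};

/* Map an upcall to a bucket; the mode and bucket count would be set via
 * new netlink attributes on the datapath (upcall_nr_buckets assumed > 0). */
static u32 upcall_bucket(const struct datapath *dp, struct sk_buff *skb,
			 u32 port_no)
{
	switch (dp->upcall_bucket_mode) {
	case DP_UPCALL_BUCKET_FLOW_HASH:
		return skb_get_hash(skb) % dp->upcall_nr_buckets;
	case DP_UPCALL_BUCKET_VPORT:
	default:
		return port_no % dp->upcall_nr_buckets;
	}
}

Keeping the fast path down to one switch statement like this would leave room to add further modes later without touching the queueing logic itself.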


Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-03 Thread Ben Pfaff
On Fri, Aug 03, 2018 at 06:52:41PM +0200, Stefano Brivio wrote:
> On Tue, 31 Jul 2018 15:06:57 -0700 Ben Pfaff  wrote:
> > My current thought is that any fairness scheme we implement directly in
> > the kernel is going to need to evolve over time.  Maybe we could do
> > something flexible with BPF and maps, instead of hard-coding it.
> 
> Honestly, I fail to see what else we might want to do here, other than
> adding a simple mechanism for fairness, to solve the specific issue at
> hand. Flexibility would probably come at a higher cost. We could easily
> make limits configurable if needed. Do you have anything else in mind?

I think that a simple mechanism for fairness is fine.  The direction of
extensibility that makes me anxious is how to decide what matters for
fairness.  So far, we've talked about per-vport fairness.  That works
pretty well for packets coming in from virtual interfaces where each
vport represents a separate VM.  It does not work well if the traffic
filling your queues all comes from a single physical port because some
source of traffic is sending traffic at a high rate.  In that case,
you'll do a lot better if you do fairness based on the source 5-tuple.
But if you're doing network virtualization, then the outer source
5-tuples won't necessarily vary much and you'd be better off looking at
the VNI and maybe some Geneve TLV options and maybe the inner 5-tuple...
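
To sketch what "looking past the outer headers" could mean in code (the helper is hypothetical; it assumes the upcall fires after tunnel decapsulation so skb_get_hash() already reflects the inner flow, and it only folds in the tunnel ID, leaving Geneve options and explicit inner-5-tuple extraction aside):

static u32 upcall_bucket_key(const struct sw_flow_key *key, struct sk_buff *skb)
{
	u32 hash = skb_get_hash(skb);	/* inner flow hash after decap */

	/* For tunnelled traffic, mix in the VNI/tunnel ID so that flows
	 * arriving behind a single physical port still spread across buckets. */
	if (key->tun_key.tun_flags & TUNNEL_KEY)
		hash = jhash_2words(hash, (u32)be64_to_cpu(key->tun_key.tun_id), 0);
	return hash;
}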

I would be very pleased if we could integrate a simple mechanism for
fairness, based for now on some simple criteria like the source port,
but thinking ahead to how we could later make it gracefully extensible
to consider more general and possibly customizable criteria.

Thanks,

Ben.


Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-07-31 Thread Ben Pfaff
On Tue, Jul 31, 2018 at 07:43:34PM +, Matteo Croce wrote:
> On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce  wrote:
> >
> > On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar  wrote:
> > >
> > > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce  wrote:
> > > > From: Stefano Brivio 
> > > >
> > > > Open vSwitch sends to userspace all received packets that have
> > > > no associated flow (thus doing an "upcall"). Then the userspace
> > > > program creates a new flow and determines the actions to apply
> > > > based on its configuration.
> > > >
> > > > When a single port generates a high rate of upcalls, it can
> > > > prevent other ports from dispatching their own upcalls. vswitchd
> > > > overcomes this problem by creating many netlink sockets for each
> > > > port, but it quickly exceeds any reasonable maximum number of
> > > > open files when dealing with huge numbers of ports.
> > > >
> > > > This patch queues all the upcalls into a list, ordering them in
> > > > a per-port round-robin fashion, and schedules a deferred work to
> > > > queue them to userspace.
> > > >
> > > > The algorithm to queue upcalls in a round-robin fashion,
> > > > provided by Stefano, is based on these two rules:
> > > >  - upcalls for a given port must be inserted after all the other
> > > >    occurrences of upcalls for the same port already in the queue,
> > > >    in order to avoid out-of-order upcalls for a given port
> > > >  - insertion happens once the highest upcall count for any given
> > > >    port (excluding the one currently at hand) is greater than the
> > > >    count for the port we're queuing to -- if this condition is
> > > >    never true, the upcall is queued at the tail. This results in a
> > > >    per-port round-robin order.
> > > >
> > > > In order to implement a fair round-robin behaviour, a variable
> > > > queueing delay is introduced. This will be zero if the upcall
> > > > rate is below a given threshold, and grows linearly with the
> > > > queue utilisation (i.e. the upcall rate) otherwise.
> > > >
> > > > This ensures fairness among ports under load and with few
> > > > netlink sockets.
> > > >
> > > Thanks for the patch.
> > > This patch adds the following overhead for upcall handling:
> > > 1. kmalloc.
> > > 2. global spin-lock.
> > > 3. context switch to single worker thread.
> > > I think this could become a bottleneck on most multi-core systems.
> > > You have mentioned issues with the existing fairness mechanism; can you
> > > elaborate on those? I think we could improve that before implementing
> > > heavyweight fairness in upcall handling.
> >
> > Hi Pravin,
> >
> > vswitchd allocates N * P netlink sockets, where N is the number of
> > online CPU cores, and P the number of ports.
> > With some setups, this number can grow quite fast, even exceeding the
> > system maximum file descriptor limit.
> > I've seen a 48-core server fail with -EMFILE when trying to create
> > more than 65535 netlink sockets needed for handling 1800+ ports.
> >
> > I made a previous attempt to reduce the sockets to one per CPU, but
> > this was discussed and rejected on ovs-dev because it would remove
> > fairness among ports[1].
> > I think that the current approach of opening a huge number of sockets
> > doesn't really work (it certainly doesn't scale), and we still need some
> > queueing logic (either in kernel or user space) if we really want to
> > be sure that low-traffic ports get their upcall quota when other
> > ports are doing way more traffic.
> >
> > If you are concerned about the kmalloc or spinlock, we can solve them
> > with kmem_cache or two copies of the list and RCU. I'll be happy to
> > discuss the implementation details, as long as we all agree that the
> > current implementation doesn't scale well and has an issue.
> >
> > [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html
> >
> > --
> > Matteo Croce
> > per aspera ad upstream
> 
> Hi all,
> 
> any idea on how to solve the file descriptor limit hit by the netlink sockets?
> I see this issue happen very often, and raising the FD limit to 400k
> does not seem like the right way to solve it.
> Any other suggestion on how to improve the patch, or solve the problem
> in a different way?

This is an awkward problem to try to solve with sockets because of the
nature of sockets, which are strictly first-in first-out.  What you
really want is something closer to the algorithm that we use in
ovs-vswitchd to send packets to an OpenFlow controller.  When the
channel becomes congested, then for each packet to be sent to the
controller, OVS appends it to a queue associated with its input port.
(This could be done on a more granular basis than just port.)  If the
maximum number of queued packets is reached, then OVS discards a packet
from the longest queue.  When space becomes available in the channel,
OVS round-robins through the queues to send a packet.  This achieves
pretty good fairness but it can't be done with sockets because you can't
drop a packet that has already been queued to a socket.
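
In user-space terms the scheme reads roughly like the toy below. All names and fixed sizes are made up for the example (ovs-vswitchd's real code is organised differently); it only shows the three moves described above: append to the input port's queue, drop from the longest queue when the overall limit is hit, and round-robin on dequeue.

#include <stddef.h>
#include <string.h>

#define NPORTS    4
#define PER_QUEUE 8
#define MAX_TOTAL 16

struct queue { int pkts[PER_QUEUE]; size_t len; };

static struct queue queues[NPORTS];     /* one queue per input port */
static size_t total;                    /* packets across all queues */
static size_t rr_next;                  /* round-robin cursor for dequeue */

static void drop_head(struct queue *q)
{
	memmove(q->pkts, q->pkts + 1, (q->len - 1) * sizeof q->pkts[0]);
	q->len--;
	total--;
}

/* Called for each packet destined to the controller while congested;
 * in_port must be < NPORTS. */
static void enqueue(size_t in_port, int pkt)
{
	if (total >= MAX_TOTAL) {            /* overall limit reached: */
		size_t longest = 0;          /* discard from the longest queue */
		for (size_t i = 1; i < NPORTS; i++)
			if (queues[i].len > queues[longest].len)
				longest = i;
		drop_head(&queues[longest]);
	}
	if (queues[in_port].len < PER_QUEUE) {  /* else the new packet is dropped */
		queues[in_port].pkts[queues[in_port].len++] = pkt;
		total++;
	}
}

/* Called whenever the channel to the controller can take another packet. */
static int dequeue(int *pkt)
{
	for (size_t i = 0; i < NPORTS; i++) {
		struct queue *q = &queues[(rr_next + i) % NPORTS];
		if (q->len) {
			rr_next = (rr_next + i + 1) % NPORTS;
			*pkt = q->pkts[0];
			drop_head(q);
			return 1;
		}
	}
	return 0;                           /* all queues empty */
}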

Re: [ovs-dev] Pravin Shelar

2017-12-27 Thread Ben Pfaff
On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
> The email address pshe...@nicira.com listed for Pravin Shelar in
> MAINTAINERS (OPENVSWITCH section) seems to bounce.

Pravin has used a newer address recently, so I sent out a suggested
update (for OVS):
https://patchwork.ozlabs.org/patch/853232/


Re: [PATCH net-next v4] openvswitch: enable NSH support

2017-08-22 Thread Ben Pfaff
On Tue, Aug 22, 2017 at 08:32:49AM +, Jan Scheurich wrote:
> > > Or why else does the OVS user space code take such great pains to model
> > > possible misalignment and provide/use safe access functions?
> > 
> > I don't know how the ovs user space deals with packet allocation. In
> > the kernel, the network header is aligned in a way that allows
> > efficient 32-bit access.
> 
> It seems that OVS has not had the same approach as Linux. There is no
> config parameter covering the alignment characteristics of the machine
> architecture. For packet buffers received from outside sources
> (e.g. DPDK interfaces) it makes no assumptions about alignment and plays
> it safe. For packets allocated inside OVS, the Ethernet packet is
> typically stored so that the L3 header is 32-bit aligned, so that the
> misalignment precautions would be unnecessary. But I didn't check all
> code paths.

We solved the alignment problem in OVS userspace a different way, by
defining our versions of the network protocol headers so that they only
need 16-bit alignment.  In turn, we did that by defining an
ovs_16aligned_be32 type as a pair of be16s and ovs_16aligned_be64 as
four be16s, and using helper functions for reads and writes.  This made
it harder to screw up alignment in a subtle way and only find out long
after release when someone tested a corner case on a RISC architecture.
It probably has a performance cost on those RISC architectures, for the
cases where the access really is aligned, but it's more obviously
correct and I highly value that for OVS userspace.
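
A stand-alone sketch of that idea (the names echo OVS's helpers, but the real definitions live in the OVS userspace tree and differ in detail, e.g. they keep values in network byte order throughout):

#include <stdint.h>
#include <arpa/inet.h>

/* A "32-bit" field stored as two 16-bit halves, so a struct containing
 * it only requires 2-byte alignment. */
typedef struct {
	uint16_t hi;	/* most-significant half, network byte order */
	uint16_t lo;	/* least-significant half, network byte order */
} aligned16_be32;

static inline uint32_t get_16aligned_be32(const aligned16_be32 *x)
{
	/* Reassemble a host-order value from the two big-endian halves. */
	return ((uint32_t)ntohs(x->hi) << 16) | ntohs(x->lo);
}

static inline void put_16aligned_be32(aligned16_be32 *x, uint32_t value)
{
	x->hi = htons(value >> 16);
	x->lo = htons(value & 0xffff);
}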

As far as I can tell it's not actually possible, in the general case, to
add padding such that all parts of a packet are aligned.  VXLAN is the
case that comes to mind.  With VXLAN, as far as I can tell, the 14-byte
inner Ethernet header means that you can align either the outer IPv4
header or the inner IPv4 header, but not both.  That means that no
matter how careful OVS is about aligning packets, it would still have to
deal with unaligned accesses in some cases.
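
To put numbers on that (assuming minimal headers, no IP options): outer IPv4 (20
bytes) + UDP (8) + VXLAN (8) + inner Ethernet (14) put the inner IPv4 header 50
bytes after the outer one, and 50 mod 4 = 2, so whichever of the two you place
on a 4-byte boundary, the other lands on a 2-byte offset.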

I see that the VXLAN issue has come up in Linux before:
https://www.ietf.org/mail-archive/web/nvo3/current/msg05743.html


Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

2017-08-09 Thread Ben Pfaff
On Wed, Aug 09, 2017 at 08:12:36PM +, Yang, Yi Y wrote:
> Ben, do you mean we bring two new attributes (OVS_NSH_ATTR_MD1 and
> OVS_NSH_ATTR_MD2) and embed one of them in OVS_ACTION_ATTR_ENCAP_NSH?

Yes.

> Anyway we need to use a struct or something else to transfer those
> metadata between functions; how do you think we can handle this
> without metadata in struct ovs_action_encap_nsh? I mean, how do we handle
> the arguments for the encap_nsh function?

I guess that a pointer to the embedded nlattr with type OVS_NSH_ATTR_MD1
or OVS_NSH_ATTR_MD2 should work OK.

Keep in mind that I'm not a kernel-side maintainer of any kind.  I am
only passing along what I've perceived to be common Netlink protocol
design patterns.

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
> Behalf Of Ben Pfaff
> Sent: Thursday, August 10, 2017 2:09 AM
> To: Yang, Yi Y <yi.y.y...@intel.com>
> Cc: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org; 
> netdev@vger.kernel.org; Jiri Benc <jb...@redhat.com>; da...@davemloft.net; 
> Zoltán Balogh <zoltan.bal...@ericsson.com>
> Subject: Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support
> 
> On Wed, Aug 09, 2017 at 09:41:51AM +, Yang, Yi Y wrote:
> > Hi,  Jan
> > 
> > I have worked out a patch, will send it quickly for Ben. In addition, 
> > I also will send out a patch to change encap_nsh & decap_nsh to 
> > push_nsh and pop_nsh. Per comments from all the people, we all agreed 
> > to do so :-)
> > 
> > diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
> > b/datapath/linux/compat/include/linux/openvswitch.h
> > index bc6c94b..4936c12 100644
> > --- a/datapath/linux/compat/include/linux/openvswitch.h
> > +++ b/datapath/linux/compat/include/linux/openvswitch.h
> > @@ -793,7 +793,7 @@ struct ovs_action_push_eth {
> > > struct ovs_key_ethernet addresses;
> > >  };
> > 
> > -#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> > +#define OVS_ENCAP_NSH_MAX_MD_LEN 248
> >  /*
> >   * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
> >   * @flags: NSH header flags.
> > @@ -809,7 +809,7 @@ struct ovs_action_encap_nsh {
> >  uint8_t mdlen;
> >  uint8_t np;
> >  __be32 path_hdr;
> > -uint8_t metadata[OVS_ENCAP_NSH_MAX_MD_LEN];
> > +uint8_t metadata[];
> >  };
> 
> This brings the overall format of ovs_action_encap_nsh to:
> 
> struct ovs_action_encap_nsh {
> uint8_t flags;
> uint8_t mdtype;
> uint8_t mdlen;
> uint8_t np;
> __be32 path_hdr;
> uint8_t metadata[];
> };
> 
> This is an unusual format for a Netlink attribute.  More commonly, one would 
> put variable-length data into an attribute of its own, which allows that data 
> to be handled using the regular Netlink means.  Then the mdlen and metadata 
> members could be removed, since they would be part of the additional 
> attribute, and one might expect the mdtype member to be removed as well since 
> each type of metadata would be in a different attribute type.
> 
> So, a format closer to what I expect to see in Netlink is something like
> this:
> 
> /**
>  * enum ovs_nsh_attr - Metadata attributes for %OVS_ACTION_ENCAP_NSH action.
>  *
>  * @OVS_NSH_ATTR_MD1: Contains 16-byte NSH type-1 metadata.
>  * @OVS_NSH_ATTR_MD2: Contains 0- to 255-byte variable-length NSH type-2
>  * metadata. */
> enum ovs_nsh_attr {
> OVS_NSH_ATTR_MD1,
> OVS_NSH_ATTR_MD2
> };
> 
> /*
>  * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
>  *
>  * @path_hdr: NSH service path id and service index.
>  * @flags: NSH header flags.
>  * @np: NSH next_protocol: Inner packet type.
>  *
>  * Followed by either %OVS_NSH_ATTR_MD1 or %OVS_NSH_ATTR_MD2 attribute.
>  */
> struct ovs_action_encap_nsh {
> __be32 path_hdr;
> uint8_t flags;
> uint8_t np;
> };


Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

2017-08-09 Thread Ben Pfaff
On Wed, Aug 09, 2017 at 09:41:51AM +, Yang, Yi Y wrote:
> Hi,  Jan
> 
> I have worked out a patch, will send it quickly for Ben. In addition, I also 
> will send out a patch to change encap_nsh & decap_nsh to push_nsh and pop_nsh. 
> Per comments from all the people, we all agreed to do so :-)
> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
> b/datapath/linux/compat/include/linux/openvswitch.h
> index bc6c94b..4936c12 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -793,7 +793,7 @@ struct ovs_action_push_eth {
> struct ovs_key_ethernet addresses;
>  };
> 
> -#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> +#define OVS_ENCAP_NSH_MAX_MD_LEN 248
>  /*
>   * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
>   * @flags: NSH header flags.
> @@ -809,7 +809,7 @@ struct ovs_action_encap_nsh {
>  uint8_t mdlen;
>  uint8_t np;
>  __be32 path_hdr;
> -uint8_t metadata[OVS_ENCAP_NSH_MAX_MD_LEN];
> +uint8_t metadata[];
>  };

This brings the overall format of ovs_action_encap_nsh to:

struct ovs_action_encap_nsh {
        uint8_t flags;
        uint8_t mdtype;
        uint8_t mdlen;
        uint8_t np;
        __be32 path_hdr;
        uint8_t metadata[];
};

This is an unusual format for a Netlink attribute.  More commonly, one
would put variable-length data into an attribute of its own, which
allows that data to be handled using the regular Netlink means.  Then
the mdlen and metadata members could be removed, since they would be
part of the additional attribute, and one might expect the mdtype member
to be removed as well since each type of metadata would be in a
different attribute type.

So, a format closer to what I expect to see in Netlink is something like
this:

/**
 * enum ovs_nsh_attr - Metadata attributes for %OVS_ACTION_ENCAP_NSH action.
 *
 * @OVS_NSH_ATTR_MD1: Contains 16-byte NSH type-1 metadata.
 * @OVS_NSH_ATTR_MD2: Contains 0- to 255-byte variable-length NSH type-2
 * metadata. */
enum ovs_nsh_attr {
        OVS_NSH_ATTR_MD1,
        OVS_NSH_ATTR_MD2
};

/*
 * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
 *
 * @path_hdr: NSH service path id and service index.
 * @flags: NSH header flags.
 * @np: NSH next_protocol: Inner packet type.
 *
 * Followed by either %OVS_NSH_ATTR_MD1 or %OVS_NSH_ATTR_MD2 attribute.
 */
struct ovs_action_encap_nsh {
        __be32 path_hdr;
        uint8_t flags;
        uint8_t np;
};
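
Just to illustrate how the kernel side might walk that layout, here is a sketch
(the function name, the 16-byte MD1 length check, and the hand-waving over
attribute padding are mine, not part of the proposal):

static int parse_encap_nsh(const struct nlattr *attr)
{
	const struct ovs_action_encap_nsh *enc = nla_data(attr);
	const struct nlattr *a;
	int rem;

	if (nla_len(attr) < (int)sizeof(*enc))
		return -EINVAL;

	/* The metadata attribute(s) follow the fixed header inside the same
	 * attribute payload; padding to NLA_ALIGNTO is glossed over here. */
	nla_for_each_attr(a, (const struct nlattr *)(enc + 1),
			  nla_len(attr) - sizeof(*enc), rem) {
		switch (nla_type(a)) {
		case OVS_NSH_ATTR_MD1:
			if (nla_len(a) != 16)
				return -EINVAL;
			/* ... copy the four type-1 context words ... */
			break;
		case OVS_NSH_ATTR_MD2:
			/* ... variable-length type-2 TLVs ... */
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}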


Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

2017-08-08 Thread Ben Pfaff
To be clear, the OVS implementation is a placeholder.  It will get
replaced by whatever netdev implements, and that's OK.  I didn't focus
on making it perfect because I knew that.  Instead, I just made sure it
was good enough for an internal OVS implementation that doesn't fix any
ABI or API.  OVS can even change the user-visible action names, as long
as we do that soon (and encap/decap versus push/pop doesn't matter to
me).

The considerations for netdev are different and more permanent.

On Wed, Aug 09, 2017 at 02:05:12AM +, Yang, Yi Y wrote:
> Hi, Jiri
> 
> Thank you for your comments.
> 
> __be32 c[4] is the name Ben Pfaff suggested; the original names were c1, c2,
> c3, c4. They are context data, so c seems ok, too :-)
> 
> OVS has merged it with the same name; maybe the better way is to add a
> comment /* Context data */ after it.
> 
> For MD type 2, struct ovs_key_nsh can hardly cover it; so far we don't know
> how to support MD type 2 better. Geneve defined 64 keys, tun_metadata0-63,
> to handle this; those keys are part of struct flow_tnl, and the most likely
> approach is to reuse those keys.
> 
> So for future MD type 2, we will have two sets of keys, one from struct
> ovs_key_nsh and another from struct flow_tnl; this won't break the old uAPI.
> 
> "#define OVS_ENCAP_NSH_MAX_MD_LEN 16" is changed per Ben's comment from 256, 
> Ben thinks 256 is too big but we only support MD type 1 now. We still have 
> ways to extend it, for example:
> 
> struct ovs_action_encap_nsh *oaen =
> (struct ovs_action_encap_nsh *) malloc(sizeof(struct ovs_action_encap_nsh) + ANY_SIZE);
> 
> nl_msg_put_unspec(actions, OVS_ACTION_ATTR_ENCAP_NSH,
>   oaen, sizeof(struct ovs_action_encap_nsh) + ANY_SIZE);
> 
> In addition, we also need to consider that OVS userspace code must be consistent
> with this, so keeping it intact will be better; we have a way to support
> dynamic extension when we add MD type 2 support.
> 
> About the action names: unfortunately, the userspace data plane has named them
> encap_nsh & decap_nsh. Jan, what do you think about Jiri's suggestion?
> 
> But from my understanding, encap_* & decap_* are better because they
> correspond to the generic encap & decap actions. In addition, encap semantics
> are different from push: encap just pushes an empty header with default
> values, and users must use set_field to set the content of the header.
> 
> Again, OVS userspace code must be consistent with this, so keeping it intact
> will be better because the OVS userspace code was already there.
> 
> 
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On 
> Behalf Of Jiri Benc
> Sent: Tuesday, August 8, 2017 10:28 PM
> To: Yang, Yi Y <yi.y.y...@intel.com>
> Cc: netdev@vger.kernel.org; d...@openvswitch.org; da...@davemloft.net
> Subject: Re: [PATCH net-next] openvswitch: add NSH support
> 
> On Tue,  8 Aug 2017 12:59:40 +0800, Yi Yang wrote:
> > +struct ovs_key_nsh {
> > +   __u8 flags;
> > +   __u8 mdtype;
> > +   __u8 np;
> > +   __u8 pad;
> > +   __be32 path_hdr;
> > +   __be32 c[4];
> 
> "c" is a very poor name. Please rename it to something that expresses what 
> this field contains.
> 
> Also, this looks like MD type 1 only. How are those fields going to work with 
> MD type 2? I don't think MD type 2 implementation is necessary in this patch 
> but I'd like to know how this is going to work - it's uAPI and thus set in 
> stone once this is merged. The uAPI needs to be designed with future use in 
> mind.
> 
> > +#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> > +/*
> > + * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
> > + * @flags: NSH header flags.
> > + * @mdtype: NSH metadata type.
> > + * @mdlen: Length of NSH metadata in bytes.
> > + * @np: NSH next_protocol: Inner packet type.
> > + * @path_hdr: NSH service path id and service index.
> > + * @metadata: NSH metadata for MD type 1 or 2
> > + */
> > +struct ovs_action_encap_nsh {
> > +   __u8 flags;
> > +   __u8 mdtype;
> > +   __u8 mdlen;
> > +   __u8 np;
> > +   __be32 path_hdr;
> > +   __u8 metadata[OVS_ENCAP_NSH_MAX_MD_LEN];
> 
> This is wrong. The metadata size is set to a fixed size by this and cannot
> ever be extended, or at least not easily. Netlink has attributes for exactly 
> these cases and that's what needs to be used here.
> 
> > @@ -835,6 +866,8 @@ enum ovs_action_attr {
> > OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
> > OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
> > OVS_ACTION_ATTR_

Re: [ovs-dev] [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-05 Thread Ben Pfaff
On Mon, Dec 05, 2016 at 11:52:47PM +0100, Michał Mirosław wrote:
> On Mon, Dec 05, 2016 at 10:55:45AM -0800, Ben Pfaff wrote:
> > On Mon, Dec 05, 2016 at 06:24:36PM +0100, Michał Mirosław wrote:
> > > On Sat, Dec 03, 2016 at 03:27:30PM -0800, Ben Pfaff wrote:
> > > > On Sat, Dec 03, 2016 at 10:22:28AM +0100, Michał Mirosław wrote:
> > > > > This All-in-one patch removes abuse of the VLAN CFI bit, so it can be
> > > > > passed intact through the Linux networking stack.
> > > > This appears to change the established Open vSwitch userspace API.  You
> > > > can see that simply from the way that it changes the documentation for
> > > > the userspace API.  If I'm right about that, then this change will break
> > > > all userspace programs that use the Open vSwitch kernel module,
> > > > including Open vSwitch itself.
> > > 
> > > If I understood the code correctly, it does change the expected meaning for
> > > the (unlikely?) case of a header truncated just before the VLAN TCI - it
> > > will be impossible to differentiate this case from VLAN TCI == 0.
> > > 
> > > I guess this is a problem with OVS API, because it doesn't directly show
> > > the "missing" state of elements, but relies on an "invalid" value.
> > 
> > That particular corner case should not be a huge problem in any case.
> > 
> > The real problem is that this appears to break the common case use of
> > VLANs in Open vSwitch.  After this patch, parse_vlan() in
> > net/openvswitch/flow.c copies the tpid and tci from sk_buff (either the
> > accelerated version of them or the version in the skb data) into
> > sw_flow_key members.  OK, that's fine on its own.  However, I don't see
> > any corresponding change to the code in flow_netlink.c to compensate for
> > the fact that, until now, the VLAN CFI bit (formerly VLAN_TAG_PRESENT)
> > was always required to be set to 1 in flow matches inside Netlink
> > messages sent from userspace, and the kernel always set it to 1 in
> > corresponding messages sent to userspace.
> > 
> > In other words, if I'm reading this change correctly:
> > 
> > * With a kernel before this change, userspace always had to set
> >   VLAN_TAG_PRESENT to 1 to match on a VLAN, or the kernel would
> >   reject the flow match.
> > 
> > * With a kernel after this change, userspace must not set
> >   VLAN_TAG_PRESENT to 1, otherwise the kernel will accept the flow
> >   match but nothing will ever match because packets do not actually
> >   have the CFI bit set.
> > 
> > Take a look at this code that the patch deletes from
> > validate_vlan_from_nlattrs(), for example, and see how it insisted that
> > VLAN_TAG_PRESENT was set:
> > 
> > if (!(tci & htons(VLAN_TAG_PRESENT))) {
> >         if (tci) {
> >                 OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT bit set.",
> >                           (inner) ? "C-VLAN" : "VLAN");
> >                 return -EINVAL;
> >         } else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
> >                 /* Corner case for truncated VLAN header. */
> >                 OVS_NLERR(log, "Truncated %s header has non-zero encap attribute.",
> >                           (inner) ? "C-VLAN" : "VLAN");
> >                 return -EINVAL;
> >         }
> > }
> > 
> > Please let me know if I'm overlooking something.
> 
> Hmm. So the easiest change without disrupting current userspace would be
> to flip the CFI bit on the way to/from OVS userspace. Does this seem
> correct?

That sounds correct.  (The bit should not be flipped in the mask.)

Thanks,

Ben.
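
For what it's worth, one way to read that suggestion in code (names and
placement are hypothetical; nothing like this exists in a patch yet) is a pair
of helpers applied only to TCI values, never to masks:

/* userspace -> kernel: CFI set by userspace means "tag present" */
static inline __be16 nl_tci_to_flow(__be16 uapi_tci)
{
	return uapi_tci ^ htons(VLAN_TAG_PRESENT);
}

/* kernel -> userspace: report "tag present" via the CFI bit again */
static inline __be16 flow_tci_to_nl(__be16 tci)
{
	return tci ^ htons(VLAN_TAG_PRESENT);
}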


Re: [ovs-dev] [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-05 Thread Ben Pfaff
On Mon, Dec 05, 2016 at 06:24:36PM +0100, Michał Mirosław wrote:
> On Sat, Dec 03, 2016 at 03:27:30PM -0800, Ben Pfaff wrote:
> > On Sat, Dec 03, 2016 at 10:22:28AM +0100, Michał Mirosław wrote:
> > > This All-in-one patch removes abuse of the VLAN CFI bit, so it can be passed
> > > intact through the Linux networking stack.
> > This appears to change the established Open vSwitch userspace API.  You
> > can see that simply from the way that it changes the documentation for
> > the userspace API.  If I'm right about that, then this change will break
> > all userspace programs that use the Open vSwitch kernel module,
> > including Open vSwitch itself.
> 
> If I understood the code correctly, it does change the expected meaning for
> the (unlikely?) case of a header truncated just before the VLAN TCI - it will
> be impossible to differentiate this case from VLAN TCI == 0.
> 
> I guess this is a problem with OVS API, because it doesn't directly show
> the "missing" state of elements, but relies on an "invalid" value.

That particular corner case should not be a huge problem in any case.

The real problem is that this appears to break the common case use of
VLANs in Open vSwitch.  After this patch, parse_vlan() in
net/openvswitch/flow.c copies the tpid and tci from sk_buff (either the
accelerated version of them or the version in the skb data) into
sw_flow_key members.  OK, that's fine on its own.  However, I don't see
any corresponding change to the code in flow_netlink.c to compensate for
the fact that, until now, the VLAN CFI bit (formerly VLAN_TAG_PRESENT)
was always required to be set to 1 in flow matches inside Netlink
messages sent from userspace, and the kernel always set it to 1 in
corresponding messages sent to userspace.

In other words, if I'm reading this change correctly:

* With a kernel before this change, userspace always had to set
  VLAN_TAG_PRESENT to 1 to match on a VLAN, or the kernel would
  reject the flow match.

* With a kernel after this change, userspace must not set
  VLAN_TAG_PRESENT to 1, otherwise the kernel will accept the flow
  match but nothing will ever match because packets do not actually
  have the CFI bit set.

Take a look at this code that the patch deletes from
validate_vlan_from_nlattrs(), for example, and see how it insisted that
VLAN_TAG_PRESENT was set:

if (!(tci & htons(VLAN_TAG_PRESENT))) {
        if (tci) {
                OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT bit set.",
                          (inner) ? "C-VLAN" : "VLAN");
                return -EINVAL;
        } else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
                /* Corner case for truncated VLAN header. */
                OVS_NLERR(log, "Truncated %s header has non-zero encap attribute.",
                          (inner) ? "C-VLAN" : "VLAN");
                return -EINVAL;
        }
}

Please let me know if I'm overlooking something.

Thanks,

Ben.


Re: [ovs-dev] [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-03 Thread Ben Pfaff
On Sat, Dec 03, 2016 at 10:22:28AM +0100, Michał Mirosław wrote:
> This All-in-one patch removes abuse of the VLAN CFI bit, so it can be passed
> intact through the Linux networking stack.
> 
> Signed-off-by: Michał Mirosław 
> ---
> 
> Dear NetDevs
> 
> I guess this needs to be split into the prep..convert[]..finish sequence,
> but if you like it as is, then it's ready.
> 
> The biggest question is whether the modified interface and vlan_present
> are the way to go. This can be changed to use vlan_proto != 0 instead
> of an extra flag bit.
> 
> As I can't test most of the driver changes, please look at them carefully.
> OVS and bridge eyes are especially welcome.

This appears to change the established Open vSwitch userspace API.  You
can see that simply from the way that it changes the documentation for
the userspace API.  If I'm right about that, then this change will break
all userspace programs that use the Open vSwitch kernel module,
including Open vSwitch itself.


deadlock in genetlink?

2007-09-24 Thread Ben Pfaff
[Sorry about the duplicate mail, Thomas: I got the netdev address
wrong on the first try.]

lockdep reported a circular dependency between cb_mutex and
genl_mutex, as follows:

-> #1 (nlk->cb_mutex){--..}:
   [<c0127b5d>] __lock_acquire+0x9c8/0xb98
   [<c01280f6>] lock_acquire+0x5d/0x75
   [<c027adc1>] __mutex_lock_slowpath+0xdb/0x247
   [<c027af49>] mutex_lock+0x1c/0x1f
   [<c0236aa0>] netlink_dump_start+0xa9/0x12f
  --- takes cb_mutex
   [<c0237fa6>] genl_rcv_msg+0xa3/0x14c
   [<c0235aa5>] netlink_run_queue+0x6f/0xe1
   [<c0237772>] genl_rcv+0x2d/0x4e
  --- trylocks genl_mutex
   [<c0235ef0>] netlink_data_ready+0x15/0x55
   [<c0234e98>] netlink_sendskb+0x1f/0x36
   [<c0235772>] netlink_unicast+0x1a6/0x1c0
   [<c0235ecf>] netlink_sendmsg+0x238/0x244
   [<c021a92e>] sock_sendmsg+0xcb/0xe4
   [<c021aa98>] sys_sendmsg+0x151/0x1af
   [<c021b850>] sys_socketcall+0x203/0x222
   [<c01025d2>] syscall_call+0x7/0xb
   [] 0x

-> #0 (genl_mutex){--..}:
   [<c0127a46>] __lock_acquire+0x8b1/0xb98
   [<c01280f6>] lock_acquire+0x5d/0x75
   [<c027adc1>] __mutex_lock_slowpath+0xdb/0x247
   [<c027af49>] mutex_lock+0x1c/0x1f
   [<c023717d>] genl_lock+0xd/0xf
   [<c023806f>] ctrl_dumpfamily+0x20/0xdd
  --- takes genl_mutex
   [<c0234bfa>] netlink_dump+0x50/0x168
  --- takes cb_mutex
   [<c02360f2>] netlink_recvmsg+0x15f/0x22f
   [<c021a84a>] sock_recvmsg+0xd5/0xee
   [<c021b24e>] sys_recvmsg+0xf5/0x187
   [<c021b865>] sys_socketcall+0x218/0x222
   [<c01025d2>] syscall_call+0x7/0xb
   [] 0x

The trylock in genl_rcv doesn't prevent the deadlock, because
it's not the last lock acquired.  Perhaps this can be fixed by
not acquiring the genl_mutex in ctrl_dumpfamily; it silences
lockdep, at least.  It is not clear to me what the value of
taking the mutex is there.

If this is an appropriate fix, here is a patch for it.

Signed-off-by: Ben Pfaff [EMAIL PROTECTED]

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 8c11ca4..2e79035 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -616,9 +616,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	int chains_to_skip = cb->args[0];
 	int fams_to_skip = cb->args[1];
 
-	if (chains_to_skip != 0)
-		genl_lock();
-
 	for (i = 0; i < GENL_FAM_TAB_SIZE; i++) {
 		if (i < chains_to_skip)
 			continue;
@@ -636,9 +633,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
  errout:
-	if (chains_to_skip != 0)
-		genl_unlock();
-
 	cb->args[0] = i;
 	cb->args[1] = n;
 

-- 
Ben Pfaff
Nicira Networks, Inc.