Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-09-28 Thread Pravin Shelar
On Wed, Sep 26, 2018 at 2:58 AM Stefano Brivio  wrote:
>
> Hi Pravin,
>
> On Wed, 15 Aug 2018 00:19:39 -0700
> Pravin Shelar  wrote:
>
> > I understand fairness has cost, but we need to find right balance
> > between performance and fairness. Current fairness scheme is a
> > lockless algorithm without much computational overhead, did you try to
> > improve current algorithm so that it uses less number of ports.
>
> In the end, we tried harder as you suggested, and found out a way to
> avoid the need for per-thread sets of per-vport sockets: with a few
> changes, a process-wide set of per-vport sockets is actually enough to
> maintain the current fairness behaviour.
>
> Further, we can get rid of duplicate fd events if the EPOLLEXCLUSIVE
> epoll() flag is available, which improves performance noticeably. This
> is explained in more detail in the commit message of the ovs-vswitchd
> patch Matteo wrote [1], now merged.
>
> This approach solves the biggest issue (i.e. huge amount of netlink
> sockets) in ovs-vswitchd itself, without the need for kernel changes.
>

Thanks for working on this issue.

> It doesn't address some proposed improvements though, that is, it does
> nothing to improve the current fairness scheme, it won't allow neither
> the configurable fairness criteria proposed by Ben, nor usage of BPF
> maps for extensibility as suggested by William.
>
> I would however say that those topics bear definitely lower priority,
> and perhaps addressing them at all becomes overkill now that we don't
> any longer have a serious issue.
>
> [1] https://patchwork.ozlabs.org/patch/974304/
Nice!

Thanks,
Pravin.


Re: [PATCH] [PATCH net-next] openvswitch: Use correct reply values in datapath and vport ops

2018-09-28 Thread Pravin Shelar
On Wed, Sep 26, 2018 at 11:40 AM Yifeng Sun  wrote:
>
> This patch fixes the bug that all datapath and vport ops are returning
> wrong values (OVS_FLOW_CMD_NEW or OVS_DP_CMD_NEW) in their replies.
>
> Signed-off-by: Yifeng Sun 
I am surprised this was not found earlier.

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net-next v2] openvswitch: Derive IP protocol number for IPv6 later frags

2018-09-06 Thread Pravin Shelar
On Tue, Sep 4, 2018 at 3:37 PM Yi-Hung Wei  wrote:
>
> Currently, OVS only parses the IP protocol number for the first
> IPv6 fragment, but sets the IP protocol number for the later fragments
> to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
> number for the IPV6 later frags so that we can match that.
>
> Signed-off-by: Yi-Hung Wei 
Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH] datapath.c: fix missing return value check of nla_nest_start()

2018-08-22 Thread Pravin Shelar
On Tue, Aug 21, 2018 at 4:38 PM David Miller  wrote:
>
> From: Pravin Shelar 
> Date: Tue, 21 Aug 2018 15:38:28 -0700
>
> > On Fri, Aug 17, 2018 at 1:15 AM Jiecheng Wu  wrote:
> >>
> >> Function queue_userspace_packet() defined in net/openvswitch/datapath.c 
> >> calls nla_nest_start() to allocate memory for struct nlattr which is 
> >> dereferenced immediately. As nla_nest_start() may return NULL on failure, 
> >> this code piece may cause NULL pointer dereference bug.
> >> ---
> >>  net/openvswitch/datapath.c | 4 
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> >> index 0f5ce77..ff4457d 100644
> >> --- a/net/openvswitch/datapath.c
> >> +++ b/net/openvswitch/datapath.c
> >> @@ -460,6 +460,8 @@ static int queue_userspace_packet(struct datapath *dp, 
> >> struct sk_buff *skb,
> >>
> >> if (upcall_info->egress_tun_info) {
> >> nla = nla_nest_start(user_skb, 
> >> OVS_PACKET_ATTR_EGRESS_TUN_KEY);
> >> +   if (!nla)
> >> +   return -EMSGSIZE;
> > It is not possible, since user_skb is allocated to accommodate all
> > netlink attributes.
>
> Pravin, common practice is to always check nla_*() return values even if the
> SKB is allocated with "enough space".
>
> Those calculations can have bugs, and these checks are therefore helpful to
> avoid crashes and memory corruption in such cases.
>
OK, in that case this patch needs to proper error handling.


Re: [PATCH] datapath.c: fix missing return value check of nla_nest_start()

2018-08-21 Thread Pravin Shelar
On Fri, Aug 17, 2018 at 1:15 AM Jiecheng Wu  wrote:
>
> Function queue_userspace_packet() defined in net/openvswitch/datapath.c calls 
> nla_nest_start() to allocate memory for struct nlattr which is dereferenced 
> immediately. As nla_nest_start() may return NULL on failure, this code piece 
> may cause NULL pointer dereference bug.
> ---
>  net/openvswitch/datapath.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 0f5ce77..ff4457d 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -460,6 +460,8 @@ static int queue_userspace_packet(struct datapath *dp, 
> struct sk_buff *skb,
>
> if (upcall_info->egress_tun_info) {
> nla = nla_nest_start(user_skb, 
> OVS_PACKET_ATTR_EGRESS_TUN_KEY);
> +   if (!nla)
> +   return -EMSGSIZE;
It is not possible, since user_skb is allocated to accommodate all
netlink attributes.

> err = ovs_nla_put_tunnel_info(user_skb,
>   upcall_info->egress_tun_info);
> BUG_ON(err);
> @@ -468,6 +470,8 @@ static int queue_userspace_packet(struct datapath *dp, 
> struct sk_buff *skb,
>
> if (upcall_info->actions_len) {
> nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_ACTIONS);
> +   if (!nla)
> +   return -EMSGSIZE;
same as above, the check is not required.

> err = ovs_nla_put_actions(upcall_info->actions,
>   upcall_info->actions_len,
>   user_skb);
> --
> 2.6.4
>


Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-15 Thread Pravin Shelar
Hi Stefano

On Tue, Aug 7, 2018 at 6:31 AM, Stefano Brivio  wrote:
> Hi Pravin,
>
> On Tue, 31 Jul 2018 16:12:03 -0700
> Pravin Shelar  wrote:
>
>> Rather than reducing number of thread down to 1, we could find better
>> number of FDs per port.
>> How about this simple solution:
>> 1. Allocate (N * P) FDs as long as it is under FD limit.
>> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.
>
> I still see a few quantitative issues with this approach, other than
> Ben's observation about design (which, by the way, looks entirely
> reasonable to me).
>
> We're talking about a disproportionate amount of sockets in any case.
> We can have up to 2^16 vports here, with 5k vports being rather common,
> and for any reasonable value of N that manages somehow to perturbate the
> distribution of upcalls per thread, we are talking about something well
> in excess of 100k sockets. I think this doesn't really scale.
>
My argument is not about proposed fairness algorithm. It is about cost
of the fairness and I do not see it is addressed in any of the follow
ups. You seems to be worried about memory cost and fairness aspects, I
am worried about CPU cost of the solution.
I think proposed solution is solving the fairness issue but it is also
creating bottleneck in upcall processing. OVS is known to have slower
upcall processing. This patch is adding even more cost to the upcall
handling. The latency of first packet handling is also going up with
this approach.

I revisited the original patch, here is what I see in term of added
cost to existing upcall processing:
1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" This involve allocate
and initialize memory
2. copy flow key which is more than 1 KB (upcall->key = *key)
3. Acquire spin_lock_bh dp->upcalls.lock, which would disable bottom
half processing on CPU while waiting for the global lock.
4. Iterate list of queued upcalls, one of objective it is to avoid out
of order packet. But I do not see point of ordering packets from
different streams.
5. signal upcall thread after delay ovs_dp_upcall_delay(). This adds
further to the latency.
6. upcall is then handed over to different thread (context switch),
likely on different CPU.
8. the upcall object is freed on remote CPU.
9. single lock essentially means OVS kernel datapath upcall processing
is single threaded no matter number of cores in system.

I would be interested in how are we going to address these issues.

In example you were talking about netlink fd issue on server with 48
core, how does this solution works when there are 5K ports each
triggering upcall ? Can you benchmark your patch? Do you have
performance numbers for TCP_CRR with and without this patch ? Also
publish latency numbers for this patch. Please turn off megaflow to
exercise upcall handling.

I understand fairness has cost, but we need to find right balance
between performance and fairness. Current fairness scheme is a
lockless algorithm without much computational overhead, did you try to
improve current algorithm so that it uses less number of ports.


> With the current value for N (3/4 * number of threads) this can even get
> close to /proc/sys/fs/file-max on some systems, and there raising the
> number of allowed file descriptors for ovs-vswitchd isn't a solution
> anymore.
>
> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.
>
> --
> Stefano


Re: [PATCH net-next] openvswitch: Derive IP protocol number for IPv6 later frags

2018-08-12 Thread Pravin Shelar
On Fri, Aug 10, 2018 at 10:19 AM, Yi-Hung Wei  wrote:
> Currently, OVS only parses the IP protocol number for the first
> IPv6 fragment, but sets the IP protocol number for the later fragments
> to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
> number for the IPV6 later frags so that we can match that.
>
> Signed-off-by: Yi-Hung Wei 
> ---
>  net/openvswitch/flow.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 56b8e7167790..3d654c4f71be 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -297,7 +297,13 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
> sw_flow_key *key)
>
> nh_len = payload_ofs - nh_ofs;
> skb_set_transport_header(skb, nh_ofs + nh_len);
> -   key->ip.proto = nexthdr;
> +   if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
> +   unsigned int offset = 0;
> +
> +   key->ip.proto = ipv6_find_hdr(skb, , -1, NULL, NULL);
> +   } else {
> +   key->ip.proto = nexthdr;
> +   }
parsing ipv6 ipv6_skip_exthdr() is called to find fragment hdr and
then this patch calls ipv6_find_hdr() to find next protocol. I think
we could call ipv6_find_hdr() to get fragment type and next hdr, that
would save parsing same packet twice in some cases.

Other option would be calling ipv6_find_hdr() after setting OVS_FRAG_TYPE_LATER.


Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-07-31 Thread Pravin Shelar
On Tue, Jul 31, 2018 at 12:43 PM, Matteo Croce  wrote:
> On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce  wrote:
>>
>> On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar  wrote:
>> >
>> > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce  wrote:
>> > > From: Stefano Brivio 
>> > >
>> > > Open vSwitch sends to userspace all received packets that have
>> > > no associated flow (thus doing an "upcall"). Then the userspace
>> > > program creates a new flow and determines the actions to apply
>> > > based on its configuration.
>> > >
>> > > When a single port generates a high rate of upcalls, it can
>> > > prevent other ports from dispatching their own upcalls. vswitchd
>> > > overcomes this problem by creating many netlink sockets for each
>> > > port, but it quickly exceeds any reasonable maximum number of
>> > > open files when dealing with huge amounts of ports.
>> > >
>> > > This patch queues all the upcalls into a list, ordering them in
>> > > a per-port round-robin fashion, and schedules a deferred work to
>> > > queue them to userspace.
>> > >
>> > > The algorithm to queue upcalls in a round-robin fashion,
>> > > provided by Stefano, is based on these two rules:
>> > >  - upcalls for a given port must be inserted after all the other
>> > >occurrences of upcalls for the same port already in the queue,
>> > >in order to avoid out-of-order upcalls for a given port
>> > >  - insertion happens once the highest upcall count for any given
>> > >port (excluding the one currently at hand) is greater than the
>> > >count for the port we're queuing to -- if this condition is
>> > >never true, upcall is queued at the tail. This results in a
>> > >per-port round-robin order.
>> > >
>> > > In order to implement a fair round-robin behaviour, a variable
>> > > queueing delay is introduced. This will be zero if the upcalls
>> > > rate is below a given threshold, and grows linearly with the
>> > > queue utilisation (i.e. upcalls rate) otherwise.
>> > >
>> > > This ensures fairness among ports under load and with few
>> > > netlink sockets.
>> > >
>> > Thanks for the patch.
>> > This patch is adding following overhead for upcall handling:
>> > 1. kmalloc.
>> > 2. global spin-lock.
>> > 3. context switch to single worker thread.
>> > I think this could become bottle neck on most of multi core systems.
>> > You have mentioned issue with existing fairness mechanism, Can you
>> > elaborate on those, I think we could improve that before implementing
>> > heavy weight fairness in upcall handling.
>>
>> Hi Pravin,
>>
>> vswitchd allocates N * P netlink sockets, where N is the number of
>> online CPU cores, and P the number of ports.
>> With some setups, this number can grow quite fast, also exceeding the
>> system maximum file descriptor limit.
>> I've seen a 48 core server failing with -EMFILE when trying to create
>> more than 65535 netlink sockets needed for handling 1800+ ports.
>>
>> I made a previous attempt to reduce the sockets to one per CPU, but
>> this was discussed and rejected on ovs-dev because it would remove
>> fairness among ports[1].

Rather than reducing number of thread down to 1, we could find better
number of FDs per port.
How about this simple solution:
1. Allocate (N * P) FDs as long as it is under FD limit.
2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.


Thanks,
Pravin.

>> I think that the current approach of opening a huge number of sockets
>> doesn't really work, (it doesn't scale for sure), it still needs some
>> queueing logic (either in kernel or user space) if we really want to
>> be sure that low traffic ports gets their upcalls quota when other
>> ports are doing way more traffic.
>>
>> If you are concerned about the kmalloc or spinlock, we can solve them
>> with kmem_cache or two copies of the list and rcu, I'll happy to
>> discuss the implementation details, as long as we all agree that the
>> current implementation doesn't scale well and has an issue.
>>

>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html
>>
>> --
>> Matteo Croce
>> per aspera ad upstream
>
> Hi all,
>
> any idea on how to solve the file descriptor limit hit by the netlink sockets?
> I see this issue happen very often, and raising the FD limit to 400k
> seems not the right way to solve it.
> Any other suggestion on how to improve the patch, or solve the problem
> in a different way?
>
> Regards,
>
>
>
> --
> Matteo Croce
> per aspera ad upstream


Re: [PATCH][net-next] openvswitch: eliminate cpu_used_mask from sw_flow

2018-07-28 Thread Pravin Shelar
On Fri, Jul 27, 2018 at 1:03 AM, Li RongQing  wrote:
> The size of struct cpumask varies with CONFIG_NR_CPUS, some config
> CONFIG_NR_CPUS is very larger, like 5120, struct cpumask will take
> 640 bytes, if there is thousands of flows, it will take lots of
> memory
>
I am fine with removing cpumask bitmap from flow struct.

> cpu_used_mask has two purposes
> 1: Assume first cpu as cpu0 which maybe not true; now use
>cpumask_first(cpu_possible_mask)

I am not sure about this, most of system would have cpu zero, so why
this change is done in this patch ? This adds overhead of calculating
first cpu when updating stats in fast path.

> 2: when get/clear statistic, reduce the iteratation; but it
>is not hot path, so use for_each_possible_cpu
>


Re: [PATCH net 2/2] openvswitch: check for null return for nla_nest_start in datapath

2018-07-19 Thread Pravin Shelar
On Wed, Jul 18, 2018 at 9:12 AM, Stephen Hemminger
 wrote:
> The call to nla_nest_start when forming packet messages can lead to a NULL
> return so it's possible for attr to become NULL and we can potentially
> get a NULL pointer dereference on attr.  Fix this by checking for
> a NULL return.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200537
> Fixes: 8f0aad6f35f7 ("openvswitch: Extend packet attribute for egress tunnel 
> info")
> Signed-off-by: Stephen Hemminger 
> ---
>  net/openvswitch/datapath.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 0f5ce77460d4..93c3eb635827 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -460,6 +460,10 @@ static int queue_userspace_packet(struct datapath *dp, 
> struct sk_buff *skb,
>
> if (upcall_info->egress_tun_info) {
> nla = nla_nest_start(user_skb, 
> OVS_PACKET_ATTR_EGRESS_TUN_KEY);
> +   if (!nla) {
> +   err = -EMSGSIZE;
> +   goto out;
> +   }
> err = ovs_nla_put_tunnel_info(user_skb,
>   upcall_info->egress_tun_info);
> BUG_ON(err);
> @@ -468,6 +472,10 @@ static int queue_userspace_packet(struct datapath *dp, 
> struct sk_buff *skb,
>
> if (upcall_info->actions_len) {
> nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_ACTIONS);
> +   if (!nla) {
> +   err = -EMSGSIZE;
> +   goto out;
> +   }
> err = ovs_nla_put_actions(upcall_info->actions,
>   upcall_info->actions_len,
>   user_skb);

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net 1/2] openvswitch: check for null return for nla_nest_start

2018-07-19 Thread Pravin Shelar
On Wed, Jul 18, 2018 at 9:12 AM, Stephen Hemminger
 wrote:
> The call to nla_nest_start in conntrack can lead to a NULL
> return so it's possible for attr to become NULL and we can potentially
> get a NULL pointer dereference on attr.  Fix this by checking for
> a NULL return.
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200533
> Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
> Signed-off-by: Stephen Hemminger 
> ---
>  net/openvswitch/conntrack.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 284aca2a252d..2e316f641df8 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -2132,6 +2132,8 @@ static int ovs_ct_limit_cmd_get(struct sk_buff *skb, 
> struct genl_info *info)
> return PTR_ERR(reply);
>
> nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
> +   if (!nla_reply)
> +   return PRT_ERR(-EMSGSIZE);
>
> if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
> err = ovs_ct_limit_get_zone_limit(
> --
Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-07-10 Thread Pravin Shelar
On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce  wrote:
> From: Stefano Brivio 
>
> Open vSwitch sends to userspace all received packets that have
> no associated flow (thus doing an "upcall"). Then the userspace
> program creates a new flow and determines the actions to apply
> based on its configuration.
>
> When a single port generates a high rate of upcalls, it can
> prevent other ports from dispatching their own upcalls. vswitchd
> overcomes this problem by creating many netlink sockets for each
> port, but it quickly exceeds any reasonable maximum number of
> open files when dealing with huge amounts of ports.
>
> This patch queues all the upcalls into a list, ordering them in
> a per-port round-robin fashion, and schedules a deferred work to
> queue them to userspace.
>
> The algorithm to queue upcalls in a round-robin fashion,
> provided by Stefano, is based on these two rules:
>  - upcalls for a given port must be inserted after all the other
>occurrences of upcalls for the same port already in the queue,
>in order to avoid out-of-order upcalls for a given port
>  - insertion happens once the highest upcall count for any given
>port (excluding the one currently at hand) is greater than the
>count for the port we're queuing to -- if this condition is
>never true, upcall is queued at the tail. This results in a
>per-port round-robin order.
>
> In order to implement a fair round-robin behaviour, a variable
> queueing delay is introduced. This will be zero if the upcalls
> rate is below a given threshold, and grows linearly with the
> queue utilisation (i.e. upcalls rate) otherwise.
>
> This ensures fairness among ports under load and with few
> netlink sockets.
>
Thanks for the patch.
This patch is adding following overhead for upcall handling:
1. kmalloc.
2. global spin-lock.
3. context switch to single worker thread.
I think this could become bottle neck on most of multi core systems.
You have mentioned issue with existing fairness mechanism, Can you
elaborate on those, I think we could improve that before implementing
heavy weight fairness in upcall handling.


Re: [PATCH v2 net-next] openvswitch: kernel datapath clone action

2018-07-05 Thread Pravin Shelar
On Mon, Jul 2, 2018 at 8:18 AM, Yifeng Sun  wrote:
> Add 'clone' action to kernel datapath by using existing functions.
> When actions within clone don't modify the current flow, the flow
> key is not cloned before executing clone actions.
>
> This is a follow up patch for this incomplete work:
> https://patchwork.ozlabs.org/patch/722096/
>
> v1 -> v2:
> Refactor as advised by reviewer.
>
> Signed-off-by: Yifeng Sun 
> Signed-off-by: Andy Zhou 

Acked-by: Pravin B Shelar 
Thanks.


Re: [PATCH net-next] openvswitch: kernel datapath clone action

2018-06-29 Thread Pravin Shelar
On Thu, Jun 28, 2018 at 8:20 AM, Yifeng Sun  wrote:
> Add 'clone' action to kernel datapath by using existing functions.
> When actions within clone don't modify the current flow, the flow
> key is not cloned before executing clone actions.
>
> This is a follow up patch for this incomplete work:
> https://patchwork.ozlabs.org/patch/722096/
>
> Signed-off-by: Yifeng Sun 
> Signed-off-by: Andy Zhou 
> ---
>  include/uapi/linux/openvswitch.h |  8 +
>  net/openvswitch/actions.c| 33 ++
>  net/openvswitch/flow_netlink.c   | 73 
> 
>  3 files changed, 114 insertions(+)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 863aaba..5de8583 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -625,6 +625,11 @@ struct sample_arg {
>   * 'OVS_SAMPLE_ATTR_PROBABILITY'.
>   */
>  };
> +
> +#define OVS_CLONE_ATTR_EXEC  0   /* Specify an u32 value. When nonzero,
> + * actions in clone will not change flow
> + * keys. False otherwise.
> + */
>  #endif

This symbol is used only in datapath, so we can move it to kernel
headers from uapi.

> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 30a5df2..e31 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -1057,6 +1057,28 @@ static int sample(struct datapath *dp, struct sk_buff 
> *skb,
>  clone_flow_key);
>  }
>
> +/* When 'last' is true, clone() should always consume the 'skb'.
> + * Otherwise, clone() should keep 'skb' intact regardless what
> + * actions are executed within clone().
> + */
> +static int clone(struct datapath *dp, struct sk_buff *skb,
> +struct sw_flow_key *key, const struct nlattr *attr,
> +bool last)
> +{
> +   struct nlattr *actions;
> +   struct nlattr *clone_arg;
> +   int rem = nla_len(attr);
> +   bool clone_flow_key;
> +
> +   /* The first action is always 'OVS_CLONE_ATTR_ARG'. */
> +   clone_arg = nla_data(attr);
> +   clone_flow_key = !nla_get_u32(clone_arg);
> +   actions = nla_next(clone_arg, );
> +

Since OVS_CLONE_ATTR_EXEC means do not clone the key, it can be named
accordingly.


Re: [PATCH net-next v5 1/2] openvswitch: Add conntrack limit netlink definition

2018-05-25 Thread Pravin Shelar
On Thu, May 24, 2018 at 5:56 PM, Yi-Hung Wei  wrote:
> Define netlink messages and attributes to support user kernel
> communication that uses the conntrack limit feature.
>
> Signed-off-by: Yi-Hung Wei 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net-next v5 2/2] openvswitch: Support conntrack zone limit

2018-05-25 Thread Pravin Shelar
On Thu, May 24, 2018 at 5:56 PM, Yi-Hung Wei  wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high leve APIs are provided to the userspace:
>   - OVS_CT_LIMIT_CMD_SET:
> * set default connection limit for all zones
> * set the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_DEL:
> * remove the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_GET:
> * get the default connection limit for all zones
> * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei 

Thanks for working on the feature.

Acked-by: Pravin B Shelar 


Re: [PATCH net-next v4 2/2] openvswitch: Support conntrack zone limit

2018-05-24 Thread Pravin Shelar
On Mon, May 21, 2018 at 5:16 PM, Yi-Hung Wei  wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high leve APIs are provided to the userspace:
>   - OVS_CT_LIMIT_CMD_SET:
> * set default connection limit for all zones
> * set the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_DEL:
> * remove the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_GET:
> * get the default connection limit for all zones
> * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei 

I have few comments, but otherwise patch looks good.
> ---
>  net/openvswitch/Kconfig |   3 +-
>  net/openvswitch/conntrack.c | 541 
> +++-
>  net/openvswitch/conntrack.h |   9 +-
>  net/openvswitch/datapath.c  |   7 +-
>  net/openvswitch/datapath.h  |   3 +
>  5 files changed, 557 insertions(+), 6 deletions(-)
>
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index 2650205cdaf9..89da9512ec1e 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -9,7 +9,8 @@ config OPENVSWITCH
>(NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
>  (!NF_NAT || NF_NAT) && \
>  (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
> -(!NF_NAT_IPV6 || NF_NAT_IPV6)))
> +(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
> +(!NETFILTER_CONNCOUNT || 
> NETFILTER_CONNCOUNT)))
> select LIBCRC32C
> select MPLS
> select NET_MPLS_GSO
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 02fc343feb66..e8bb91420ca9 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -16,8 +16,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,6 +79,31 @@ struct ovs_conntrack_info {
>  #endif
>  };
>
> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
> +#define OVS_CT_LIMIT_UNLIMITED 0
> +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
> +#define CT_LIMIT_HASH_BUCKETS 512
> +static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
> +
> +struct ovs_ct_limit {
> +   /* Elements in ovs_ct_limit_info->limits hash table */
> +   struct hlist_node hlist_node;
> +   struct rcu_head rcu;
> +   u16 zone;
> +   u32 limit;
> +};
> +
> +struct ovs_ct_limit_info {
> +   u32 default_limit;
> +   struct hlist_head *limits;
> +   struct nf_conncount_data *data __aligned(8);

Why does it need explicit alignment attribute?

> +};
> +
> +static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
> +   [OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
> +};
> +#endif
> +
>  static bool labels_nonzero(const struct ovs_key_ct_labels *labels);

...

> +static int ovs_ct_check_limit(struct net *net,
> + const struct ovs_conntrack_info *info,
> + const struct nf_conntrack_tuple *tuple)
> +{
> +   struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
> +   const struct ovs_ct_limit_info *ct_limit_info = 
> ovs_net->ct_limit_info;
> +   

Re: [PATCH net-next v4 0/2] openvswitch: Support conntrack zone limit

2018-05-23 Thread Pravin Shelar
On Wed, May 23, 2018 at 10:13 AM, David Miller  wrote:
> From: Yi-Hung Wei 
> Date: Mon, 21 May 2018 17:16:03 -0700
>
>> v3->v4:
>>   - Addresses comments from Parvin that include simplify netlink API,
>> and remove unncessary RCU lockings.
>>   - Rebases to master.
>
> Pravin, please review.

I will finish review in few hours.


Re: [PATCH net-next v3 1/2] openvswitch: Add conntrack limit netlink definition

2018-05-03 Thread Pravin Shelar
On Mon, Apr 30, 2018 at 2:28 PM, Yi-Hung Wei  wrote:
> Define netlink messages and attributes to support user kernel
> communication that uses the conntrack limit feature.
>
> Signed-off-by: Yi-Hung Wei 
> ---
>  include/uapi/linux/openvswitch.h | 62 
> 
>  1 file changed, 62 insertions(+)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 713e56ce681f..ca63c16375ce 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -937,4 +937,66 @@ enum ovs_meter_band_type {
>
>  #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
>
> +/* Conntrack limit */
> +#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
> +#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
> +#define OVS_CT_LIMIT_VERSION 0x1
> +
> +enum ovs_ct_limit_cmd {
> +   OVS_CT_LIMIT_CMD_UNSPEC,
> +   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
> +   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
> +   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
> +};
> +
> +enum ovs_ct_limit_attr {
> +   OVS_CT_LIMIT_ATTR_UNSPEC,
> +   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
> +   __OVS_CT_LIMIT_ATTR_MAX
> +};
> +
> +#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
> +
> +/**
> + * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
> + * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
> + * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
> + * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
> + * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
> + * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
> + * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
> + * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
> + * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
> + */
> +enum ovs_ct_limit_option_attr {
> +   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
> +   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
> +* attributes. */
> +   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
> +* attributes. */
> +   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
> +* attributes. */
> +   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*

This option looks redundant to me, can we just use ovs_ct_limit_cmd
and have nested attributes with ovs_ct_zone_limit_attr as parameters ?
I do not see need for ovs_ct_limit_attr either, These changes would
simplify the interface.

> +* attributes. */
> +   __OVS_CT_LIMIT_OPTION_ATTR_MAX
> +};
> +
> +#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
> +
> +enum ovs_ct_zone_limit_attr {
> +   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
> +   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
> +* for all zones. */
> +   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
> +   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
> +* entries allowed in the
> +* corresponding zone. */
> +   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
> +* entries in the 
> corresponding
> +* zone. */
> +   __OVS_CT_ZONE_LIMIT_ATTR_MAX
> +};
> +
> +#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
> +
>  #endif /* _LINUX_OPENVSWITCH_H */
> --
> 2.7.4
>


Re: [PATCH net-next v3 2/2] openvswitch: Support conntrack zone limit

2018-05-03 Thread Pravin Shelar
On Mon, Apr 30, 2018 at 2:28 PM, Yi-Hung Wei  wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high leve APIs are provided to the userspace:
>   - OVS_CT_LIMIT_CMD_SET:
> * set default connection limit for all zones
> * set the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_DEL:
> * remove the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_GET:
> * get the default connection limit for all zones
> * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei 
> ---
>  net/openvswitch/Kconfig |   3 +-
>  net/openvswitch/conntrack.c | 508 
> +++-
>  net/openvswitch/conntrack.h |   9 +-
>  net/openvswitch/datapath.c  |   7 +-
>  net/openvswitch/datapath.h  |   1 +
>  5 files changed, 522 insertions(+), 6 deletions(-)
>
..
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index c5904f629091..8234964889d9 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
...

> +/* Call with ovs_mutex */
> +static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
> +{
> +   struct ovs_ct_limit *ct_limit;
> +   struct hlist_head *head;
> +
> +   head = ct_limit_hash_bucket(info, zone);
> +   hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
better to use hlist_for_each_entry_safe()

> +   if (ct_limit->zone == zone) {
> +   hlist_del_rcu(_limit->hlist_node);
> +   kfree_rcu(ct_limit, rcu);
> +   return;
> +   }
> +   }
> +}
> +


> +static int ovs_ct_check_limit(struct net *net,
> + const struct ovs_conntrack_info *info,
> + const struct nf_conntrack_tuple *tuple)
> +{
> +   struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
> +   const struct ovs_ct_limit_info *ct_limit_info = 
> ovs_net->ct_limit_info;
> +   u32 per_zone_limit, connections;
> +   u32 conncount_key[5];
> +
> +   conncount_key[0] = info->zone.id;
> +
> +   rcu_read_lock();
This function is call with rcu_read_lock() in datapath, so no need to
take it again.

> +   per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
> +   if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED) {
> +   rcu_read_unlock();
> +   return 0;
> +   }
> +
> +   connections = nf_conncount_count(net, ct_limit_info->data,
> +conncount_key, tuple, >zone);
> +   if (connections > per_zone_limit) {
> +   rcu_read_unlock();
> +   return -ENOMEM;
> +   }
> +
> +   rcu_read_unlock();
> +   return 0;
> +}
> +#endif
> +


>
>  static void __net_exit list_vports_from_net(struct net *net, struct net 
> *dnet,
> @@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
>  MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
>  MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
>  MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
> +MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
> diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
> index 523d65526766..51bd4dcb6c8b 100644
> --- a/net/openvswitch/datapath.h
> +++ b/net/openvswitch/datapath.h
> @@ -144,6 

Re: [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit

2018-04-27 Thread Pravin Shelar
On Wed, Apr 25, 2018 at 2:51 PM, Yi-Hung Wei  wrote:
>>> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
>>> +#define OVS_CT_LIMIT_UNLIMITED 0
>>> +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
>>> +#define CT_LIMIT_HASH_BUCKETS 512
>>> +
>> Can you use static key when the limit is not set.
>> This would avoid overhead in datapath when these limits are not used.
>>
> Thanks Parvin for the review.  I'm not sure about the 'static key'
> part, are you suggesting that say if we can have a flag to check if no
> one has ever set the ct_limit?   If the ct_limit feature is not used
> so far, then instead of look up the hash table for the zone limit, we
> just return the default unlimited value.  So that we can avoid the
> overhead of checking ct_limit.
>

Right, here documentaion of static keys:
https://www.kernel.org/doc/Documentation/static-keys.txt

>>> +struct ovs_ct_limit {
>>> +   /* Elements in ovs_ct_limit_info->limits hash table */
>>> +   struct hlist_node hlist_node;
>>> +   struct rcu_head rcu;
>>> +   u16 zone;
>>> +   u32 limit;
>>> +};
>>> +
>> ...
>
>
>>>  /* Lookup connection and confirm if unconfirmed. */
>>>  static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>>>  const struct ovs_conntrack_info *info,
>>> @@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct 
>>> sw_flow_key *key,
>>> if (!ct)
>>> return 0;
>>>
>>> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
>>> +   err = ovs_ct_check_limit(net, info,
>>> +>tuplehash[IP_CT_DIR_ORIGINAL].tuple);
>>> +   if (err)
>>> +   return err;
>>> +#endif
>>> +
>>
>> This could be checked during flow install time, so that only permitted
>> flows would have 'ct commit' action, we can avoid per packet cost
>> checking the limit.
> It seems to me that it would be hard to check the # of connections of
> a flow in the flow installation stage.  For example, if we have a
> datapath flow that performs “ct commit” action on all UDP traffic from
> in_port 1, then it could be various combinations of 5-tuple that form
> various # of connections.  Therefore, it would be hard to do the
> admission control there.
>

Ok, I see the issue.
I think we could still avoid the lookup every time by checking the
limits for unconfirmed connections (ref: nf_ct_is_confirmed()).
So once connection confirmed and is under limit  we should allow all
packets for given connection.
This way we can avoid connection disruption when we are reaching near
connection limit.

>
>> returning error code form ovs_ct_commit() is lost in datapath and it
>> would be hard to debug packet lost in case of the limit is reached. So
>> another advantage of checking the limit in flow install be better
>> traceability. datapath would return error to usespace and it can log
>> the error code.
> Thanks for pointing out the error code from ovs_ct_commit() will be
> lost in the datapath.  In this case, shall we consider to report the
> packet drop by some rate_limit logging such as net_warn_ratelimited()
> or net_info_ratelimited()?
>

ok, that would be fine.


Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-24 Thread Pravin Shelar
On Mon, Apr 23, 2018 at 2:19 PM, Yi-Hung Wei <yihung@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 1:10 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Mon, Apr 23, 2018 at 6:39 AM, David Miller <da...@davemloft.net> wrote:
>>> From: Yi-Hung Wei <yihung@gmail.com>
>>> Date: Tue, 17 Apr 2018 17:30:27 -0700
>>>
>>>> Currently, nf_conntrack_max is used to limit the maximum number of
>>>> conntrack entries in the conntrack table for every network namespace.
>>>> For the VMs and containers that reside in the same namespace,
>>>> they share the same conntrack table, and the total # of conntrack entries
>>>> for all the VMs and containers are limited by nf_conntrack_max.  In this
>>>> case, if one of the VM/container abuses the usage the conntrack entries,
>>>> it blocks the others from committing valid conntrack entries into the
>>>> conntrack table.  Even if we can possibly put the VM in different network
>>>> namespace, the current nf_conntrack_max configuration is kind of rigid
>>>> that we cannot limit different VM/container to have different # conntrack
>>>> entries.
>>>>
>>
>> Hi
>> This looks like general problem related to nf zone usage limit, Did
>> you considered changing nf-conntrack to have a per zone limit, so that
>> all users of nf-filter can use it. I prefer this to adding a wrapper
>> in OVS nf-filter layer.
>>
>> Thanks,
>> Pravin.
>>
>
> Hi Prvain,
>
> Thanks for your comment.  Originally, I was thinking to add this
> feature in nf_conntrack and had some discussion with Florian.  It
> turns out that iptables and nft have their own way to keep track of
> the connection limits, and it sounds reasonable to share the backend
> that counts the number of connections, but each module can enforce the
> connection limit in their own way.  Therefore, Florian helped to pull
> out the common backend to nf_conncount in the following commit. The
> nf_conncount then can be used by xtables, nft, and ovs.
>
> commit 625c556118f3c2fd28bb8ef6da18c53bd4037be4
> Author: Florian Westphal <f...@strlen.de>
> Date:   Sat Dec 9 21:01:08 2017 +0100
>
> netfilter: connlimit: split xt_connlimit into front and backend
>
> This allows to reuse xt_connlimit infrastructure from nf_tables.
> The upcoming nf_tables frontend can just pass in an nftables register
> as input key, this allows limiting by any nft-supported key, including
> concatenations.  For xt_connlimit, pass in the zone and the ip/ipv6 addres.
> 
>
>
> Basically, to achieve conntrack zone limit in OVS.  We need the
> following 3 parts.
> 1. Count the number of connections (this is provided by netfilter's
> nf_conncount backend)
> 2. Keep track of the connection limits of zones, and check if it
> exceeds the limit.
> 3. An API for userspace to set/delete/get the conntrack zone limit.
>
> This patch series implements item 2 and 3, and it reuses the
> nf_conncount from netfiler for the first part.
>
OK. Thanks for the info.


Re: [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit

2018-04-24 Thread Pravin Shelar
On Tue, Apr 17, 2018 at 5:30 PM, Yi-Hung Wei  wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max.  In this
> case, if one of the VM/container abuses the usage the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table.  Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone.  For example, we can designate different zone to different VM,
> and set conntrack limit to each zone.  By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs.  Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone.  If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace.  If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high leve APIs are provided to the userspace:
>   - OVS_CT_LIMIT_CMD_SET:
> * set default connection limit for all zones
> * set the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_DEL:
> * remove the connection limit for a particular zone
>   - OVS_CT_LIMIT_CMD_GET:
> * get the default connection limit for all zones
> * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei 
> ---
>  net/openvswitch/Kconfig |   3 +-
>  net/openvswitch/conntrack.c | 498 
> +++-
>  net/openvswitch/conntrack.h |   9 +-
>  net/openvswitch/datapath.c  |   7 +-
>  net/openvswitch/datapath.h  |   1 +
>  5 files changed, 512 insertions(+), 6 deletions(-)
>
> diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
> index 2650205cdaf9..89da9512ec1e 100644
> --- a/net/openvswitch/Kconfig
> +++ b/net/openvswitch/Kconfig
> @@ -9,7 +9,8 @@ config OPENVSWITCH
>(NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
>  (!NF_NAT || NF_NAT) && \
>  (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
> -(!NF_NAT_IPV6 || NF_NAT_IPV6)))
> +(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
> +(!NETFILTER_CONNCOUNT || 
> NETFILTER_CONNCOUNT)))
> select LIBCRC32C
> select MPLS
> select NET_MPLS_GSO
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index c5904f629091..d09b572f72b4 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -17,7 +17,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,6 +78,38 @@ struct ovs_conntrack_info {
>  #endif
>  };
>
> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
> +#define OVS_CT_LIMIT_UNLIMITED 0
> +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
> +#define CT_LIMIT_HASH_BUCKETS 512
> +
Can you use static key when the limit is not set.
This would avoid overhead in datapath when these limits are not used.

> +struct ovs_ct_limit {
> +   /* Elements in ovs_ct_limit_info->limits hash table */
> +   struct hlist_node hlist_node;
> +   struct rcu_head rcu;
> +   u16 zone;
> +   u32 limit;
> +};
> +
...

> +#endif
> +
>  /* Lookup connection and confirm if unconfirmed. */
>  static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>  const struct ovs_conntrack_info *info,
> @@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct 
> sw_flow_key *key,
> if (!ct)
> return 0;
>
> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
> +   err = ovs_ct_check_limit(net, info,
> +>tuplehash[IP_CT_DIR_ORIGINAL].tuple);
> +   if (err)
> +   return err;
> +#endif
> +

This could be checked during flow install time, so that only permitted
flows would have 'ct commit' action, we can avoid per packet cost
checking the limit.
returning error code form ovs_ct_commit() is lost in datapath and it
would be hard to 

Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-23 Thread Pravin Shelar
On Mon, Apr 23, 2018 at 6:39 AM, David Miller  wrote:
> From: Yi-Hung Wei 
> Date: Tue, 17 Apr 2018 17:30:27 -0700
>
>> Currently, nf_conntrack_max is used to limit the maximum number of
>> conntrack entries in the conntrack table for every network namespace.
>> For the VMs and containers that reside in the same namespace,
>> they share the same conntrack table, and the total # of conntrack entries
>> for all the VMs and containers are limited by nf_conntrack_max.  In this
>> case, if one of the VM/container abuses the usage the conntrack entries,
>> it blocks the others from committing valid conntrack entries into the
>> conntrack table.  Even if we can possibly put the VM in different network
>> namespace, the current nf_conntrack_max configuration is kind of rigid
>> that we cannot limit different VM/container to have different # conntrack
>> entries.
>>

Hi
This looks like general problem related to nf zone usage limit, Did
you considered changing nf-conntrack to have a per zone limit, so that
all users of nf-filter can use it. I prefer this to adding a wrapper
in OVS nf-filter layer.

Thanks,
Pravin.

>> To address the aforementioned issue, this patch proposes to have a
>> fine-grained mechanism that could further limit the # of conntrack entries
>> per-zone.  For example, we can designate different zone to different VM,
>> and set conntrack limit to each zone.  By providing this isolation, a
>> mis-behaved VM only consumes the conntrack entries in its own zone, and
>> it will not influence other well-behaved VMs.  Moreover, the users can
>> set various conntrack limit to different zone based on their preference.
>>
>> The proposed implementation utilizes Netfilter's nf_conncount backend
>> to count the number of connections in a particular zone.  If the number of
>> connection is above a configured limitation, OVS will return ENOMEM to the
>> userspace.  If userspace does not configure the zone limit, the limit
>> defaults to zero that is no limitation, which is backward compatible to
>> the behavior without this patch.
>>
>> The first patch defines the conntrack limit netlink definition, and the
>> second patch provides the implementation.
>
> Pravin, I need this series reviewed.
>
> Thank you.


Re: [PATCH net] net: fix possible out-of-bound read in skb_network_protocol()

2018-03-27 Thread Pravin Shelar
On Mon, Mar 26, 2018 at 8:08 AM, Eric Dumazet  wrote:
> skb mac header is not necessarily set at the time skb_network_protocol()
> is called. Use skb->data instead.
>
> BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 
> net/core/dev.c:2739
> Read of size 2 at addr 8801b3097a0b by task syz-executor5/14242
>
> CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report+0x23c/0x360 mm/kasan/report.c:412
>  __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443
>  skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
>  harmonize_features net/core/dev.c:2924 [inline]
>  netif_skb_features+0x509/0x9b0 net/core/dev.c:3011
>  validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084
>  validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142
>  packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256
>  packet_snd net/packet/af_packet.c:2944 [inline]
>  packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:639
>  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
>  __sys_sendmsg+0xe5/0x210 net/socket.c:2081
>
> Fixes: 19acc327258a ("gso: Handle Trans-Ether-Bridging protocol in 
> skb_network_protocol()")
> Signed-off-by: Eric Dumazet 
> Cc: Pravin B Shelar 
> Reported-by: Reported-by: syzbot 
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 
> 12be205357146f0dcd55cc6e6f71dfb65fdeb33b..ef0cc6ea5f8da5b87c751d9eebfc0943fbe36a06
>  100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2735,7 +2735,7 @@ __be16 skb_network_protocol(struct sk_buff *skb, int 
> *depth)
> if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr
> return 0;
>
> -   eth = (struct ethhdr *)skb_mac_header(skb);
> +   eth = (struct ethhdr *)skb->data;
> type = eth->h_proto;
> }
>
Thanks for fixing it.


Re: [PATCH net v2] openvswitch: meter: fix the incorrect calculation of max delta_t

2018-03-10 Thread Pravin Shelar
On Thu, Mar 8, 2018 at 6:08 PM, zhangliping  wrote:
> From: zhangliping 
>
> Max delat_t should be the full_bucket/rate instead of the full_bucket.
> Also report EINVAL if the rate is zero.
>
> Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
> Cc: Andy Zhou 
> Signed-off-by: zhangliping 
> ---
>  V2: report EINVAL if the rate is 0 to avoid divide by zero error.
>

Looks good.

Acked-by: Pravin B Shelar 


Re: [PATCHv2 net-next] openvswitch: fix vport packet length check.

2018-03-07 Thread Pravin Shelar
On Wed, Mar 7, 2018 at 3:38 PM, William Tu  wrote:
> When sending a packet to a tunnel device, the dev's hard_header_len
> could be larger than the skb->len in function packet_length().
> In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
> which is around 180, and an ARP packet sent to this tunnel has
> skb->len = 42.  This causes the 'unsign int length' to become super
> large because it is negative value, causing the later ovs_vport_send
> to drop it due to over-mtu size.  The patch fixes it by setting it to 0.
>
> Signed-off-by: William Tu 
> ---
> v1->v2:
>   replace the return type from unsigned int to int
> ---
Acked-by: Pravin B Shelar 


Re: [PATCH net-next] openvswitch: fix vport packet length check.

2018-03-07 Thread Pravin Shelar
On Tue, Mar 6, 2018 at 5:56 PM, William Tu  wrote:
> When sending a packet to a tunnel device, the dev's hard_header_len
> could be larger than the skb->len in function packet_length().
> In the case of ip6gretap/erspan, hard_header_len = LL_MAX_HEADER + t_hlen,
> which is around 180, and an ARP packet sent to this tunnel has
> skb->len = 42.  This causes the 'unsign int length' to become super
> large because it is negative value, causing the later ovs_vport_send
> to drop it due to over-mtu size.  The patch fixes it by setting it to 0.
>
> Signed-off-by: William Tu 
> ---
>  net/openvswitch/vport.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index b6c8524032a0..7718d5b4cf8a 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -467,7 +467,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff 
> *skb,
>  static unsigned int packet_length(const struct sk_buff *skb,
>   struct net_device *dev)
Can you also change return type of this function?

>  {
> -   unsigned int length = skb->len - dev->hard_header_len;
> +   int length = skb->len - dev->hard_header_len;
>
> if (!skb_vlan_tag_present(skb) &&
> eth_type_vlan(skb->protocol))
> @@ -478,7 +478,7 @@ static unsigned int packet_length(const struct sk_buff 
> *skb,
>  * account for 802.1ad. e.g. is_skb_forwardable().
>  */
>
> -   return length;
> +   return length > 0 ? length : 0;
>  }
>
>  void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
> --
> 2.7.4
>


Re: [PATCH v4] openvswitch: Remove padding from packet before L3+ conntrack processing

2018-01-31 Thread Pravin Shelar
On Wed, Jan 31, 2018 at 6:48 PM, Ed Swierk  wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device with a minimum packet length of 64 bytes.
>
> Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
> and help() in nf_conntrack_ftp) assume skb->len reflects the length of
> the L3 header and payload, rather than referring back to
> ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
> lower-layer padding.
>
> In the normal IPv4 receive path, ip_rcv() trims the packet to
> ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
> path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
> in the br_netfilter receive path, br_validate_ipv4() and
> br_validate_ipv6() trim the packet to the L3 length before invoking
> netfilter hooks.
>
> Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
> the skb to the L3 header but does not trim it to the L3 length before
> calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
> nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
> nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
> message. While extra zero bytes don't affect the checksum, the length
> in the IP pseudoheader does. That length is based on skb->len, and
> without trimming, it doesn't match the length the sender used when
> computing the checksum.
>
> In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
> processing.
>
> Signed-off-by: Ed Swierk 

Acked-by: Pravin B Shelar 


Re: [PATCHv6 net-next 3/3] openvswitch: add erspan version I and II support

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> The patch adds support for openvswitch to configure erspan
> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
> to uapi as a binary blob to support all ERSPAN v1 and v2's
> fields.  Note that Previous commit "openvswitch: Add erspan tunnel
> support." was reverted since it does not design properly.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv6 net-next 2/3] net: erspan: create erspan metadata uapi header

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> The patch adds a new uapi header file, erspan.h, and moves
> the 'struct erspan_metadata' from internal erspan.h to it.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv6 net-next 1/3] net: erspan: use bitfield instead of mask and offset

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> Originally the erspan fields are defined as a group into a __be16 field,
> and use mask and offset to access each field.  This is more costly due to
> calling ntohs/htons.  The patch changes it to use bitfields.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv5 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 10:34 AM, William Tu <u9012...@gmail.com> wrote:
> Thanks for the review.
>
> On Thu, Jan 25, 2018 at 9:32 AM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Wed, Jan 24, 2018 at 11:06 AM, William Tu <u9012...@gmail.com> wrote:
>>> The patch adds support for openvswitch to configure erspan
>>> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
>>> to uapi as a binary blob to support all ERSPAN v1 and v2's
>>> fields.  Note that Previous commit "openvswitch: Add erspan tunnel
>>> support." was reverted since it does not design properly.
>>>
>>> Signed-off-by: William Tu <u9012...@gmail.com>
>>> ---
>>>  include/uapi/linux/openvswitch.h |  2 +-
>>>  net/openvswitch/flow_netlink.c   | 90 
>>> +++-
>>>  2 files changed, 90 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/openvswitch.h 
>>> b/include/uapi/linux/openvswitch.h
>>> index dcfab5e3b55c..158c2e45c0a5 100644
>>> --- a/include/uapi/linux/openvswitch.h
>>> +++ b/include/uapi/linux/openvswitch.h
>>> @@ -273,7 +273,6 @@ enum {
>>>
>>>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>>>
>>> -
>>>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>>>   */
>>>  enum {
>>> @@ -363,6 +362,7 @@ enum ovs_tunnel_key_attr {
>>> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
>>> address. */
>>> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
>>> address. */
>>> OVS_TUNNEL_KEY_ATTR_PAD,
>>> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
>>> __OVS_TUNNEL_KEY_ATTR_MAX
>>>  };
>>>
>> Since this is uapi, we need to define the struct erspan_metadata in a
>> UAPI header file.
>
> Should I define "struct erspan_metadata" in include/uapi/linux/openvswitch.h?
>
> Or I'm planning to create a new file in uapi "include/uapi/linux/erspan.h",
> define "struct erspan_metadata" there, and remove its duplicate at
> include/net/erspan.h.
>
Better to define separate header for ERSPAN.

Thanks,
Pravin.


Re: [PATCHv5 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-25 Thread Pravin Shelar
On Wed, Jan 24, 2018 at 11:06 AM, William Tu  wrote:
> The patch adds support for openvswitch to configure erspan
> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
> to uapi as a binary blob to support all ERSPAN v1 and v2's
> fields.  Note that Previous commit "openvswitch: Add erspan tunnel
> support." was reverted since it does not design properly.
>
> Signed-off-by: William Tu 
> ---
>  include/uapi/linux/openvswitch.h |  2 +-
>  net/openvswitch/flow_netlink.c   | 90 
> +++-
>  2 files changed, 90 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index dcfab5e3b55c..158c2e45c0a5 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -273,7 +273,6 @@ enum {
>
>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>
> -
>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>   */
>  enum {
> @@ -363,6 +362,7 @@ enum ovs_tunnel_key_attr {
> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_PAD,
> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* struct erspan_metadata */
> __OVS_TUNNEL_KEY_ATTR_MAX
>  };
>
Since this is uapi, we need to define the struct erspan_metadata in a
UAPI header file.

Also lets move field 'version' to the begining of the struct for easy
expansion later.
struct erspan_metadata {
int version;
union {
__be32 index;   /* Version 1 (type II)*/
struct erspan_md2 md2;  /* Version 2 (type III) */
} u;
};

> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index f143908b651d..9d00c24b2836 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -49,6 +49,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "flow_netlink.h"
>
> @@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void)
> + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */
> + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */
> + nla_total_size(256)  /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
> -   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
> +   /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and
> +* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
>  * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
>  */
> + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
> @@ -400,6 +402,7 @@ static const struct ovs_len_tbl 
> ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
> .next = 
> ovs_vxlan_ext_key_lens },
> [OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct 
> in6_addr) },
> [OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct 
> in6_addr) },
> +   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_VARIABLE },
>  };
>
>  static const struct ovs_len_tbl
> @@ -631,6 +634,33 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
> *attr,
> return 0;
>  }
>
> +static int erspan_tun_opt_from_nlattr(const struct nlattr *a,
> + struct sw_flow_match *match, bool 
> is_mask,
> + bool log)
> +{
> +   unsigned long opt_key_offset;
> +
> +   BUILD_BUG_ON(sizeof(struct erspan_metadata) >
> +sizeof(match->key->tun_opts));
> +
> +   if (nla_len(a) > sizeof(match->key->tun_opts)) {
> +   OVS_NLERR(log, "ERSPAN option length err (len %d, max %zu).",
> + nla_len(a), sizeof(match->key->tun_opts));
> +   return -EINVAL;
> +   }
> +
> +   if (!is_mask)
> +   SW_FLOW_KEY_PUT(match, tun_opts_len,
> +   sizeof(struct erspan_metadata), false);
> +   else
> +   SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
> +
> +   opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
> +   SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
> + nla_len(a), is_mask);
> +   return 0;
> +}
> +
...
> @@ -2461,6 +2509,41 @@ static int validate_geneve_opts(struct sw_flow_key 
> *key)
> return 0;
>  }
>
> +static int validate_erspan_opts(struct sw_flow_key *key, bool log)
> +{
I do not see any need to validate ERSPAN key values, except the total
len, which should be less than 255.

> +
>  static int validate_and_copy_set_tun(const struct nlattr *attr,
>  struct sw_flow_actions **sfa, bool log)
>  {
> @@ -2486,6 +2569,11 @@ static int validate_and_copy_set_tun(const struct 
> nlattr *attr,
>  

Re: [PATCH v3 1/2] skbuff: Add skb_network_trim

2018-01-23 Thread Pravin Shelar
On Mon, Jan 22, 2018 at 6:42 PM, Ed Swierk  wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device with a minimum packet length of 64 bytes.
>
> Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
> and help() in nf_conntrack_ftp) assume skb->len reflects the length of
> the L3 header and payload, rather than referring back to
> ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
> lower-layer padding.
>
> In the normal IPv4 receive path, ip_rcv() trims the packet to
> ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
> path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
> in the br_netfilter receive path, br_validate_ipv4() and
> br_validate_ipv6() trim the packet to the L3 length before invoking
> netfilter hooks.
>
> Currently the openvswitch receive path does not perform such trimming,
> which will be fixed by the next patch. In preparation, this patch adds
> a generic skb_network_trim() function.
>
> Signed-off-by: Ed Swierk 
> ---
>  include/linux/skbuff.h |  2 ++
>  net/core/skbuff.c  | 44 
>  2 files changed, 46 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 051e093..0478645 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4018,6 +4018,8 @@ struct sk_buff *skb_checksum_trimmed(struct sk_buff 
> *skb,
>  unsigned int transport_len,
>  __sum16(*skb_chkf)(struct sk_buff *skb));
>
> +int skb_network_trim(struct sk_buff *skb);
> +
>  /**
>   * skb_head_is_locked - Determine if the skb->head is locked down
>   * @skb: skb to check
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 15fa5ba..cef3d1e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4743,6 +4743,50 @@ struct sk_buff *skb_checksum_trimmed(struct sk_buff 
> *skb,
>  }
>  EXPORT_SYMBOL(skb_checksum_trimmed);
>
> +/**
> + * skb_network_trim - trim skb to length specified by the network header
> + * @skb: the skb to trim
> + *
> + * Trims the skb to the length specified by the IP/IPv6 header,
> + * removing any trailing lower-layer padding. This prepares the skb
> + * for higher-layer processing that assumes skb->len excludes padding.
> + *
> + * Leaves the skb alone if the protocol is not IP or IPv6. Frees the
> + * skb on error.
> + *
> + * Caller needs to pull the skb to the network header.
> + */
> +int skb_network_trim(struct sk_buff *skb)
> +{
> +   unsigned int len;
> +   int err = -ENOMEM;
> +
> +   switch (skb->protocol) {
> +   case htons(ETH_P_IP):
> +   if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> +   goto out;
Since you are going to move this to OVS specific code, can you remove
this skb-pull, which is not required in OVS code path.

> +   len = ntohs(ip_hdr(skb)->tot_len);
> +   break;
> +   case htons(ETH_P_IPV6):
> +   if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
> +   goto out;
> +   len = sizeof(struct ipv6hdr)
> +   + ntohs(ipv6_hdr(skb)->payload_len);


Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-22 Thread Pravin Shelar
On Mon, Jan 22, 2018 at 12:14 PM, David Miller <da...@davemloft.net> wrote:
> From: Pravin Shelar <pshe...@ovn.org>
> Date: Fri, 19 Jan 2018 13:54:15 -0800
>
>> I agree it is not perfect. But the other proposed patch does not fix
>> the connectivity issue. It only adds log msg in such cases at cost
>> of extra checks/code. Therefore I prefer the easier fix for the
>> issue which also fixes for all cases of packet forwarding rather
>> than just OVS and Bridge.
>
> I really think that something needs to guarantee that device drivers
> will never be given either over-MTU or over-max-GSO-seg-size SKBs.
>
> Otherwise drivers need to add completely stupid checks like making
> sure that SKB lengths do not exceed the maxmimu value that can be
> encoded into descriptors.
>
> What's probably happening often now in such situations is that the
> driver ends up masking the length blindly and ends up sending out a
> truncated packet.
>
> Which frankly is quite bad too.
>
> It doesn't scale to add these checks into every driver, or trying to
> "figure out" which drivers will behave adversely and only add checks
> to those.
>
> The kernel shouldn't pass objects with out-of-range attributes
> to the driver, period.

OK, in that case we could add a check in validate_xmit_skb() as done
by Daniel in a patch posted earlier.


Re: [PATCH v3 2/2] openvswitch: Remove padding from packet before L3+ conntrack processing

2018-01-22 Thread Pravin Shelar
On Mon, Jan 22, 2018 at 6:42 PM, Ed Swierk  wrote:
> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
> the L3 header but does not trim it to the L3 length before calling
> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
> encounters a packet with lower-layer padding, nf_checksum() fails and
> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
> affect the checksum, the length in the IP pseudoheader does. That
> length is based on skb->len, and without trimming, it doesn't match
> the length the sender used when computing the checksum.
>
> In ovs_ct_execute(), call skb_network_trim() before any L3+ conntrack
> processing.
>
> Signed-off-by: Ed Swierk 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH v3 1/2] skbuff: Add skb_network_trim

2018-01-22 Thread Pravin Shelar
On Mon, Jan 22, 2018 at 6:42 PM, Ed Swierk  wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device with a minimum packet length of 64 bytes.
>
> Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
> and help() in nf_conntrack_ftp) assume skb->len reflects the length of
> the L3 header and payload, rather than referring back to
> ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
> lower-layer padding.
>
> In the normal IPv4 receive path, ip_rcv() trims the packet to
> ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
> path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
> in the br_netfilter receive path, br_validate_ipv4() and
> br_validate_ipv6() trim the packet to the L3 length before invoking
> netfilter hooks.
>
> Currently the openvswitch receive path does not perform such trimming,
> which will be fixed by the next patch. In preparation, this patch adds
> a generic skb_network_trim() function.
>
> Signed-off-by: Ed Swierk 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-20 Thread Pravin Shelar
On Thu, Jan 18, 2018 at 2:04 PM, William Tu  wrote:
> The patch adds support for openvswitch to configure erspan
> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
> to uapi as a nested attribute to support all ERSPAN v1 and v2's
> fields.  Note taht Previous commit "openvswitch: Add erspan tunnel
> support." was reverted since it does not design properly.
>
> Signed-off-by: William Tu 
> ---
>  include/uapi/linux/openvswitch.h |  11 +++
>  net/openvswitch/flow_netlink.c   | 154 
> ++-
>  2 files changed, 164 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index dcfab5e3b55c..f1f98fd703fe 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -273,6 +273,16 @@ enum {
>
>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>
> +enum {
> +   OVS_ERSPAN_OPT_UNSPEC,
> +   OVS_ERSPAN_OPT_IDX, /* u32 index */
> +   OVS_ERSPAN_OPT_VER, /* u8 version number */
> +   OVS_ERSPAN_OPT_DIR, /* u8 direction */
> +   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
> +   __OVS_ERSPAN_OPT_MAX,
> +};
> +
> +#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
>
>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>   */
> @@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr {
> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_PAD,
> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
> __OVS_TUNNEL_KEY_ATTR_MAX
>  };
>
Rather than passing individual members of erspan_metadata, can you
just pass whole structure between kernel and userspace. So that ovs
kernel module can just handle all erspan options as one binary blob
and does not need to interpret it.


Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-19 Thread Pravin Shelar
On Fri, Jan 19, 2018 at 3:58 AM, Daniel Axtens <d...@axtens.net> wrote:
> Pravin Shelar <pshe...@ovn.org> writes:
>
>> On Thu, Jan 18, 2018 at 5:28 PM, Daniel Axtens <d...@axtens.net> wrote:
>>> Pravin Shelar <pshe...@ovn.org> writes:
>>>
>>>> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens <d...@axtens.net> wrote:
>>>>> Pravin Shelar <pshe...@ovn.org> writes:
>>>>>
>>>>>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens <d...@axtens.net> wrote:
>>>>>>> When regular packets are forwarded, we validate their size against the
>>>>>>> MTU of the destination device. However, when GSO packets are
>>>>>>> forwarded, we do not validate their size against the MTU. We
>>>>>>> implicitly assume that when they are segmented, the resultant packets
>>>>>>> will be correctly sized.
>>>>>>>
>>>>>>> This is not always the case.
>>>>>>>
>>>>>>> We observed a case where a packet received on an ibmveth device had a
>>>>>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
>>>>>>> device, where it caused a firmware assert. This is described in detail
>>>>>>> at [0] and was the genesis of this series. Rather than fixing it in
>>>>>>> the driver, this series fixes the forwarding path.
>>>>>>>
>>>>>> Are there any other possible forwarding path in networking stack? TC
>>>>>> is one subsystem that could forward such a packet to the bnx2x device,
>>>>>> how is that handled ?
>>>>>
>>>>> So far I have only looked at bridges, openvswitch and macvlan. In
>>>>> general, if the code uses dev_forward_skb() it should automatically be
>>>>> fine as that invokes is_skb_forwardable(), which we patch.
>>>>>
>>>> But there are other ways to forward packets, e.g tc-mirred or bpf
>>>> redirect. We need to handle all these cases rather than fixing one at
>>>> a time. As Jason suggested netif_needs_gso() looks like good function
>>>> to validate if a device is capable of handling given GSO packet.
>>>
>>> I am not entirely sure this is a better solution.
>>>
>>> The biggest reason I am uncomfortable with this is that if
>>> netif_needs_gso() returns true, the skb will be segmented. The segment
>>> sizes will be based on gso_size. Since gso_size is greater than the MTU,
>>> the resulting segments will themselves be over-MTU. Those over-MTU
>>> segments will then be passed to the network card. I think we should not
>>> be creating over-MTU segments; we should instead be dropping the packet
>>> and logging.
>>>
>>
>> Would this oversized segment cause the firmware assert?
>> If this solves the assert issue then I do not see much value in adding
>> checks in fast-path just for logging.
>
> No - I tested this (or rather, as I don't have direct access to a bnx2x
> card, this was tested on my behalf): as long as the packet is not a GSO
> packet, it doesn't cause the crash. So we *could* segment them, I just
> think that knowingly creating invalid segments is not particularly
> pleasant.
>
I agree it is not perfect. But the other proposed patch does not fix
the connectivity issue. It only adds log msg in such cases at cost of
extra checks/code. Therefore I prefer the easier fix for the issue
which also fixes for all cases of packet forwarding rather than just
OVS and Bridge.


Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Pravin Shelar
On Thu, Jan 18, 2018 at 5:28 PM, Daniel Axtens <d...@axtens.net> wrote:
> Pravin Shelar <pshe...@ovn.org> writes:
>
>> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens <d...@axtens.net> wrote:
>>> Pravin Shelar <pshe...@ovn.org> writes:
>>>
>>>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens <d...@axtens.net> wrote:
>>>>> When regular packets are forwarded, we validate their size against the
>>>>> MTU of the destination device. However, when GSO packets are
>>>>> forwarded, we do not validate their size against the MTU. We
>>>>> implicitly assume that when they are segmented, the resultant packets
>>>>> will be correctly sized.
>>>>>
>>>>> This is not always the case.
>>>>>
>>>>> We observed a case where a packet received on an ibmveth device had a
>>>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
>>>>> device, where it caused a firmware assert. This is described in detail
>>>>> at [0] and was the genesis of this series. Rather than fixing it in
>>>>> the driver, this series fixes the forwarding path.
>>>>>
>>>> Are there any other possible forwarding path in networking stack? TC
>>>> is one subsystem that could forward such a packet to the bnx2x device,
>>>> how is that handled ?
>>>
>>> So far I have only looked at bridges, openvswitch and macvlan. In
>>> general, if the code uses dev_forward_skb() it should automatically be
>>> fine as that invokes is_skb_forwardable(), which we patch.
>>>
>> But there are other ways to forward packets, e.g tc-mirred or bpf
>> redirect. We need to handle all these cases rather than fixing one at
>> a time. As Jason suggested netif_needs_gso() looks like good function
>> to validate if a device is capable of handling given GSO packet.
>
> I am not entirely sure this is a better solution.
>
> The biggest reason I am uncomfortable with this is that if
> netif_needs_gso() returns true, the skb will be segmented. The segment
> sizes will be based on gso_size. Since gso_size is greater than the MTU,
> the resulting segments will themselves be over-MTU. Those over-MTU
> segments will then be passed to the network card. I think we should not
> be creating over-MTU segments; we should instead be dropping the packet
> and logging.
>

Would this oversized segment cause the firmware assert?
If this solves the assert issue then I do not see much value in adding
checks in fast-path just for logging.


Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Pravin Shelar
On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens <d...@axtens.net> wrote:
> Pravin Shelar <pshe...@ovn.org> writes:
>
>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens <d...@axtens.net> wrote:
>>> When regular packets are forwarded, we validate their size against the
>>> MTU of the destination device. However, when GSO packets are
>>> forwarded, we do not validate their size against the MTU. We
>>> implicitly assume that when they are segmented, the resultant packets
>>> will be correctly sized.
>>>
>>> This is not always the case.
>>>
>>> We observed a case where a packet received on an ibmveth device had a
>>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
>>> device, where it caused a firmware assert. This is described in detail
>>> at [0] and was the genesis of this series. Rather than fixing it in
>>> the driver, this series fixes the forwarding path.
>>>
>> Are there any other possible forwarding path in networking stack? TC
>> is one subsystem that could forward such a packet to the bnx2x device,
>> how is that handled ?
>
> So far I have only looked at bridges, openvswitch and macvlan. In
> general, if the code uses dev_forward_skb() it should automatically be
> fine as that invokes is_skb_forwardable(), which we patch.
>
But there are other ways to forward packets, e.g tc-mirred or bpf
redirect. We need to handle all these cases rather than fixing one at
a time. As Jason suggested netif_needs_gso() looks like good function
to validate if a device is capable of handling given GSO packet.


Re: [PATCH 0/3] Check gso_size of packets when forwarding

2018-01-18 Thread Pravin Shelar
On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens  wrote:
> When regular packets are forwarded, we validate their size against the
> MTU of the destination device. However, when GSO packets are
> forwarded, we do not validate their size against the MTU. We
> implicitly assume that when they are segmented, the resultant packets
> will be correctly sized.
>
> This is not always the case.
>
> We observed a case where a packet received on an ibmveth device had a
> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
> device, where it caused a firmware assert. This is described in detail
> at [0] and was the genesis of this series. Rather than fixing it in
> the driver, this series fixes the forwarding path.
>
Are there any other possible forwarding path in networking stack? TC
is one subsystem that could forward such a packet to the bnx2x device,
how is that handled ?

> To fix this:
>
>  - Move a helper in patch 1.
>
>  - Validate GSO segment lengths in is_skb_forwardable() in the GSO
>case, rather than assuming all will be well. This fixes bridges.
>This is patch 2.
>
>  - Open vSwitch uses its own slightly specialised algorithm for
>checking lengths. Wire up checking for that in patch 3.
>
> [0]: https://patchwork.ozlabs.org/patch/859410/
>
> Cc: manish.cho...@cavium.com
> Cc: d...@openvswitch.org
>
> Daniel Axtens (3):
>   net: move skb_gso_mac_seglen to skbuff.h
>   net: is_skb_forwardable: validate length of GSO packet segments
>   openvswitch: drop GSO packets that are too large
>
>  include/linux/skbuff.h  | 16 
>  net/core/dev.c  |  7 ---
>  net/core/skbuff.c   | 34 ++
>  net/openvswitch/vport.c | 37 ++---
>  net/sched/sch_tbf.c | 10 --
>  5 files changed, 84 insertions(+), 20 deletions(-)
>
> --
> 2.14.1
>


Re: [PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread Pravin Shelar
On Fri, Jan 12, 2018 at 12:29 PM, William Tu <u9012...@gmail.com> wrote:
> This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.
>
> The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
> as a nested attribute to support all ERSPAN v1 and v2's fields.
> The current attr is a be32 supporting only one field.  Thus, this
> patch reverts it and later patch will redo it using nested attr.
>
> Signed-off-by: William Tu <u9012...@gmail.com>
> Cc: Jiri Benc <jb...@redhat.com>
> Cc: Pravin Shelar <pshe...@ovn.org>
Acked-by: Pravin B Shelar <pshe...@ovn.org>


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-12 Thread Pravin Shelar
On Fri, Jan 12, 2018 at 12:27 AM, Jiri Benc  wrote:
> On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote:
>> I'd also prefer reverting ceaa001a170e since it's more clean but I
>> also hope to have this feature in 4.15.
>> How long does reverting take? Am I only able to submit the new patch
>> after the reverting is merged? Or I can submit revert and this new
>> patch in one series? I have little experience in reverting, can you
>> suggest which way is better?
>
> Send the revert for net (subject will be "[PATCH net] revert:
> openvswitch: Add erspan tunnel support."). Don't forget to explain why
> you're proposing a revert.
>
> After it is accepted and applied to net.git, wait until the patch
> appears in net-next.git. It may take a little while. After that, send
> the new patch(es) for net-next.
>

I agree, Once we have the V2 interface, this current ERSAN interface
unlikely to be used by any one, so it would be nice to get rid of the
old interface while we can.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-09 Thread Pravin Shelar
On Mon, Jan 8, 2018 at 7:02 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> On 1/6/18 10:57, Pravin Shelar wrote:
>> On Fri, Jan 5, 2018 at 10:59 PM, Ed Swierk <eswi...@skyportsystems.com> 
>> wrote:
>>>
>>>
>>> On Jan 5, 2018 22:17, "Pravin Shelar" <pshe...@ovn.org> wrote:
>>>
>>> On Fri, Jan 5, 2018 at 3:20 PM, Ed Swierk <eswi...@skyportsystems.com>
>>> wrote:
>>>> On Fri, Jan 5, 2018 at 10:14 AM, Ed Swierk <eswi...@skyportsystems.com>
>>>> wrote:
>>>>> On Thu, Jan 4, 2018 at 7:36 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>>>>>> OVS already pull all required headers in skb linear data, so no need
>>>>>> to redo all of it. only check required is the ip-checksum validation.
>>>>>> I think we could avoid it in most of cases by checking skb length to
>>>>>> ipheader length before verifying the ip header-checksum.
>>>>>
>>>>> Shouldn't the IP header checksum be verified even earlier, like in
>>>>> key_extract(), before actually using any of the fields in the IP
>>>>> header?
>>>>
>>>> Something like this for verifying the IP header checksum (not tested):
>>>>
>>> AFAIU openflow does not need this verification, so it is not required
>>> in flow extract.
>>>
>>>
>>> Okay. How about my proposed trimming implementation, caching the pad length
>>> in the ovs cb?
>>>
>> Caching the length is not that simple, OVS actions can change the
>> length. Keeping it consistent with packet would be more work, so lets
>> calculate it in ovs-ct function.
>>
>
> Something like this?
>
Sure, Can you submit formal patch?

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a38c80e..282325d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4084,6 +4084,8 @@ struct sk_buff *skb_checksum_trimmed(struct sk_buff 
> *skb,
>  unsigned int transport_len,
>  __sum16(*skb_chkf)(struct sk_buff *skb));
>
> +int skb_network_trim(struct sk_buff *skb);
> +
>  /**
>   * skb_head_is_locked - Determine if the skb->head is locked down
>   * @skb: skb to check
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 08f5740..c68e927 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4740,6 +4740,41 @@ struct sk_buff *skb_checksum_trimmed(struct sk_buff 
> *skb,
>  }
>  EXPORT_SYMBOL(skb_checksum_trimmed);
>
> +/**
> + * skb_network_trim - trim skb to length specified by the network header
> + * @skb: the skb to trim
> + *
> + * Trims the skb to the length specified by the network header,
> + * removing any trailing padding. Leaves the skb alone if the protocol
> + * is not IP or IPv6. Frees the skb on error.
> + *
> + * Caller needs to pull the skb to the network header.
> + */
> +int skb_network_trim(struct sk_buff *skb)
> +{
> +   unsigned int len;
> +   int err;
> +
> +   switch (skb->protocol) {
> +   case htons(ETH_P_IP):
> +   len = ntohs(ip_hdr(skb)->tot_len);
> +   break;
> +   case htons(ETH_P_IPV6):
> +   len = sizeof(struct ipv6hdr)
> +   + ntohs(ipv6_hdr(skb)->payload_len);
> +   break;
> +   default:
> +   len = skb->len;
> +   }
> +
> +   err = pskb_trim_rcsum(skb, len);
> +   if (unlikely(err))
> +   kfree_skb(skb);
> +
> +   return err;
> +}
> +EXPORT_SYMBOL(skb_network_trim);
> +
>  void __skb_warn_lro_forwarding(const struct sk_buff *skb)
>  {
> net_warn_ratelimited("%s: received packets cannot be forwarded while 
> LRO is enabled\n",
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b27c5c6..73418d3 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -1112,6 +1112,10 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
> *skb,
> nh_ofs = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_ofs);
>
> +   err = skb_network_trim(skb);
> +   if (err)
> +   return err;
> +
> if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
> err = handle_fragments(net, key, info->zone.id, skb);
> if (err)


Re: [PATCH net-next 2/2] openvswitch: add erspan version II support

2018-01-09 Thread Pravin Shelar
On Tue, Jan 9, 2018 at 7:17 AM, William Tu <u9012...@gmail.com> wrote:
> On Mon, Jan 8, 2018 at 4:03 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Fri, Jan 5, 2018 at 2:29 PM, William Tu <u9012...@gmail.com> wrote:
>>> The patch adds support for configuring the erspan version II
>>> fields for openvswitch.
>>>
>> The patch looks good, But it could change userspace API for
>> OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS, how are we going to handle
>> compatibility?
>>
> Thanks.
> Yes, it will break the previous API, which uses
> OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS only to store 32-bit erspan index.
> Should I create another tunnel key attr?
> Something like:
> -   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1,/* compatibility for
> erspan v1, be32 ERSPAN index. */
> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV2,/* supporting both
> v1 and v2 using Nested OVS_ERSPAN_OPT_* */
>

Sounds good, you could deprecate the V1 if v2 can handle both cases.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-08 Thread Pravin Shelar
On Sat, Jan 6, 2018 at 10:57 AM, Pravin Shelar <pshe...@ovn.org> wrote:
> On Fri, Jan 5, 2018 at 10:59 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
>>
>>
>> On Jan 5, 2018 22:17, "Pravin Shelar" <pshe...@ovn.org> wrote:
>>
>> On Fri, Jan 5, 2018 at 3:20 PM, Ed Swierk <eswi...@skyportsystems.com>
>> wrote:
>>> On Fri, Jan 5, 2018 at 10:14 AM, Ed Swierk <eswi...@skyportsystems.com>
>>> wrote:
>>>> On Thu, Jan 4, 2018 at 7:36 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>>>>> OVS already pull all required headers in skb linear data, so no need
>>>>> to redo all of it. only check required is the ip-checksum validation.
>>>>> I think we could avoid it in most of cases by checking skb length to
>>>>> ipheader length before verifying the ip header-checksum.
>>>>
>>>> Shouldn't the IP header checksum be verified even earlier, like in
>>>> key_extract(), before actually using any of the fields in the IP
>>>> header?
>>>
>>> Something like this for verifying the IP header checksum (not tested):
>>>
>> AFAIU openflow does not need this verification, so it is not required
>> in flow extract.
>>
>>
>> Okay. How about my proposed trimming implementation, caching the pad length
>> in the ovs cb?
>>
> Caching the length is not that simple, OVS actions can change the
> length. Keeping it consistent with packet would be more work, so lets
> calculate it in ovs-ct function.

You could make it specific for skb-len-trimming, something like
boolean flag. so that it is easy to reason with.


Re: [PATCH net-next 2/2] openvswitch: add erspan version II support

2018-01-08 Thread Pravin Shelar
On Fri, Jan 5, 2018 at 2:29 PM, William Tu  wrote:
> The patch adds support for configuring the erspan version II
> fields for openvswitch.
>
The patch looks good, But it could change userspace API for
OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS, how are we going to handle
compatibility?

> Signed-off-by: William Tu 
> ---
>  include/uapi/linux/openvswitch.h |  12 +++-
>  net/openvswitch/flow_netlink.c   | 125 
> +++
>  2 files changed, 126 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 4265d7f9e1f2..3b1950c59a0c 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -273,6 +273,16 @@ enum {
>
>  #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
>
> +enum {
> +   OVS_ERSPAN_OPT_UNSPEC,
> +   OVS_ERSPAN_OPT_IDX, /* be32 index */
> +   OVS_ERSPAN_OPT_VER, /* u8 version number */
> +   OVS_ERSPAN_OPT_DIR, /* u8 direction */
> +   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
> +   __OVS_ERSPAN_OPT_MAX,
> +};
> +
> +#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
>
>  /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
>   */
> @@ -363,7 +373,7 @@ enum ovs_tunnel_key_attr {
> OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
> address. */
> OVS_TUNNEL_KEY_ATTR_PAD,
> -   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
> +   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
> __OVS_TUNNEL_KEY_ATTR_MAX
>  };
>
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index bce1f78b0de5..696198cf3765 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -334,8 +334,10 @@ size_t ovs_tun_key_attr_size(void)
>  * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
>  */
> + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
> -   + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_DST */
> -   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
> +   + nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
> +   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
> +* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
> +*/
>  }
>
>  static size_t ovs_nsh_key_attr_size(void)
> @@ -386,6 +388,13 @@ static const struct ovs_len_tbl 
> ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
> [OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
>  };
>
> +static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] 
> = {
> +   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
> +   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
> +   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
> +   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
> +};
> +
>  static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX 
> + 1] = {
> [OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
> [OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
> @@ -402,7 +411,8 @@ static const struct ovs_len_tbl 
> ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
> .next = 
> ovs_vxlan_ext_key_lens },
> [OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct 
> in6_addr) },
> [OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct 
> in6_addr) },
> -   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
> +   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
> +   .next = ovs_erspan_opt_lens },
>  };
>
>  static const struct ovs_len_tbl
> @@ -640,16 +650,78 @@ static int erspan_tun_opt_from_nlattr(const struct 
> nlattr *attr,
>  {
> unsigned long opt_key_offset;
> struct erspan_metadata opts;
> +   struct nlattr *a;
> +   u16 hwid, dir;
> +   int rem;
>
> BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
>
> memset(, 0, sizeof(opts));
> -   opts.u.index = nla_get_be32(attr);
> +   nla_for_each_nested(a, attr, rem) {
> +   int type = nla_type(a);
>
> -   /* Index has only 20-bit */
> -   if (ntohl(opts.u.index) & ~INDEX_MASK) {
> -   OVS_NLERR(log, "ERSPAN index number %x too large.",
> - ntohl(opts.u.index));
> +   if (type > OVS_ERSPAN_OPT_MAX) {
> +   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
> + type, OVS_ERSPAN_OPT_MAX);
> +   return -EINVAL;
> +   }
> +
> +   if 

Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-06 Thread Pravin Shelar
On Fri, Jan 5, 2018 at 10:59 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
>
>
> On Jan 5, 2018 22:17, "Pravin Shelar" <pshe...@ovn.org> wrote:
>
> On Fri, Jan 5, 2018 at 3:20 PM, Ed Swierk <eswi...@skyportsystems.com>
> wrote:
>> On Fri, Jan 5, 2018 at 10:14 AM, Ed Swierk <eswi...@skyportsystems.com>
>> wrote:
>>> On Thu, Jan 4, 2018 at 7:36 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>>>> OVS already pull all required headers in skb linear data, so no need
>>>> to redo all of it. only check required is the ip-checksum validation.
>>>> I think we could avoid it in most of cases by checking skb length to
>>>> ipheader length before verifying the ip header-checksum.
>>>
>>> Shouldn't the IP header checksum be verified even earlier, like in
>>> key_extract(), before actually using any of the fields in the IP
>>> header?
>>
>> Something like this for verifying the IP header checksum (not tested):
>>
> AFAIU openflow does not need this verification, so it is not required
> in flow extract.
>
>
> Okay. How about my proposed trimming implementation, caching the pad length
> in the ovs cb?
>
Caching the length is not that simple, OVS actions can change the
length. Keeping it consistent with packet would be more work, so lets
calculate it in ovs-ct function.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-05 Thread Pravin Shelar
On Fri, Jan 5, 2018 at 3:20 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> On Fri, Jan 5, 2018 at 10:14 AM, Ed Swierk <eswi...@skyportsystems.com>
> wrote:
>> On Thu, Jan 4, 2018 at 7:36 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>>> OVS already pull all required headers in skb linear data, so no need
>>> to redo all of it. only check required is the ip-checksum validation.
>>> I think we could avoid it in most of cases by checking skb length to
>>> ipheader length before verifying the ip header-checksum.
>>
>> Shouldn't the IP header checksum be verified even earlier, like in
>> key_extract(), before actually using any of the fields in the IP
>> header?
>
> Something like this for verifying the IP header checksum (not tested):
>
AFAIU openflow does not need this verification, so it is not required
in flow extract.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-04 Thread Pravin Shelar
On Wed, Jan 3, 2018 at 7:49 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> On Fri, Dec 22, 2017 at 3:31 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswi...@skyportsystems.com> 
>> wrote:
>>> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
>>> included in the L3 length. For example, a short IPv4 packet may have
>>> up to 6 bytes of padding following the IP payload when received on an
>>> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
>>> packet to ip_hdr->tot_len before invoking netfilter hooks (including
>>> conntrack and nat).
>>>
>>> In the IPv6 receive path, ip6_rcv() does the same using
>>> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
>>> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
>>> length before invoking NF_INET_PRE_ROUTING hooks.
>>>
>>> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
>>> the L3 header but does not trim it to the L3 length before calling
>>> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
>>> encounters a packet with lower-layer padding, nf_checksum() fails and
>>> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
>>> affect the checksum, the length in the IP pseudoheader does. That
>>> length is based on skb->len, and without trimming, it doesn't match
>>> the length the sender used when computing the checksum.
>>>
>>> The assumption throughout nf_conntrack and nf_nat is that skb->len
>>> reflects the length of the L3 header and payload, so there is no need
>>> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>>>
>>> This change brings OVS into line with other netfilter users, trimming
>>> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>>>
>>> Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
>>> ---
>>> v2:
>>> - Trim packet in nat receive path as well as conntrack
>>> - Free skb on error
>>> ---
>>>  net/openvswitch/conntrack.c | 34 ++
>>>  1 file changed, 34 insertions(+)
>>>
>>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>>> index b27c5c6..1bdc78f 100644
>>> --- a/net/openvswitch/conntrack.c
>>> +++ b/net/openvswitch/conntrack.c
>>> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>>> return ct_executed;
>>>  }
>>>
>>> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
>>> + * the L3 header. The skb is freed on error.
>>> + */
>>> +static int skb_trim_l3(struct sk_buff *skb)
>>> +{
>>> +   unsigned int nh_len;
>>> +   int err;
>>> +
>>> +   switch (skb->protocol) {
>>> +   case htons(ETH_P_IP):
>>> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
>>> +   break;
>>> +   case htons(ETH_P_IPV6):
>>> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>>> +   + sizeof(struct ipv6hdr);
>>> +   break;
>>> +   default:
>>> +   nh_len = skb->len;
>>> +   }
>>> +
>>> +   err = pskb_trim_rcsum(skb, nh_len);
>>> +   if (err)
>> This should is unlikely.
>>> +   kfree_skb(skb);
>>> +
>>> +   return err;
>>> +}
>>> +
>> This looks like a generic function, it probably does not belong to OVS
>> code base.
>
> It occurs to me that skb_trim_l3() can't just reach into ip_hdr(skb)
> before calling pskb_may_pull(skb, sizeof(struct iphdr)) to make sure
> the IP header is actually there; and for IPv4 it should validate the
> IP header checksum, including options. Once we add all these steps,
> skb_trim_l3() starts to look an awful lot like br_validate_ipv4() and
> br_validate_ipv6(). And those in turn are eerily similar to ip_rcv()
> and ip6_rcv(). It would be nice to avoid duplicating this logic yet
> again.
>
OVS already pull all required headers in skb linear data, so no need
to redo all of it. only check required is the ip-checksum validation.
I think we could avoid it in most of cases by checking skb length to
ipheader length before verifying the ip header-checksum.


Re: [PATCH] MAINTAINERS: Update my email address.

2018-01-04 Thread Pravin Shelar
On Thu, Jan 4, 2018 at 10:38 AM, David Miller <da...@davemloft.net> wrote:
> From: Pravin B Shelar <pshe...@ovn.org>
> Date: Tue,  2 Jan 2018 20:14:42 -0800
>
>> Signed-off-by: Pravin Shelar <pshe...@ovn.org>
>
> Applied, but please take Joe's feedback into consideration.

I was bit busy so could not send update soon. I am planing on updating mailmap.

Thanks.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-02 Thread Pravin Shelar
On Fri, Dec 22, 2017 at 4:39 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> On Fri, Dec 22, 2017 at 3:31 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk <eswi...@skyportsystems.com> 
>> wrote:
>>> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
>>> included in the L3 length. For example, a short IPv4 packet may have
>>> up to 6 bytes of padding following the IP payload when received on an
>>> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
>>> packet to ip_hdr->tot_len before invoking netfilter hooks (including
>>> conntrack and nat).
>>>
>>> In the IPv6 receive path, ip6_rcv() does the same using
>>> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
>>> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
>>> length before invoking NF_INET_PRE_ROUTING hooks.
>>>
>>> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
>>> the L3 header but does not trim it to the L3 length before calling
>>> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
>>> encounters a packet with lower-layer padding, nf_checksum() fails and
>>> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
>>> affect the checksum, the length in the IP pseudoheader does. That
>>> length is based on skb->len, and without trimming, it doesn't match
>>> the length the sender used when computing the checksum.
>>>
>>> The assumption throughout nf_conntrack and nf_nat is that skb->len
>>> reflects the length of the L3 header and payload, so there is no need
>>> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>>>
>>> This change brings OVS into line with other netfilter users, trimming
>>> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>>>
>>> Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
>>> ---
>>> v2:
>>> - Trim packet in nat receive path as well as conntrack
>>> - Free skb on error
>>> ---
>>>  net/openvswitch/conntrack.c | 34 ++
>>>  1 file changed, 34 insertions(+)
>>>
>>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>>> index b27c5c6..1bdc78f 100644
>>> --- a/net/openvswitch/conntrack.c
>>> +++ b/net/openvswitch/conntrack.c
>>> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>>> return ct_executed;
>>>  }
>>>
>>> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
>>> + * the L3 header. The skb is freed on error.
>>> + */
>>> +static int skb_trim_l3(struct sk_buff *skb)
>>> +{
>>> +   unsigned int nh_len;
>>> +   int err;
>>> +
>>> +   switch (skb->protocol) {
>>> +   case htons(ETH_P_IP):
>>> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
>>> +   break;
>>> +   case htons(ETH_P_IPV6):
>>> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>>> +   + sizeof(struct ipv6hdr);
>>> +   break;
>>> +   default:
>>> +   nh_len = skb->len;
>>> +   }
>>> +
>>> +   err = pskb_trim_rcsum(skb, nh_len);
>>> +   if (err)
>> This should is unlikely.
>
> I'll add unlikely().
>
>>> +   kfree_skb(skb);
>>> +
>>> +   return err;
>>> +}
>>> +
>> This looks like a generic function, it probably does not belong to OVS
>> code base.
>
> Indeed. I'll move it to skbuff.c, unless you have a better idea.
>
>>>  #ifdef CONFIG_NF_NAT_NEEDED
>>>  /* Modelled after nf_nat_ipv[46]_fn().
>>>   * range is only used for new, uninitialized NAT state.
>>> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, 
>>> struct nf_conn *ct,
>>>  {
>>> int hooknum, nh_off, err = NF_ACCEPT;
>>>
>>> +   /* The nat module expects to be working at L3. */
>>> nh_off = skb_network_offset(skb);
>>> skb_pull_rcsum(skb, nh_off);
>>> +   err = skb_trim_l3(skb);
>>> +   if (err)
>>> +   return err;
>>>
>> ct-nat is executed within ct action, so I do not see why you you call
>> skb-trim again from ovs_ct_nat_execute().
>> ovs_ct_execute() trim should take care of the skb.
>
> I see. Doesn't that mean that skb_pull_rcsum() is also unnecessary in
> ovs_ct_nat_execute(), as ovs_ct_execute() has already pulled the skb
> to the L3 header?
>

Yes, It looks redundant. but lets address it in separate patch.


Re: [ovs-dev] Pravin Shelar

2018-01-02 Thread Pravin Shelar
On Tue, Jan 2, 2018 at 11:36 AM, David Miller <da...@davemloft.net> wrote:
> From: Pravin Shelar <pshe...@ovn.org>
> Date: Thu, 28 Dec 2017 15:47:39 -0800
>
>> Thanks Joe for the patch. But it is corrupted. I will send updated patch 
>> soon.
>
> I'm still waiting for this, just FYI :)

I sent the patch.

Thanks.


Re: [ovs-dev] Pravin Shelar

2017-12-28 Thread Pravin Shelar
On Wed, Dec 27, 2017 at 10:33 AM, Joe Perches <j...@perches.com> wrote:
> On Wed, 2017-12-27 at 10:25 -0800, Ben Pfaff wrote:
>> On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
>> > The email address pshe...@nicira.com listed for Pravin Shelar in
>> > MAINTAINERS (OPENVSWITCH section) seems to bounce.
>>
>> Pravin has used a newer address recently, so I sent out a suggested
>> update (for OVS):
>> https://patchwork.ozlabs.org/patch/853232/
>
> As Pravin is still active with acks but not any authored patches in
> the
> last year, this should still be updated in the linux-kernel's
> MAINTAINERS
> file too.
> ---
> diff --git a/MAINTAINERS b/MAINTAINERS
> index
> a6e86e20761e..5869e5f0b930 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@
> -10137,7 +10137,7 @@ F: drivers/irqchip/irq-ompic.c
>  F: dri
> vers/irqchip/irq-or1k-*
>
>  OPENVSWITCH
> -M: Pravin Shelar <pshelar@ni
> cira.com>
> +M: Pravin Shelar <pshe...@ovn.org>
>  L: netdev@vge
> r.kernel.org
>  L: d...@openvswitch.org
>  W: http://openvswitch.org

Thanks Joe for the patch. But it is corrupted. I will send updated patch soon.


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2017-12-22 Thread Pravin Shelar
On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk  wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
> packet to ip_hdr->tot_len before invoking netfilter hooks (including
> conntrack and nat).
>
> In the IPv6 receive path, ip6_rcv() does the same using
> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
> length before invoking NF_INET_PRE_ROUTING hooks.
>
> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
> the L3 header but does not trim it to the L3 length before calling
> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
> encounters a packet with lower-layer padding, nf_checksum() fails and
> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
> affect the checksum, the length in the IP pseudoheader does. That
> length is based on skb->len, and without trimming, it doesn't match
> the length the sender used when computing the checksum.
>
> The assumption throughout nf_conntrack and nf_nat is that skb->len
> reflects the length of the L3 header and payload, so there is no need
> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>
> This change brings OVS into line with other netfilter users, trimming
> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>
> Signed-off-by: Ed Swierk 
> ---
> v2:
> - Trim packet in nat receive path as well as conntrack
> - Free skb on error
> ---
>  net/openvswitch/conntrack.c | 34 ++
>  1 file changed, 34 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b27c5c6..1bdc78f 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
> return ct_executed;
>  }
>
> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
> + * the L3 header. The skb is freed on error.
> + */
> +static int skb_trim_l3(struct sk_buff *skb)
> +{
> +   unsigned int nh_len;
> +   int err;
> +
> +   switch (skb->protocol) {
> +   case htons(ETH_P_IP):
> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
> +   break;
> +   case htons(ETH_P_IPV6):
> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> +   + sizeof(struct ipv6hdr);
> +   break;
> +   default:
> +   nh_len = skb->len;
> +   }
> +
> +   err = pskb_trim_rcsum(skb, nh_len);
> +   if (err)
This should is unlikely.
> +   kfree_skb(skb);
> +
> +   return err;
> +}
> +
This looks like a generic function, it probably does not belong to OVS
code base.

>  #ifdef CONFIG_NF_NAT_NEEDED
>  /* Modelled after nf_nat_ipv[46]_fn().
>   * range is only used for new, uninitialized NAT state.
> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, 
> struct nf_conn *ct,
>  {
> int hooknum, nh_off, err = NF_ACCEPT;
>
> +   /* The nat module expects to be working at L3. */
> nh_off = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_off);
> +   err = skb_trim_l3(skb);
> +   if (err)
> +   return err;
>
ct-nat is executed within ct action, so I do not see why you you call
skb-trim again from ovs_ct_nat_execute().
ovs_ct_execute() trim should take care of the skb.

> /* See HOOK2MANIP(). */
> if (maniptype == NF_NAT_MANIP_SRC)
> @@ -,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
> /* The conntrack module expects to be working at L3. */
> nh_ofs = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_ofs);
> +   err = skb_trim_l3(skb);
> +   if (err)
> +   return err;
>
> if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
> err = handle_fragments(net, key, info->zone.id, skb);
> --
> 1.9.1
>


Re: [PATCH] openvswitch: Trim off padding before L3 conntrack processing

2017-12-17 Thread Pravin Shelar
On Thu, Dec 14, 2017 at 12:05 PM, Ed Swierk <eswi...@skyportsystems.com> wrote:
> On Wed, Dec 13, 2017 at 4:58 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Tue, Dec 12, 2017 at 8:17 AM, Ed Swierk <eswi...@skyportsystems.com> 
>> wrote:
>>> A short IPv4 packet may have up to 6 bytes of padding following the IP
>>> payload when received on an Ethernet device.
>>>
>>> In the normal IPv4 receive path, ip_rcv() trims the packet to
>>> ip_hdr->tot_len before invoking NF_INET_PRE_ROUTING hooks (including
>>> conntrack). Then any subsequent L3+ processing steps, like
>>> nf_checksum(), use skb->len as the length of the packet, rather than
>>> referring back to ip_hdr->tot_len. In the IPv6 receive path, ip6_rcv()
>>> does the same using ipv6_hdr->payload_len.
>>>
>>> In the OVS conntrack receive path, this trimming does not occur, so
>>> the checksum verification in tcp_header() fails, printing "nf_ct_tcp:
>>> bad TCP checksum". Extra zero bytes don't affect the checksum, but the
>>> length in the IP pseudoheader does. That length is based on skb->len,
>>> and without trimming, it doesn't match the length the sender used when
>>> computing the checksum.
>>>
>>> With this change, OVS conntrack trims IPv4 and IPv6 packets prior to
>>> L3 processing.
>>>
>>> Signed-off-by: Ed Swierk <eswi...@skyportsystems.com>
>>> ---
>>>  net/openvswitch/conntrack.c | 17 +
>>>  1 file changed, 17 insertions(+)
>>>
>>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>>> index d558e882ca0c..3a7c9215c431 100644
>>> --- a/net/openvswitch/conntrack.c
>>> +++ b/net/openvswitch/conntrack.c
>>> @@ -1105,12 +1105,29 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
>>> *skb,
>>>const struct ovs_conntrack_info *info)
>>>  {
>>> int nh_ofs;
>>> +   unsigned int nh_len;
>>> int err;
>>>
>>> /* The conntrack module expects to be working at L3. */
>>> nh_ofs = skb_network_offset(skb);
>>> skb_pull_rcsum(skb, nh_ofs);
>>>
>>> +   /* Trim to L3 length since nf_checksum() doesn't expect padding. */
>> Can you explore if nf_checksum can be changed to avoid the padding?
>
> The nf_ip_checksum() and nf_ip6_checksum() helper functions can easily
> be changed to avoid the padding.
>
> My worry is that conntrack is just one of many netfilter hooks that
> perform L3+ processing, and may assume that once skb->data points to
> the L3 header, skb->len reflects the length of the L3 header and
> payload. For example, in nf_conntrack_ftp.c, help() uses skb->len to
> determine the length of the FTP payload and the TCP sequence number of
> the next packet; this would be thrown off by lower-layer padding.
>
> br_netfilter, a cousin of OVS, has always preserved this
> assumption--like ip_rcv() and ip6_rcv(), br_validate_ipv4() and
> br_validate_ipv6() trim the skb to the L3 length before they invoke
> NF_INET_PRE_ROUTING hooks. Modifying OVS to fit the mold seems more
> straightforward than changing this assumption.
>
we could avoid extra processing in fast path, thats why I wanted to
explore this, But if it is too complex, I am fine with this patch.

>>> +   switch (skb->protocol) {
>>> +   case htons(ETH_P_IP):
>>> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
>>> +   break;
>>> +   case htons(ETH_P_IPV6):
>>> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>>> +   + sizeof(struct ipv6hdr);
>>> +   break;
>>> +   default:
>>> +   nh_len = skb->len;
>>> +   }
>>> +   err = pskb_trim_rcsum(skb, nh_len);
>>> +   if (err)
>>> +   return err;
>>> +
>> In case of error skb needs to be freed.
>
> Thanks, I will fix this.
>
> --Ed


Re: [PATCH] openvswitch: Trim off padding before L3 conntrack processing

2017-12-13 Thread Pravin Shelar
On Tue, Dec 12, 2017 at 8:17 AM, Ed Swierk  wrote:
> A short IPv4 packet may have up to 6 bytes of padding following the IP
> payload when received on an Ethernet device.
>
> In the normal IPv4 receive path, ip_rcv() trims the packet to
> ip_hdr->tot_len before invoking NF_INET_PRE_ROUTING hooks (including
> conntrack). Then any subsequent L3+ processing steps, like
> nf_checksum(), use skb->len as the length of the packet, rather than
> referring back to ip_hdr->tot_len. In the IPv6 receive path, ip6_rcv()
> does the same using ipv6_hdr->payload_len.
>
> In the OVS conntrack receive path, this trimming does not occur, so
> the checksum verification in tcp_header() fails, printing "nf_ct_tcp:
> bad TCP checksum". Extra zero bytes don't affect the checksum, but the
> length in the IP pseudoheader does. That length is based on skb->len,
> and without trimming, it doesn't match the length the sender used when
> computing the checksum.
>
> With this change, OVS conntrack trims IPv4 and IPv6 packets prior to
> L3 processing.
>
> Signed-off-by: Ed Swierk 
> ---
>  net/openvswitch/conntrack.c | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index d558e882ca0c..3a7c9215c431 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -1105,12 +1105,29 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
> *skb,
>const struct ovs_conntrack_info *info)
>  {
> int nh_ofs;
> +   unsigned int nh_len;
> int err;
>
> /* The conntrack module expects to be working at L3. */
> nh_ofs = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_ofs);
>
> +   /* Trim to L3 length since nf_checksum() doesn't expect padding. */
Can you explore if nf_checksum can be changed to avoid the padding?

> +   switch (skb->protocol) {
> +   case htons(ETH_P_IP):
> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
> +   break;
> +   case htons(ETH_P_IPV6):
> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> +   + sizeof(struct ipv6hdr);
> +   break;
> +   default:
> +   nh_len = skb->len;
> +   }
> +   err = pskb_trim_rcsum(skb, nh_len);
> +   if (err)
> +   return err;
> +
In case of error skb needs to be freed.


Re: [ovs-dev] [PATCH 7/8] net: ovs: remove unused hardirq.h

2017-12-07 Thread Pravin Shelar
On Fri, Nov 17, 2017 at 3:02 PM, Yang Shi <yan...@alibaba-inc.com> wrote:
> Preempt counter APIs have been split out, currently, hardirq.h just
> includes irq_enter/exit APIs which are not used by openvswitch at all.
>
> So, remove the unused hardirq.h.
>
> Signed-off-by: Yang Shi <yan...@alibaba-inc.com>
> Cc: Pravin Shelar <pshe...@nicira.com>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: d...@openvswitch.org

Acked-by: Pravin B Shelar <pshe...@ovn.org>


Re: [PATCH net-next] openvswitch: do not propagate headroom updates to internal port

2017-12-01 Thread Pravin Shelar
On Thu, Nov 30, 2017 at 6:35 AM, Paolo Abeni  wrote:
> After commit 3a927bc7cf9d ("ovs: propagate per dp max headroom to
> all vports") the need_headroom for the internal vport is updated
> accordingly to the max needed headroom in its datapath.
>
> That avoids the pskb_expand_head() costs when sending/forwarding
> packets towards tunnel devices, at least for some scenarios.
>
> We still require such copy when using the ovs-preferred configuration
> for vxlan tunnels:
>
> br_int
>   /   \
> tap  vxlan
>(remote_ip:X)
>
> br_phy
>  \
> NIC
>
> where the route towards the IP 'X' is via 'br_phy'.
>
> When forwarding traffic from the tap towards the vxlan device, we
> will call pskb_expand_head() in vxlan_build_skb() because
> br-phy->needed_headroom is equal to tun->needed_headroom.
>
> With this change we avoid updating the internal vport needed_headroom,
> so that in the above scenario no head copy is needed, giving 5%
> performance improvement in UDP throughput test.
>
> As a trade-off, packets sent from the internal port towards a tunnel
> device will now experience the head copy overhead. The rationale is
> that the latter use-case is less relevant performance-wise.
>
> Signed-off-by: Paolo Abeni 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH] openvswitch: use ktime_get_ts64() instead of ktime_get_ts()

2017-11-27 Thread Pravin Shelar
On Mon, Nov 27, 2017 at 5:11 PM, Arnd Bergmann  wrote:
> timespec is deprecated because of the y2038 overflow, so let's convert
> this one to ktime_get_ts64(). The code is already safe even on 32-bit
> architectures, since it uses monotonic times. On 64-bit architectures,
> nothing changes, while on 32-bit architectures this avoids one
> type conversion.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  net/openvswitch/flow.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index dbe2379329c5..76d050aba7a4 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -56,12 +56,12 @@
>
>  u64 ovs_flow_used_time(unsigned long flow_jiffies)
>  {
> -   struct timespec cur_ts;
> +   struct timespec64 cur_ts;
> u64 cur_ms, idle_ms;
>
> -   ktime_get_ts(_ts);
> +   ktime_get_ts64(_ts);
> idle_ms = jiffies_to_msecs(jiffies - flow_jiffies);
> -   cur_ms = (u64)cur_ts.tv_sec * MSEC_PER_SEC +
> +   cur_ms = (u64)(u32)cur_ts.tv_sec * MSEC_PER_SEC +

I am not sure why is tv_sec converted to u32.

>  cur_ts.tv_nsec / NSEC_PER_MSEC;
>
> return cur_ms - idle_ms;
> --
> 2.9.0
>


Re: [PATCH net] openvswitch: fix the incorrect flow action alloc size

2017-11-26 Thread Pravin Shelar
On Sat, Nov 25, 2017 at 7:32 PM, zhangliping  wrote:
> From: zhangliping 
>
> If we want to add a datapath flow, which has more than 500 vxlan outputs'
> action, we will get the following error reports:
>   openvswitch: netlink: Flow action size 32832 bytes exceeds max
>   openvswitch: netlink: Flow action size 32832 bytes exceeds max
>   openvswitch: netlink: Actions may not be safe on all matching packets
>   ... ...
>
> It seems that we can simply enlarge the MAX_ACTIONS_BUFSIZE to fix it, but
> this is not the root cause. For example, for a vxlan output action, we need
> about 60 bytes for the nlattr, but after it is converted to the flow
> action, it only occupies 24 bytes. This means that we can still support
> more than 1000 vxlan output actions for a single datapath flow under the
> the current 32k max limitation.
>
> So even if the nla_len(attr) is larger than MAX_ACTIONS_BUFSIZE, we
> shouldn't report EINVAL and keep it move on, as the judgement can be
> done by the reserve_sfa_size.
>
> Signed-off-by: zhangliping 

Thanks for the patch.

Acked-by: Pravin B Shelar 


Re: [ovs-dev] [PATCH net-next] openvswitch: Using kfree_rcu() to simplify the code

2017-11-14 Thread Pravin Shelar
On Tue, Nov 14, 2017 at 11:57 AM, Wei Yongjun  wrote:
> The callback function of call_rcu() just calls a kfree(), so we
> can use kfree_rcu() instead of call_rcu() + callback function.
>
> Signed-off-by: Wei Yongjun 

Acked-by: Pravin B Shelar 


Re: [ovs-dev] [PATCH net-next] openvswitch: Make local function ovs_nsh_key_attr_size() static

2017-11-14 Thread Pravin Shelar
On Tue, Nov 14, 2017 at 11:57 AM, Wei Yongjun  wrote:
> Fixes the following sparse warnings:
>
> net/openvswitch/flow_netlink.c:340:8: warning:
>  symbol 'ovs_nsh_key_attr_size' was not declared. Should it be static?
>
Acked-by: Pravin B Shelar 


> Signed-off-by: Wei Yongjun 
> ---
>  net/openvswitch/flow_netlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index 4201f92..d79c364 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -337,7 +337,7 @@ size_t ovs_tun_key_attr_size(void)
> + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
>  }
>
> -size_t ovs_nsh_key_attr_size(void)
> +static size_t ovs_nsh_key_attr_size(void)
>  {
> /* Whenever adding new OVS_NSH_KEY_ FIELDS, we should consider
>  * updating this function.
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next] openvswitch: Fix return value check in ovs_meter_cmd_features()

2017-11-14 Thread Pravin Shelar
On Tue, Nov 14, 2017 at 11:50 AM, Wei Yongjun  wrote:
> In case of error, the function ovs_meter_cmd_reply_start() returns
> ERR_PTR() not NULL. The NULL test in the return value check should
> be replaced with IS_ERR().
>
> Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
> Signed-off-by: Wei Yongjun 

Acked-by: Pravin B Shelar 


Re: [net-next v5 0/4] Openvswitch meter action

2017-11-10 Thread Pravin Shelar
On Sat, Nov 11, 2017 at 1:39 AM, Andy Zhou  wrote:
> This patch series is the first attempt to add openvswitch
> meter support. We have previously experimented with adding
> metering support in nftables. However 1) It was not clear
> how to expose a named nftables object cleanly, and 2)
> the logic that implements metering is quite small, < 100 lines
> of code.
>
> With those two observations, it seems cleaner to add meter
> support in the openvswitch module directly.
>
> ---
>
> v1(RFC)->v2:  remove unused code improve locking
>   and other review comments
> v2 -> v3: rebase
> v3 -> v4: fix undefined "__udivdi3" references on 32 bit builds.
>   use div_u64() instead.
> v4 -> v5: rebase
>
Acked-by: Pravin B Shelar 


Re: [ovs-dev] [PATCH] openvswitch: add null pointer check on upcall

2017-11-09 Thread Pravin Shelar
On Thu, Nov 9, 2017 at 7:29 PM, Colin King  wrote:
> From: Colin Ian King 
>
> upcall may be assigned a NULL pointer as genlmsg_put can potentially
> return a NULL.  Add a null check to avoid a null pointer dereference
> on upcall.
>
> Detected by CoverityScan, CID#728404 ("Dereference null return value")
>
> Fixes: commit ccb1352e76cf ("net: Add Open vSwitch kernel components.")
> Signed-off-by: Colin Ian King 
> ---
>  net/openvswitch/datapath.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 4d38ac044cee..e8fb3d76fa6e 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -461,6 +461,10 @@ static int queue_userspace_packet(struct datapath *dp, 
> struct sk_buff *skb,
>
> upcall = genlmsg_put(user_skb, 0, 0, _packet_genl_family,
>  0, upcall_info->cmd);
> +   if (!upcall) {
> +   err = -ENOBUFS;
> +   goto out;
> +   }
user_skb is allocated according to all required attributes, so
genlmsg_put() can not return null value.


> upcall->dp_ifindex = dp_ifindex;
>
> err = ovs_nla_put_key(key, key, OVS_PACKET_ATTR_KEY, false, user_skb);
> --
> 2.14.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [net-next v4 4/4] openvswitch: Add meter action support

2017-11-09 Thread Pravin Shelar
On Thu, Nov 9, 2017 at 11:00 AM, Andy Zhou  wrote:
> Implements OVS kernel meter action support.
>
> Signed-off-by: Andy Zhou 
> ---
I could not apply this patch on latest net-next.


Re: [net-next v4 3/4] openvswitch: Add meter infrastructure

2017-11-09 Thread Pravin Shelar
On Thu, Nov 9, 2017 at 11:00 AM, Andy Zhou  wrote:
> OVS kernel datapath so far does not support Openflow meter action.
> This is the first stab at adding kernel datapath meter support.
> This implementation supports only drop band type.
>
> Signed-off-by: Andy Zhou 

Acked-by: Pravin B Shelar 


Re: [net-next v4 1/4] openvswitch: Add meter netlink definitions

2017-11-09 Thread Pravin Shelar
On Thu, Nov 9, 2017 at 11:00 AM, Andy Zhou  wrote:
> Meter has its own netlink family. Define netlink messages and attributes
> for communicating with the user space programs.
>
> Signed-off-by: Andy Zhou 

Acked-by: Pravin B Shelar 


Re: [net-next v4 2/4] openvswitch: export get_dp() API.

2017-11-09 Thread Pravin Shelar
On Thu, Nov 9, 2017 at 11:00 AM, Andy Zhou  wrote:
> Later patches will invoke get_dp() outside of datapath.c. Export it.
>
> Signed-off-by: Andy Zhou 
Acked-by: Pravin B Shelar 


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-07 Thread Pravin Shelar
On Tue, Nov 7, 2017 at 3:55 AM, Yang, Yi <yi.y.y...@intel.com> wrote:
> On Tue, Nov 07, 2017 at 03:58:35AM -0800, Pravin Shelar wrote:
>> On Tue, Nov 7, 2017 at 3:28 AM, Yang, Yi <yi.y.y...@intel.com> wrote:
>> > On Tue, Nov 07, 2017 at 06:57:30PM +0800, Pravin Shelar wrote:
>> >> On Mon, Nov 6, 2017 at 4:22 AM, Jiri Benc <jb...@redhat.com> wrote:
>> >> > On Sat, 4 Nov 2017 07:29:46 -0700, Pravin Shelar wrote:
>> >> >> > +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
>> >> >> > +{
>> >> >> > +   struct nshhdr *nh;
>> >> >> > +   size_t length = nsh_hdr_len(pushed_nh);
>> >> >> > +   u8 next_proto;
>> >> >> > +
>> >> >> > +   if (skb->mac_len) {
>> >> >> > +   next_proto = TUN_P_ETHERNET;
>> >> >> > +   } else {
>> >> >> > +   next_proto = tun_p_from_eth_p(skb->protocol);
>> >> >> > +   if (!next_proto)
>> >> >> > +   return -EAFNOSUPPORT;
>> >> >> check for supported protocols can be moved to flow install validation
>> >> >> in __ovs_nla_copy_actions().
>> >> >
>> >> > You mean the check for !next_proto? It needs to be present for
>> >> > correctness of nsh_push. This function has to be self contained, it
>> >> > will be used by more callers than opevswitch, namely tc.
>> >> >
>> >> > It's actually not so much a check for "supported protocols", it's
>> >> > rather a check of return value of a function that converts ethertype to
>> >> > a 1 byte tunnel type. Blindly using a result of a function that may
>> >> > return error would be wrong. Openvswitch is free to perform additional
>> >> > checks but this needs to stay.
>> >> >
>> >> I am not disputing validity of the checks, but it could be done at
>> >> flow install phase.
>> >> For other use case we could refactor code. If it is too complex, I am
>> >> fine with duplicate code that check the protocol in flow install for
>> >> now.
>> >
>> > Ok, I'll add check code in __ovs_nla_copy_actions for both nsh_push and
>> > nsh_pop, but how can we get value of skb->protocol in
>> > __ovs_nla_copy_actions? Is it argument eth_type of
>> > __ovs_nla_copy_actions?
>> >
>> Yes.
>
> So here is newly-added check code, is it ok?
>
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index fa07a17..b64b754 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -3001,20 +3001,34 @@ static int __ovs_nla_copy_actions(struct net *net, 
> const struct nlattr *attr,
> mac_proto = MAC_PROTO_ETHERNET;
> break;
>
> -   case OVS_ACTION_ATTR_PUSH_NSH:
> +   case OVS_ACTION_ATTR_PUSH_NSH: {
> +   u8 next_proto;
> +
next_proto can be moved to the if () block, otherwise, I am fine with
this change.

> +   if (mac_proto != MAC_PROTO_ETHERNET) {
> +   next_proto = tun_p_from_eth_p(eth_type);
> +   if (!next_proto)
> +   return -EINVAL;
> +   }
> mac_proto = MAC_PROTO_NONE;
> if (!validate_nsh(nla_data(a), false, true, true))
> return -EINVAL;
> break;
> +   }
> +
> +   case OVS_ACTION_ATTR_POP_NSH: {
> +   __be16 inner_proto;
>
> -   case OVS_ACTION_ATTR_POP_NSH:
> if (eth_type != htons(ETH_P_NSH))
> return -EINVAL;
> +   inner_proto = tun_p_to_eth_p(key->nsh.base.np);
> +   if (!inner_proto)
> +   return -EINVAL;
> if (key->nsh.base.np == TUN_P_ETHERNET)
> mac_proto = MAC_PROTO_ETHERNET;
> else
> mac_proto = MAC_PROTO_NONE;
> break;
> +   }
>
> default:
> OVS_NLERR(log, "Unknown Action type %d", type);


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-07 Thread Pravin Shelar
On Tue, Nov 7, 2017 at 3:28 AM, Yang, Yi <yi.y.y...@intel.com> wrote:
> On Tue, Nov 07, 2017 at 06:57:30PM +0800, Pravin Shelar wrote:
>> On Mon, Nov 6, 2017 at 4:22 AM, Jiri Benc <jb...@redhat.com> wrote:
>> > On Sat, 4 Nov 2017 07:29:46 -0700, Pravin Shelar wrote:
>> >> > +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
>> >> > +{
>> >> > +   struct nshhdr *nh;
>> >> > +   size_t length = nsh_hdr_len(pushed_nh);
>> >> > +   u8 next_proto;
>> >> > +
>> >> > +   if (skb->mac_len) {
>> >> > +   next_proto = TUN_P_ETHERNET;
>> >> > +   } else {
>> >> > +   next_proto = tun_p_from_eth_p(skb->protocol);
>> >> > +   if (!next_proto)
>> >> > +   return -EAFNOSUPPORT;
>> >> check for supported protocols can be moved to flow install validation
>> >> in __ovs_nla_copy_actions().
>> >
>> > You mean the check for !next_proto? It needs to be present for
>> > correctness of nsh_push. This function has to be self contained, it
>> > will be used by more callers than opevswitch, namely tc.
>> >
>> > It's actually not so much a check for "supported protocols", it's
>> > rather a check of return value of a function that converts ethertype to
>> > a 1 byte tunnel type. Blindly using a result of a function that may
>> > return error would be wrong. Openvswitch is free to perform additional
>> > checks but this needs to stay.
>> >
>> I am not disputing validity of the checks, but it could be done at
>> flow install phase.
>> For other use case we could refactor code. If it is too complex, I am
>> fine with duplicate code that check the protocol in flow install for
>> now.
>
> Ok, I'll add check code in __ovs_nla_copy_actions for both nsh_push and
> nsh_pop, but how can we get value of skb->protocol in
> __ovs_nla_copy_actions? Is it argument eth_type of
> __ovs_nla_copy_actions?
>
Yes.


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-07 Thread Pravin Shelar
On Mon, Nov 6, 2017 at 4:22 AM, Jiri Benc <jb...@redhat.com> wrote:
> On Sat, 4 Nov 2017 07:29:46 -0700, Pravin Shelar wrote:
>> > +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
>> > +{
>> > +   struct nshhdr *nh;
>> > +   size_t length = nsh_hdr_len(pushed_nh);
>> > +   u8 next_proto;
>> > +
>> > +   if (skb->mac_len) {
>> > +   next_proto = TUN_P_ETHERNET;
>> > +   } else {
>> > +   next_proto = tun_p_from_eth_p(skb->protocol);
>> > +   if (!next_proto)
>> > +   return -EAFNOSUPPORT;
>> check for supported protocols can be moved to flow install validation
>> in __ovs_nla_copy_actions().
>
> You mean the check for !next_proto? It needs to be present for
> correctness of nsh_push. This function has to be self contained, it
> will be used by more callers than opevswitch, namely tc.
>
> It's actually not so much a check for "supported protocols", it's
> rather a check of return value of a function that converts ethertype to
> a 1 byte tunnel type. Blindly using a result of a function that may
> return error would be wrong. Openvswitch is free to perform additional
> checks but this needs to stay.
>
I am not disputing validity of the checks, but it could be done at
flow install phase.
For other use case we could refactor code. If it is too complex, I am
fine with duplicate code that check the protocol in flow install for
now.

>> > +int nsh_pop(struct sk_buff *skb)
>> > +{
>> > +   struct nshhdr *nh;
>> > +   size_t length;
>> > +   __be16 inner_proto;
>> > +
>> > +   if (!pskb_may_pull(skb, NSH_BASE_HDR_LEN))
>> > +   return -ENOMEM;
>> > +   nh = (struct nshhdr *)(skb->data);
>> > +   length = nsh_hdr_len(nh);
>> > +   inner_proto = tun_p_to_eth_p(nh->np);
>> same as above, this check can be moved to flow install 
>> __ovs_nla_copy_actions().
>
> There's no check here.
>
>> > +   if (!pskb_may_pull(skb, length))
>> > +   return -ENOMEM;
>> > +
>> > +   if (!inner_proto)
>> > +   return -EAFNOSUPPORT;
>
> Did you mean this one instead? Then see above, this has to stay.
>
>  Jiri


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-06 Thread Pravin Shelar
On Sun, Nov 5, 2017 at 8:19 PM, Yang, Yi <yi.y.y...@intel.com> wrote:
> On Sat, Nov 04, 2017 at 10:29:46PM +0800, Pravin Shelar wrote:
>> On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang <yi.y.y...@intel.com> wrote:
>> > +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
>> > +{
>> > +   struct nshhdr *nh;
>> > +   size_t length = nsh_hdr_len(pushed_nh);
>> > +   u8 next_proto;
>> > +
>> > +   if (skb->mac_len) {
>> > +   next_proto = TUN_P_ETHERNET;
>> > +   } else {
>> > +   next_proto = tun_p_from_eth_p(skb->protocol);
>> > +   if (!next_proto)
>> > +   return -EAFNOSUPPORT;
>> check for supported protocols can be moved to flow install validation
>> in __ovs_nla_copy_actions().
>>
>> > +   }
>> > +
>> > +   /* Add the NSH header */
>> > +   if (skb_cow_head(skb, length) < 0)
>> > +   return -ENOMEM;
>> > +
>> > +   skb_push(skb, length);
>> > +   nh = (struct nshhdr *)(skb->data);
>> > +   memcpy(nh, pushed_nh, length);
>> > +   nh->np = next_proto;
>> > +
>> > +   skb->protocol = htons(ETH_P_NSH);
>> > +   skb_reset_mac_header(skb);
>> > +   skb_reset_network_header(skb);
>> > +   skb_reset_mac_len(skb);
>> > +
>> > +   return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(nsh_push);
>> > +
>> > +int nsh_pop(struct sk_buff *skb)
>> > +{
>> > +   struct nshhdr *nh;
>> > +   size_t length;
>> > +   __be16 inner_proto;
>> > +
>> > +   if (!pskb_may_pull(skb, NSH_BASE_HDR_LEN))
>> > +   return -ENOMEM;
>> > +   nh = (struct nshhdr *)(skb->data);
>> > +   length = nsh_hdr_len(nh);
>> > +   inner_proto = tun_p_to_eth_p(nh->np);
>> same as above, this check can be moved to flow install 
>> __ovs_nla_copy_actions().
>
> Pravin, these two functions are not only for OVS, you can see it is
> net/nsh/nsh.c, Jiri and Eric mentioned they also could be used by TC.
>
I think it can be easily done by other caller of these function, or
you can refactor the APIs itself.

> I understand you expect some checks should be moved to slow path, but
> for there two cases, we can't remove them into __ovs_nla_copy_actions.
>
It is not just about optimization, but having these check in flow
install allows OVS userspace to probe level of  NSH support in
datapath.


Re: [net-next v3 0/4] Openvswitch meter action

2017-11-06 Thread Pravin Shelar
On Mon, Nov 6, 2017 at 12:30 AM, Andy Zhou  wrote:
> This patch series is the first attempt to add openvswitch
> meter support. We have previously experimented with adding
> metering support in nftables. However 1) It was not clear
> how to expose a named nftables object cleanly, and 2)
> the logic that implements metering is quite small, < 100 lines
> of code.
>
> With those two observations, it seems cleaner to add meter
> support in the openvswitch module directly.
>
> ---
>
> v1(RFC)->v2:  remove unused code improve locking
>   and other review comments
> v2 -> v3: rebase
>
Looks good to me.

Acked-by: Pravin B Shelar 

>
> Andy Zhou (4):
>   openvswitch: Add meter netlink definitions
>   openvswitch: export get_dp() API.
>   openvswitch: Add meter infrastructure
>   openvswitch: Add meter action support
>
>  include/uapi/linux/openvswitch.h |  54 
>  net/openvswitch/Makefile |   1 +
>  net/openvswitch/actions.c|   6 +
>  net/openvswitch/datapath.c   |  43 +--
>  net/openvswitch/datapath.h   |  35 +++
>  net/openvswitch/flow_netlink.c   |   6 +
>  net/openvswitch/meter.c  | 604 
> +++
>  net/openvswitch/meter.h  |  54 
>  8 files changed, 772 insertions(+), 31 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
> --
> 1.8.3.1
>


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-04 Thread Pravin Shelar
On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang  wrote:
> v14->v15
>  - Check size in nsh_hdr_from_nlattr
>  - Fixed four small issues pointed out By Jiri and Eric
>
> v13->v14
>  - Rename skb_push_nsh to nsh_push per Dave's comment
>  - Rename skb_pop_nsh to nsh_pop per Dave's comment
>
> v12->v13
>  - Fix NSH header length check in set_nsh
>
> v11->v12
>  - Fix missing changes old comments pointed out
>  - Fix new comments for v11
>
> v10->v11
>  - Fix the left three disputable comments for v9
>but not fixed in v10.
>
> v9->v10
>  - Change struct ovs_key_nsh to
>struct ovs_nsh_key_base base;
>__be32 context[NSH_MD1_CONTEXT_SIZE];
>  - Fix new comments for v9
>
> v8->v9
>  - Fix build error reported by daily intel build
>because nsh module isn't selected by openvswitch
>
> v7->v8
>  - Rework nested value and mask for OVS_KEY_ATTR_NSH
>  - Change pop_nsh to adapt to nsh kernel module
>  - Fix many issues per comments from Jiri Benc
>
> v6->v7
>  - Remove NSH GSO patches in v6 because Jiri Benc
>reworked it as another patch series and they have
>been merged.
>  - Change it to adapt to nsh kernel module added by NSH
>GSO patch series
>
> v5->v6
>  - Fix the rest comments for v4.
>  - Add NSH GSO support for VxLAN-gpe + NSH and
>Eth + NSH.
>
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>for v4.
>
> v3->v4
>  - Add new NSH match field ttl
>  - Update NSH header to the latest format
>which will be final format and won't change
>per its author's confirmation.
>  - Fix comments for v3.
>
> v2->v3
>  - Change OVS_KEY_ATTR_NSH to nested key to handle
>length-fixed attributes and length-variable
>attriubte more flexibly.
>  - Remove struct ovs_action_push_nsh completely
>  - Add code to handle nested attribute for SET_MASKED
>  - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
>to transfer NSH header data.
>  - Fix comments and coding style issues by Jiri and Eric
>
> v1->v2
>  - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
>  - Dynamically allocate struct ovs_action_push_nsh for
>length-variable metadata.
>
> OVS master and 2.8 branch has merged NSH userspace
> patch series, this patch is to enable NSH support
> in kernel data path in order that OVS can support
> NSH in compat mode by porting this.
>
> Signed-off-by: Yi Yang 
As commented earlier following are action related validations that can
be moved to flow install phase.

> ---
>  include/net/nsh.h|   3 +
>  include/uapi/linux/openvswitch.h |  29 
>  net/nsh/nsh.c|  59 
>  net/openvswitch/Kconfig  |   1 +
>  net/openvswitch/actions.c| 119 +++
>  net/openvswitch/flow.c   |  51 +++
>  net/openvswitch/flow.h   |   7 +
>  net/openvswitch/flow_netlink.c   | 315 
> ++-
>  net/openvswitch/flow_netlink.h   |   5 +
>  9 files changed, 588 insertions(+), 1 deletion(-)
>
...

> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 58fb827..2764682 100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,65 @@
>  #include 
>  #include 
>
> +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
> +{
> +   struct nshhdr *nh;
> +   size_t length = nsh_hdr_len(pushed_nh);
> +   u8 next_proto;
> +
> +   if (skb->mac_len) {
> +   next_proto = TUN_P_ETHERNET;
> +   } else {
> +   next_proto = tun_p_from_eth_p(skb->protocol);
> +   if (!next_proto)
> +   return -EAFNOSUPPORT;
check for supported protocols can be moved to flow install validation
in __ovs_nla_copy_actions().

> +   }
> +
> +   /* Add the NSH header */
> +   if (skb_cow_head(skb, length) < 0)
> +   return -ENOMEM;
> +
> +   skb_push(skb, length);
> +   nh = (struct nshhdr *)(skb->data);
> +   memcpy(nh, pushed_nh, length);
> +   nh->np = next_proto;
> +
> +   skb->protocol = htons(ETH_P_NSH);
> +   skb_reset_mac_header(skb);
> +   skb_reset_network_header(skb);
> +   skb_reset_mac_len(skb);
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(nsh_push);
> +
> +int nsh_pop(struct sk_buff *skb)
> +{
> +   struct nshhdr *nh;
> +   size_t length;
> +   __be16 inner_proto;
> +
> +   if (!pskb_may_pull(skb, NSH_BASE_HDR_LEN))
> +   return -ENOMEM;
> +   nh = (struct nshhdr *)(skb->data);
> +   length = nsh_hdr_len(nh);
> +   inner_proto = tun_p_to_eth_p(nh->np);
same as above, this check can be moved to flow install __ovs_nla_copy_actions().

> +   if (!pskb_may_pull(skb, length))
> +   return -ENOMEM;
> +
> +   if (!inner_proto)
> +   return -EAFNOSUPPORT;
> +
> +   skb_pull(skb, length);
> +   skb_reset_mac_header(skb);
> +   skb_reset_network_header(skb);
> +   skb_reset_mac_len(skb);
> +   skb->protocol = inner_proto;
> +

Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-04 Thread Pravin Shelar
On Thu, Nov 2, 2017 at 6:40 PM, Yang, Yi <yi.y.y...@intel.com> wrote:
> On Thu, Nov 02, 2017 at 05:06:47AM -0700, Pravin Shelar wrote:
>> On Wed, Nov 1, 2017 at 7:50 PM, Yang, Yi <yi.y.y...@intel.com> wrote:
>> > On Thu, Nov 02, 2017 at 08:52:40AM +0800, Pravin Shelar wrote:
>> >> On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang <yi.y.y...@intel.com> wrote:
>> >> >
>> >> > OVS master and 2.8 branch has merged NSH userspace
>> >> > patch series, this patch is to enable NSH support
>> >> > in kernel data path in order that OVS can support
>> >> > NSH in compat mode by porting this.
>> >> >
>> >> > Signed-off-by: Yi Yang <yi.y.y...@intel.com>
>> >> > ---
>> >> I have comment related to checksum, otherwise patch looks good to me.
>> >
>> > Pravin, thank you for your comments, the below part is incremental patch
>> > for checksum, please help check it, I'll send out v16 with this after
>> > you confirm.
>> >
>> This change looks good to me.
>> I noticed couple of more issues.
>> 1. Can you move the ovs_key_nsh to the union of ipv4 an ipv6?
>> ipv4/ipv6/nsh key data is mutually exclusive so there is no need for
>> separate space for nsh key in the ovs key.
>> 2. We need to fix match_validate() with nsh check. Datapath can not
>> allow any l3 or l4 match if the flow key contains nsh match and
>> vice-versa. such flow key should be rejected.
>
> Pravin, the below incremental patch should fix the issues you pionted
> out, please help confirm/ack, then I'll send out v16 with all acks
> from you all for merge. BTW, it has been verified in my sfc test
> environment.
>
Following patch looks good to me. But I think we needs similar
eth_type check for nsh set action in validate_set() and in
__ovs_nla_copy_actions() for NSH_POP action.

Can you send patch with all changes?

Thanks.

> diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
> index 8eeae749..c670dd2 100644
> --- a/net/openvswitch/flow.h
> +++ b/net/openvswitch/flow.h
> @@ -149,8 +149,8 @@ struct sw_flow_key {
> } nd;
> };
> } ipv6;
> +   struct ovs_key_nsh nsh; /* network service header */
> };
> -   struct ovs_key_nsh nsh; /* network service header */
> struct {
> /* Connection tracking fields not packed above. */
> struct {
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index 0d7d4ae..090103c 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -178,7 +178,8 @@ static bool match_validate(const struct sw_flow_match 
> *match,
> | (1 << OVS_KEY_ATTR_ICMPV6)
> | (1 << OVS_KEY_ATTR_ARP)
> | (1 << OVS_KEY_ATTR_ND)
> -   | (1 << OVS_KEY_ATTR_MPLS));
> +   | (1 << OVS_KEY_ATTR_MPLS)
> +   | (1 << OVS_KEY_ATTR_NSH));
>
> /* Always allowed mask fields. */
> mask_allowed |= ((1 << OVS_KEY_ATTR_TUNNEL)
> @@ -287,6 +288,14 @@ static bool match_validate(const struct sw_flow_match 
> *match,
> }
> }
>
> +   if (match->key->eth.type == htons(ETH_P_NSH)) {
> +   key_expected |= 1 << OVS_KEY_ATTR_NSH;
> +   if (match->mask &&
> +   match->mask->key.eth.type == htons(0x)) {
> +   mask_allowed |= 1 << OVS_KEY_ATTR_NSH;
> +   }
> +   }
> +
> if ((key_attrs & key_expected) != key_expected) {
> /* Key attributes check failed. */
> OVS_NLERR(log, "Missing key (keys=%llx, expected=%llx)",


Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-11-03 Thread Pravin Shelar
On Thu, Nov 2, 2017 at 7:43 PM, Andy Zhou <az...@ovn.org> wrote:
> On Thu, Nov 2, 2017 at 5:07 AM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Thu, Nov 2, 2017 at 3:07 AM, Andy Zhou <az...@ovn.org> wrote:
>>> On Fri, Oct 20, 2017 at 8:32 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>>>> On Thu, Oct 19, 2017 at 5:58 PM, Andy Zhou <az...@ovn.org> wrote:
>>>>>
>>>>> On Thu, Oct 19, 2017 at 02:47 Pravin Shelar <pshe...@ovn.org> wrote:
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou <az...@ovn.org> wrote:
>>>>>> > OVS kernel datapath so far does not support Openflow meter action.
>>>>>> > This is the first stab at adding kernel datapath meter support.
>>>>>> > This implementation supports only drop band type.
>>>>>> >
>>>>>> > Signed-off-by: Andy Zhou <az...@ovn.org>
>>>>>> > ---
>>>>>> >  net/openvswitch/Makefile   |   1 +
>>>>>> >  net/openvswitch/datapath.c |  14 +-
>>>>>> >  net/openvswitch/datapath.h |   3 +
>>>>>> >  net/openvswitch/meter.c| 604
>>>>>> > +
>>>>>> >  net/openvswitch/meter.h|  54 
>>>>>> >  5 files changed, 674 insertions(+), 2 deletions(-)
>>>>>> >  create mode 100644 net/openvswitch/meter.c
>>>>>> >  create mode 100644 net/openvswitch/meter.h
>>>>>> >
>>>>>> This patch mostly looks good. I have one comment below.
>>>>>>
>>>>>> > +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info
>>>>>> > *info)
>>>>>> > +{
>>>>>> > +   struct nlattr **a = info->attrs;
>>>>>> > +   struct dp_meter *meter, *old_meter;
>>>>>> > +   struct sk_buff *reply;
>>>>>> > +   struct ovs_header *ovs_reply_header;
>>>>>> > +   struct ovs_header *ovs_header = info->userhdr;
>>>>>> > +   struct datapath *dp;
>>>>>> > +   int err;
>>>>>> > +   u32 meter_id;
>>>>>> > +   bool failed;
>>>>>> > +
>>>>>> > +   meter = dp_meter_create(a);
>>>>>> > +   if (IS_ERR_OR_NULL(meter))
>>>>>> > +   return PTR_ERR(meter);
>>>>>> > +
>>>>>> > +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
>>>>>> > + _reply_header);
>>>>>> > +   if (IS_ERR(reply)) {
>>>>>> > +   err = PTR_ERR(reply);
>>>>>> > +   goto exit_free_meter;
>>>>>> > +   }
>>>>>> > +
>>>>>> > +   ovs_lock();
>>>>>> > +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
>>>>>> > +   if (!dp) {
>>>>>> > +   err = -ENODEV;
>>>>>> > +   goto exit_unlock;
>>>>>> > +   }
>>>>>> > +
>>>>>> > +   if (!a[OVS_METER_ATTR_ID]) {
>>>>>> > +   err = -ENODEV;
>>>>>> > +   goto exit_unlock;
>>>>>> > +   }
>>>>>> > +
>>>>>> > +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
>>>>>> > +
>>>>>> > +   /* Cannot fail after this. */
>>>>>> > +   old_meter = lookup_meter(dp, meter_id);
>>>>>> I do not see RCU read lock taken here. This is not correctness issue
>>>>>> but it could cause RCU checker to spit out warning message. You could
>>>>>> do same trick that is done in get_dp() to avoid this issue.
>>>>>
>>>>> O.K.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can you also test the code with rcu sparse check config option enabled?
>>>>>
>>>>>
>>>>> Do you mean to sparse compile with CONFIG_PROVE_LOCKING and
>>>>> CONFIG_DENUG_OBJECTS_RCU_HEAD?
>>>>
>>>> You could use all following options simultaneously:
>>>> CONFIG_PREEMPT
>>>> CONFIG_DEBUG_PREEMPT
>>>> CONFIG_DEBUG_SPINLOCK
>>>> CONFIG_DEBUG_ATOMIC_SLEEP
>>>> CONFIG_PROVE_RCU
>>>> CONFIG_DEBUG_OBJECTS_RCU_HEAD
>>>
>>> Thanks, I turned on those flags but did not get any error message. Do you
>>> mind share the RCU checker message?
>>
>> There would be assert failure and stack trace. so it would be pretty
>> obvious in kernel log messages.
>> Let me know if you do not see any stack trace while running meter
>> create, delete and execute.
>
> No I did not see them.
ok, Can you send out patch against latest net-next?


Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-11-02 Thread Pravin Shelar
On Thu, Nov 2, 2017 at 3:07 AM, Andy Zhou <az...@ovn.org> wrote:
> On Fri, Oct 20, 2017 at 8:32 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Thu, Oct 19, 2017 at 5:58 PM, Andy Zhou <az...@ovn.org> wrote:
>>>
>>> On Thu, Oct 19, 2017 at 02:47 Pravin Shelar <pshe...@ovn.org> wrote:
>>>>
>>>> On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou <az...@ovn.org> wrote:
>>>> > OVS kernel datapath so far does not support Openflow meter action.
>>>> > This is the first stab at adding kernel datapath meter support.
>>>> > This implementation supports only drop band type.
>>>> >
>>>> > Signed-off-by: Andy Zhou <az...@ovn.org>
>>>> > ---
>>>> >  net/openvswitch/Makefile   |   1 +
>>>> >  net/openvswitch/datapath.c |  14 +-
>>>> >  net/openvswitch/datapath.h |   3 +
>>>> >  net/openvswitch/meter.c| 604
>>>> > +
>>>> >  net/openvswitch/meter.h|  54 
>>>> >  5 files changed, 674 insertions(+), 2 deletions(-)
>>>> >  create mode 100644 net/openvswitch/meter.c
>>>> >  create mode 100644 net/openvswitch/meter.h
>>>> >
>>>> This patch mostly looks good. I have one comment below.
>>>>
>>>> > +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info
>>>> > *info)
>>>> > +{
>>>> > +   struct nlattr **a = info->attrs;
>>>> > +   struct dp_meter *meter, *old_meter;
>>>> > +   struct sk_buff *reply;
>>>> > +   struct ovs_header *ovs_reply_header;
>>>> > +   struct ovs_header *ovs_header = info->userhdr;
>>>> > +   struct datapath *dp;
>>>> > +   int err;
>>>> > +   u32 meter_id;
>>>> > +   bool failed;
>>>> > +
>>>> > +   meter = dp_meter_create(a);
>>>> > +   if (IS_ERR_OR_NULL(meter))
>>>> > +   return PTR_ERR(meter);
>>>> > +
>>>> > +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
>>>> > + _reply_header);
>>>> > +   if (IS_ERR(reply)) {
>>>> > +   err = PTR_ERR(reply);
>>>> > +   goto exit_free_meter;
>>>> > +   }
>>>> > +
>>>> > +   ovs_lock();
>>>> > +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
>>>> > +   if (!dp) {
>>>> > +   err = -ENODEV;
>>>> > +   goto exit_unlock;
>>>> > +   }
>>>> > +
>>>> > +   if (!a[OVS_METER_ATTR_ID]) {
>>>> > +   err = -ENODEV;
>>>> > +   goto exit_unlock;
>>>> > +   }
>>>> > +
>>>> > +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
>>>> > +
>>>> > +   /* Cannot fail after this. */
>>>> > +   old_meter = lookup_meter(dp, meter_id);
>>>> I do not see RCU read lock taken here. This is not correctness issue
>>>> but it could cause RCU checker to spit out warning message. You could
>>>> do same trick that is done in get_dp() to avoid this issue.
>>>
>>> O.K.
>>>>
>>>>
>>>>
>>>> Can you also test the code with rcu sparse check config option enabled?
>>>
>>>
>>> Do you mean to sparse compile with CONFIG_PROVE_LOCKING and
>>> CONFIG_DENUG_OBJECTS_RCU_HEAD?
>>
>> You could use all following options simultaneously:
>> CONFIG_PREEMPT
>> CONFIG_DEBUG_PREEMPT
>> CONFIG_DEBUG_SPINLOCK
>> CONFIG_DEBUG_ATOMIC_SLEEP
>> CONFIG_PROVE_RCU
>> CONFIG_DEBUG_OBJECTS_RCU_HEAD
>
> Thanks, I turned on those flags but did not get any error message. Do you
> mind share the RCU checker message?

There would be assert failure and stack trace. so it would be pretty
obvious in kernel log messages.
Let me know if you do not see any stack trace while running meter
create, delete and execute.


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-02 Thread Pravin Shelar
On Wed, Nov 1, 2017 at 7:50 PM, Yang, Yi <yi.y.y...@intel.com> wrote:
> On Thu, Nov 02, 2017 at 08:52:40AM +0800, Pravin Shelar wrote:
>> On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang <yi.y.y...@intel.com> wrote:
>> >
>> > OVS master and 2.8 branch has merged NSH userspace
>> > patch series, this patch is to enable NSH support
>> > in kernel data path in order that OVS can support
>> > NSH in compat mode by porting this.
>> >
>> > Signed-off-by: Yi Yang <yi.y.y...@intel.com>
>> > ---
>> I have comment related to checksum, otherwise patch looks good to me.
>
> Pravin, thank you for your comments, the below part is incremental patch
> for checksum, please help check it, I'll send out v16 with this after
> you confirm.
>
This change looks good to me.
I noticed couple of more issues.
1. Can you move the ovs_key_nsh to the union of ipv4 an ipv6?
ipv4/ipv6/nsh key data is mutually exclusive so there is no need for
separate space for nsh key in the ovs key.
2. We need to fix match_validate() with nsh check. Datapath can not
allow any l3 or l4 match if the flow key contains nsh match and
vice-versa. such flow key should be rejected.


Re: [PATCH net-next v15] openvswitch: enable NSH support

2017-11-01 Thread Pravin Shelar
On Tue, Oct 31, 2017 at 9:03 PM, Yi Yang  wrote:
> v14->v15
>  - Check size in nsh_hdr_from_nlattr
>  - Fixed four small issues pointed out By Jiri and Eric
>
> v13->v14
>  - Rename skb_push_nsh to nsh_push per Dave's comment
>  - Rename skb_pop_nsh to nsh_pop per Dave's comment
>
> v12->v13
>  - Fix NSH header length check in set_nsh
>
> v11->v12
>  - Fix missing changes old comments pointed out
>  - Fix new comments for v11
>
> v10->v11
>  - Fix the left three disputable comments for v9
>but not fixed in v10.
>
> v9->v10
>  - Change struct ovs_key_nsh to
>struct ovs_nsh_key_base base;
>__be32 context[NSH_MD1_CONTEXT_SIZE];
>  - Fix new comments for v9
>
> v8->v9
>  - Fix build error reported by daily intel build
>because nsh module isn't selected by openvswitch
>
> v7->v8
>  - Rework nested value and mask for OVS_KEY_ATTR_NSH
>  - Change pop_nsh to adapt to nsh kernel module
>  - Fix many issues per comments from Jiri Benc
>
> v6->v7
>  - Remove NSH GSO patches in v6 because Jiri Benc
>reworked it as another patch series and they have
>been merged.
>  - Change it to adapt to nsh kernel module added by NSH
>GSO patch series
>
> v5->v6
>  - Fix the rest comments for v4.
>  - Add NSH GSO support for VxLAN-gpe + NSH and
>Eth + NSH.
>
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>for v4.
>
> v3->v4
>  - Add new NSH match field ttl
>  - Update NSH header to the latest format
>which will be final format and won't change
>per its author's confirmation.
>  - Fix comments for v3.
>
> v2->v3
>  - Change OVS_KEY_ATTR_NSH to nested key to handle
>length-fixed attributes and length-variable
>attriubte more flexibly.
>  - Remove struct ovs_action_push_nsh completely
>  - Add code to handle nested attribute for SET_MASKED
>  - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
>to transfer NSH header data.
>  - Fix comments and coding style issues by Jiri and Eric
>
> v1->v2
>  - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
>  - Dynamically allocate struct ovs_action_push_nsh for
>length-variable metadata.
>
> OVS master and 2.8 branch has merged NSH userspace
> patch series, this patch is to enable NSH support
> in kernel data path in order that OVS can support
> NSH in compat mode by porting this.
>
> Signed-off-by: Yi Yang 
> ---
I have comment related to checksum, otherwise patch looks good to me.

>  include/net/nsh.h|   3 +
>  include/uapi/linux/openvswitch.h |  29 
>  net/nsh/nsh.c|  59 
>  net/openvswitch/Kconfig  |   1 +
>  net/openvswitch/actions.c| 119 +++
>  net/openvswitch/flow.c   |  51 +++
>  net/openvswitch/flow.h   |   7 +
>  net/openvswitch/flow_netlink.c   | 315 
> ++-
>  net/openvswitch/flow_netlink.h   |   5 +
>  9 files changed, 588 insertions(+), 1 deletion(-)
>

...
> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 58fb827..2764682 100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,65 @@
>  #include 
>  #include 
>
> +int nsh_push(struct sk_buff *skb, const struct nshhdr *pushed_nh)
> +{
> +   struct nshhdr *nh;
> +   size_t length = nsh_hdr_len(pushed_nh);
> +   u8 next_proto;
> +
> +   if (skb->mac_len) {
> +   next_proto = TUN_P_ETHERNET;
> +   } else {
> +   next_proto = tun_p_from_eth_p(skb->protocol);
> +   if (!next_proto)
> +   return -EAFNOSUPPORT;
> +   }
> +
> +   /* Add the NSH header */
> +   if (skb_cow_head(skb, length) < 0)
> +   return -ENOMEM;
> +
> +   skb_push(skb, length);
> +   nh = (struct nshhdr *)(skb->data);
> +   memcpy(nh, pushed_nh, length);
> +   nh->np = next_proto;
dont you need to update checksum for CHECKSUM_COMPLETE case?

> +
> +   skb->protocol = htons(ETH_P_NSH);
> +   skb_reset_mac_header(skb);
> +   skb_reset_network_header(skb);
> +   skb_reset_mac_len(skb);
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(nsh_push);
> +
> +int nsh_pop(struct sk_buff *skb)
> +{
> +   struct nshhdr *nh;
> +   size_t length;
> +   __be16 inner_proto;
> +
> +   if (!pskb_may_pull(skb, NSH_BASE_HDR_LEN))
> +   return -ENOMEM;
> +   nh = (struct nshhdr *)(skb->data);
> +   length = nsh_hdr_len(nh);
> +   inner_proto = tun_p_to_eth_p(nh->np);
> +   if (!pskb_may_pull(skb, length))
> +   return -ENOMEM;
> +
> +   if (!inner_proto)
> +   return -EAFNOSUPPORT;
> +
> +   skb_pull(skb, length);
same as above, we need to update the checksum.

> +   skb_reset_mac_header(skb);
> +   skb_reset_network_header(skb);
> +   skb_reset_mac_len(skb);
> +   skb->protocol = inner_proto;
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(nsh_pop);
> +
>  static struct sk_buff 

Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-10-20 Thread Pravin Shelar
On Thu, Oct 19, 2017 at 5:58 PM, Andy Zhou <az...@ovn.org> wrote:
>
> On Thu, Oct 19, 2017 at 02:47 Pravin Shelar <pshe...@ovn.org> wrote:
>>
>> On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou <az...@ovn.org> wrote:
>> > OVS kernel datapath so far does not support Openflow meter action.
>> > This is the first stab at adding kernel datapath meter support.
>> > This implementation supports only drop band type.
>> >
>> > Signed-off-by: Andy Zhou <az...@ovn.org>
>> > ---
>> >  net/openvswitch/Makefile   |   1 +
>> >  net/openvswitch/datapath.c |  14 +-
>> >  net/openvswitch/datapath.h |   3 +
>> >  net/openvswitch/meter.c| 604
>> > +
>> >  net/openvswitch/meter.h|  54 
>> >  5 files changed, 674 insertions(+), 2 deletions(-)
>> >  create mode 100644 net/openvswitch/meter.c
>> >  create mode 100644 net/openvswitch/meter.h
>> >
>> This patch mostly looks good. I have one comment below.
>>
>> > +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info
>> > *info)
>> > +{
>> > +   struct nlattr **a = info->attrs;
>> > +   struct dp_meter *meter, *old_meter;
>> > +   struct sk_buff *reply;
>> > +   struct ovs_header *ovs_reply_header;
>> > +   struct ovs_header *ovs_header = info->userhdr;
>> > +   struct datapath *dp;
>> > +   int err;
>> > +   u32 meter_id;
>> > +   bool failed;
>> > +
>> > +   meter = dp_meter_create(a);
>> > +   if (IS_ERR_OR_NULL(meter))
>> > +   return PTR_ERR(meter);
>> > +
>> > +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
>> > + _reply_header);
>> > +   if (IS_ERR(reply)) {
>> > +   err = PTR_ERR(reply);
>> > +   goto exit_free_meter;
>> > +   }
>> > +
>> > +   ovs_lock();
>> > +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
>> > +   if (!dp) {
>> > +   err = -ENODEV;
>> > +   goto exit_unlock;
>> > +   }
>> > +
>> > +   if (!a[OVS_METER_ATTR_ID]) {
>> > +   err = -ENODEV;
>> > +   goto exit_unlock;
>> > +   }
>> > +
>> > +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
>> > +
>> > +   /* Cannot fail after this. */
>> > +   old_meter = lookup_meter(dp, meter_id);
>> I do not see RCU read lock taken here. This is not correctness issue
>> but it could cause RCU checker to spit out warning message. You could
>> do same trick that is done in get_dp() to avoid this issue.
>
> O.K.
>>
>>
>>
>> Can you also test the code with rcu sparse check config option enabled?
>
>
> Do you mean to sparse compile with CONFIG_PROVE_LOCKING and
> CONFIG_DENUG_OBJECTS_RCU_HEAD?

You could use all following options simultaneously:
CONFIG_PREEMPT
CONFIG_DEBUG_PREEMPT
CONFIG_DEBUG_SPINLOCK
CONFIG_DEBUG_ATOMIC_SLEEP
CONFIG_PROVE_RCU
CONFIG_DEBUG_OBJECTS_RCU_HEAD

Thanks,
Pravin.


Re: [net-next v2 3/4] openvswitch: Add meter infrastructure

2017-10-18 Thread Pravin Shelar
On Tue, Oct 17, 2017 at 12:36 AM, Andy Zhou  wrote:
> OVS kernel datapath so far does not support Openflow meter action.
> This is the first stab at adding kernel datapath meter support.
> This implementation supports only drop band type.
>
> Signed-off-by: Andy Zhou 
> ---
>  net/openvswitch/Makefile   |   1 +
>  net/openvswitch/datapath.c |  14 +-
>  net/openvswitch/datapath.h |   3 +
>  net/openvswitch/meter.c| 604 
> +
>  net/openvswitch/meter.h|  54 
>  5 files changed, 674 insertions(+), 2 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
This patch mostly looks good. I have one comment below.

> +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info *info)
> +{
> +   struct nlattr **a = info->attrs;
> +   struct dp_meter *meter, *old_meter;
> +   struct sk_buff *reply;
> +   struct ovs_header *ovs_reply_header;
> +   struct ovs_header *ovs_header = info->userhdr;
> +   struct datapath *dp;
> +   int err;
> +   u32 meter_id;
> +   bool failed;
> +
> +   meter = dp_meter_create(a);
> +   if (IS_ERR_OR_NULL(meter))
> +   return PTR_ERR(meter);
> +
> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
> + _reply_header);
> +   if (IS_ERR(reply)) {
> +   err = PTR_ERR(reply);
> +   goto exit_free_meter;
> +   }
> +
> +   ovs_lock();
> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> +   if (!dp) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   if (!a[OVS_METER_ATTR_ID]) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
> +
> +   /* Cannot fail after this. */
> +   old_meter = lookup_meter(dp, meter_id);
I do not see RCU read lock taken here. This is not correctness issue
but it could cause RCU checker to spit out warning message. You could
do same trick that is done in get_dp() to avoid this issue.

Can you also test the code with rcu sparse check config option enabled?

Thanks.


Re: [net-next RFC 3/4] openvswitch: Add meter infrastructure

2017-10-16 Thread Pravin Shelar
On Mon, Oct 16, 2017 at 12:05 AM, Andy Zhou <az...@ovn.org> wrote:
> On Fri, Oct 13, 2017 at 5:12 PM, Pravin Shelar <pshe...@ovn.org> wrote:
>> On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou <az...@ovn.org> wrote:
>>> OVS kernel datapath so far does not support Openflow meter action.
>>> This is the first stab at adding kernel datapath meter support.
>>> This implementation supports only drop band type.
>>>
>>> Signed-off-by: Andy Zhou <az...@ovn.org>
>>> ---
>>>  net/openvswitch/Makefile   |   1 +
>>>  net/openvswitch/datapath.c |  14 +-
>>>  net/openvswitch/datapath.h |   3 +
>>>  net/openvswitch/meter.c| 611 
>>> +
>>>  net/openvswitch/meter.h|  54 
>>>  5 files changed, 681 insertions(+), 2 deletions(-)
>>>  create mode 100644 net/openvswitch/meter.c
>>>  create mode 100644 net/openvswitch/meter.h
>>>
>> ...
>>
>>> diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
>>> new file mode 100644
>>> index ..f24ebb5f7af4
>>> --- /dev/null
>>> +++ b/net/openvswitch/meter.c
>>
>> 
>> 
>>> +static int ovs_meter_cmd_features(struct sk_buff *skb, struct genl_info 
>>> *info)
>>> +{
>>> +   struct datapath *dp;
>>> +   struct ovs_header *ovs_header = info->userhdr;
>>> +   struct sk_buff *reply;
>>> +   struct ovs_header *ovs_reply_header;
>>> +   struct nlattr *nla, *band_nla;
>>> +   int err;
>>> +
>>> +   /* Check that the datapath exists */
>>> +   ovs_lock();
>>> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
>>> +   ovs_unlock();
>>> +   if (!dp)
>>> +   return -ENODEV;
>>> +
>> why dp check is required for this API?
> Is it possible for another core delete the dp, before ovs_lock()
> returns? Then, in theory, get_dp() can
> return NULL, no?

I do not see dp used anywhere in function. so the question was why is
dp lookup done here?

>>
>>> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_FEATURES,
>>> + _reply_header);
>>> +   if (!reply)
>>> +   return PTR_ERR(reply);
>>> +
>>> +   if (nla_put_u32(reply, OVS_METER_ATTR_MAX_METERS, U32_MAX) ||
>>> +   nla_put_u32(reply, OVS_METER_ATTR_MAX_BANDS, DP_MAX_BANDS))
>>> +   goto nla_put_failure;
>>> +
>>> +   nla = nla_nest_start(reply, OVS_METER_ATTR_BANDS);
>>> +   if (!nla)
>>> +   goto nla_put_failure;
>>> +
>>> +   band_nla = nla_nest_start(reply, OVS_BAND_ATTR_UNSPEC);
>>> +   if (!band_nla)
>>> +   goto nla_put_failure;
>>> +   /* Currently only DROP band type is supported. */
>>> +   if (nla_put_u32(reply, OVS_BAND_ATTR_TYPE, 
>>> OVS_METER_BAND_TYPE_DROP))
>>> +   goto nla_put_failure;
>>> +   nla_nest_end(reply, band_nla);
>>> +   nla_nest_end(reply, nla);
>>> +
>>> +   genlmsg_end(reply, ovs_reply_header);
>>> +   return genlmsg_reply(reply, info);
>>> +
>>> +nla_put_failure:
>>> +   nlmsg_free(reply);
>>> +   err = -EMSGSIZE;
>>> +   return err;
>>> +}
>>> +
>> 
>>
>>> +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info *info)
>>> +{
>>> +   struct nlattr **a = info->attrs;
>>> +   struct dp_meter *meter, *old_meter;
>>> +   struct sk_buff *reply;
>>> +   struct ovs_header *ovs_reply_header;
>>> +   struct ovs_header *ovs_header = info->userhdr;
>>> +   struct datapath *dp;
>>> +   int err;
>>> +   u32 meter_id;
>>> +   bool failed;
>>> +
>>> +   meter = dp_meter_create(a);
>>> +   if (IS_ERR_OR_NULL(meter))
>>> +   return PTR_ERR(meter);
>>> +
>>> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
>>> + _reply_header);
>>> +   if (IS_ERR(reply)) {
>>> +   err = PTR_ERR(reply);
>>> +   goto exit_free_meter;
>>> +   }
>>> +
>>> +   ovs_lock();
>>> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_i

Re: [net-next RFC 4/4] openvswitch: Add meter action support

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> Implements OVS kernel meter action support.
>
> Signed-off-by: Andy Zhou 
> ---
>  include/uapi/linux/openvswitch.h |  1 +
>  net/openvswitch/actions.c| 12 
>  net/openvswitch/datapath.h   |  1 +
>  net/openvswitch/flow_netlink.c   |  6 ++
>  4 files changed, 20 insertions(+)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 325049a129e4..11fe1a06cdd6 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -835,6 +835,7 @@ enum ovs_action_attr {
> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
> +   OVS_ACTION_ATTR_METER,/* u32 meter ID. */
>
> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
>* from userspace. */
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556fcdb5..4eb160ac5a27 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -1210,6 +1210,12 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> case OVS_ACTION_ATTR_POP_ETH:
> err = pop_eth(skb, key);
> break;
> +
> +   case OVS_ACTION_ATTR_METER:
> +   if (ovs_meter_execute(dp, skb, key, nla_get_u32(a))) {
> +   consume_skb(skb);
> +   return 0;
> +   }
> }
>
> if (unlikely(err)) {
> @@ -1341,6 +1347,12 @@ int ovs_execute_actions(struct datapath *dp, struct 
> sk_buff *skb,
> err = do_execute_actions(dp, skb, key,
>  acts->actions, acts->actions_len);
>
> +   /* OVS action has dropped the packet, do not expose it
> +* to the user.
> +*/
> +   if (err == -ENODATA)
> +   err = 0;
> +
I am not sure who is returning this error code?


Re: [net-next RFC 3/4] openvswitch: Add meter infrastructure

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> OVS kernel datapath so far does not support Openflow meter action.
> This is the first stab at adding kernel datapath meter support.
> This implementation supports only drop band type.
>
> Signed-off-by: Andy Zhou 
> ---
>  net/openvswitch/Makefile   |   1 +
>  net/openvswitch/datapath.c |  14 +-
>  net/openvswitch/datapath.h |   3 +
>  net/openvswitch/meter.c| 611 
> +
>  net/openvswitch/meter.h|  54 
>  5 files changed, 681 insertions(+), 2 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
...

> diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
> new file mode 100644
> index ..f24ebb5f7af4
> --- /dev/null
> +++ b/net/openvswitch/meter.c



> +static int ovs_meter_cmd_features(struct sk_buff *skb, struct genl_info 
> *info)
> +{
> +   struct datapath *dp;
> +   struct ovs_header *ovs_header = info->userhdr;
> +   struct sk_buff *reply;
> +   struct ovs_header *ovs_reply_header;
> +   struct nlattr *nla, *band_nla;
> +   int err;
> +
> +   /* Check that the datapath exists */
> +   ovs_lock();
> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> +   ovs_unlock();
> +   if (!dp)
> +   return -ENODEV;
> +
why dp check is required for this API?

> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_FEATURES,
> + _reply_header);
> +   if (!reply)
> +   return PTR_ERR(reply);
> +
> +   if (nla_put_u32(reply, OVS_METER_ATTR_MAX_METERS, U32_MAX) ||
> +   nla_put_u32(reply, OVS_METER_ATTR_MAX_BANDS, DP_MAX_BANDS))
> +   goto nla_put_failure;
> +
> +   nla = nla_nest_start(reply, OVS_METER_ATTR_BANDS);
> +   if (!nla)
> +   goto nla_put_failure;
> +
> +   band_nla = nla_nest_start(reply, OVS_BAND_ATTR_UNSPEC);
> +   if (!band_nla)
> +   goto nla_put_failure;
> +   /* Currently only DROP band type is supported. */
> +   if (nla_put_u32(reply, OVS_BAND_ATTR_TYPE, OVS_METER_BAND_TYPE_DROP))
> +   goto nla_put_failure;
> +   nla_nest_end(reply, band_nla);
> +   nla_nest_end(reply, nla);
> +
> +   genlmsg_end(reply, ovs_reply_header);
> +   return genlmsg_reply(reply, info);
> +
> +nla_put_failure:
> +   nlmsg_free(reply);
> +   err = -EMSGSIZE;
> +   return err;
> +}
> +


> +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info *info)
> +{
> +   struct nlattr **a = info->attrs;
> +   struct dp_meter *meter, *old_meter;
> +   struct sk_buff *reply;
> +   struct ovs_header *ovs_reply_header;
> +   struct ovs_header *ovs_header = info->userhdr;
> +   struct datapath *dp;
> +   int err;
> +   u32 meter_id;
> +   bool failed;
> +
> +   meter = dp_meter_create(a);
> +   if (IS_ERR_OR_NULL(meter))
> +   return PTR_ERR(meter);
> +
> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
> + _reply_header);
> +   if (IS_ERR(reply)) {
> +   err = PTR_ERR(reply);
> +   goto exit_free_meter;
> +   }
> +
> +   ovs_lock();
> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> +   if (!dp) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   if (!a[OVS_METER_ATTR_ID]) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
> +
> +   /* Cannot fail after this. */
> +   old_meter = lookup_meter(dp, meter_id);
> +   attach_meter(dp, meter);
> +   ovs_unlock();
> +
After the unlock, it is not safe to keep the ref to old_meter. better
to release lock at the end. we could optimize it later if required.

> +   /* Build response with the meter_id and stats from
> +* the old meter, if any.
> +*/
> +   failed = nla_put_u32(reply, OVS_METER_ATTR_ID, meter_id);
> +   WARN_ON(failed);
> +   if (old_meter) {
> +   spin_lock_bh(_meter->lock);
> +   if (old_meter->keep_stats) {
> +   err = ovs_meter_cmd_reply_stats(reply, meter_id,
> +   old_meter);
> +   WARN_ON(err);
> +   }
> +   spin_unlock_bh(_meter->lock);
> +   ovs_meter_free(old_meter);
> +   }
> +
> +   genlmsg_end(reply, ovs_reply_header);
> +   return genlmsg_reply(reply, info);
> +
> +exit_unlock:
> +   ovs_unlock();
> +   nlmsg_free(reply);
> +exit_free_meter:
> +   kfree(meter);
> +   return err;
> +}
> +


> +bool ovs_meter_execute(struct datapath *dp, struct sk_buff *skb,
> +  

Re: [net-next RFC 0/4] Openvswitch meter action

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> This patch series is the first attempt to add openvswitch
> meter support. We have previously experimented with adding
> metering support in nftables. However 1) It was not clear
> how to expose a named nftables object cleanly, and 2)
> the logic that implements metering is quite small, < 100 lines
> of code.
>
> With those two observations, it seems cleaner to add meter
> support in the openvswitch module directly.
>
>
Thanks for working on this feature. It looks good to me. I have couple
of comments inlined.

> Andy Zhou (4):
>   openvswitch: Add meter netlink definitions
>   openvswitch: export get_dp() API.
>   openvswitch: Add meter infrastructure
>   openvswitch: Add meter action support
>
>  include/uapi/linux/openvswitch.h |  52 
>  net/openvswitch/Makefile |   1 +
>  net/openvswitch/actions.c|  12 +
>  net/openvswitch/datapath.c   |  43 +--
>  net/openvswitch/datapath.h   |  35 +++
>  net/openvswitch/flow_netlink.c   |   6 +
>  net/openvswitch/meter.c  | 611 
> +++
>  net/openvswitch/meter.h  |  54 
>  8 files changed, 783 insertions(+), 31 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
> --
> 1.8.3.1
>


Re: [PATCH net-next v2] openvswitch: add ct_clear action

2017-10-10 Thread Pravin Shelar
On Tue, Oct 10, 2017 at 1:54 PM, Eric Garver  wrote:
> This adds a ct_clear action for clearing conntrack state. ct_clear is
> currently implemented in OVS userspace, but is not backed by an action
> in the kernel datapath. This is useful for flows that may modify a
> packet tuple after a ct lookup has already occurred.
>
> Signed-off-by: Eric Garver 
> ---
> v2:
> - Use IP_CT_UNTRACKED for nf_ct_set()
> - Only fill key if previously conntracked
>
Looks good.
Acked-by: Pravin B Shelar 


Re: [PATCH net-next] openvswitch: add ct_clear action

2017-10-09 Thread Pravin Shelar
On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver  wrote:
> This adds a ct_clear action for clearing conntrack state. ct_clear is
> currently implemented in OVS userspace, but is not backed by an action
> in the kernel datapath. This is useful for flows that may modify a
> packet tuple after a ct lookup has already occurred.
>
> Signed-off-by: Eric Garver 
Patch mostly looks good. I have following comments.

> ---
>  include/uapi/linux/openvswitch.h |  2 ++
>  net/openvswitch/actions.c|  5 +
>  net/openvswitch/conntrack.c  | 12 
>  net/openvswitch/conntrack.h  |  7 +++
>  net/openvswitch/flow_netlink.c   |  5 +
>  5 files changed, 31 insertions(+)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 156ee4cab82e..1b6e510e2cc6 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>   * packet.
>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>   * packet.
> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>   *
>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not 
> all
>   * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
> @@ -835,6 +836,7 @@ enum ovs_action_attr {
> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
> +   OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>
> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
>* from userspace. */
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556fcdb5..db9c7f2e662b 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> return err == -EINPROGRESS ? 0 : err;
> break;
>
> +   case OVS_ACTION_ATTR_CT_CLEAR:
> +   err = ovs_ct_clear(skb, key);
> +   break;
> +
> case OVS_ACTION_ATTR_PUSH_ETH:
> err = push_eth(skb, key, nla_data(a));
> break;
> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> case OVS_ACTION_ATTR_POP_ETH:
> err = pop_eth(skb, key);
> break;
> +
> }
Unrelated change.

>
> if (unlikely(err)) {
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index d558e882ca0c..f9b73c726ad7 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
> *skb,
> return err;
>  }
>
> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +   if (skb_nfct(skb)) {
> +   nf_conntrack_put(skb_nfct(skb));
> +   nf_ct_set(skb, NULL, 0);
Can the new conntract state be appropriate? may be IP_CT_UNTRACKED?

> +   }
> +
> +   ovs_ct_fill_key(skb, key);
> +
I do not see need to refill the key if there is no skb-nf-ct.

> +   return 0;
> +}
> +
>  static int ovs_ct_add_helper(struct ovs_conntrack_info *info, const char 
> *name,
>  const struct sw_flow_key *key, bool log)
>  {


Re: [PATCHv2 net-next] openvswitch: Add erspan tunnel support.

2017-10-08 Thread Pravin Shelar
On Wed, Oct 4, 2017 at 5:03 PM, William Tu  wrote:
> Add erspan netlink interface for OVS.
>
> Signed-off-by: William Tu 
> Cc: Pravin B Shelar 
> ---
> v1->v2: remove unnecessary compat code.

Looks good to me.


Re: [PATCH net-next] openvswitch: Add erspan tunnel support.

2017-10-04 Thread Pravin Shelar
On Wed, Oct 4, 2017 at 5:02 AM, William Tu  wrote:
> Add type II erspan vport implementation.  Since erspan protocol is
> on top of the GRE header, the implementation is extended from the
> existing gre implementation.
>
> Signed-off-by: William Tu 
> Cc: Pravin B Shelar 

Why are you adding ERSPAN support to compat code. Isn't this supported
over OVS netlink-rtnl (dpif-netlink-rtnl)?


Re: [PATCH net-next v9] openvswitch: enable NSH support

2017-09-28 Thread Pravin Shelar
On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi  wrote:
> On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
>> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
>> > After push_nsh, the packet won't be recirculated to flow pipeline, so
>> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
>> > will be recirculated to flow pipeline, it will be reparsed, so
>> > key->eth.type will be set in packet parse function, we needn't handle it
>> > in pop_nsh.
>>
>> This seems to be a very different approach than what we currently have.
>> Looking at the code, the requirement after "destructive" actions such
>> as pushing or popping headers is to recirculate.
>
> This is optimization proposed by Jan Scheurich, recurculating after push_nsh
> will impact on performance, recurculating after pop_nsh is unavoidable, So
> also cc jan.scheur...@ericsson.com.
>
> Actucally all the keys before push_nsh are still there after push_nsh,
> push_nsh has updated all the nsh keys, so recirculating remains avoidable.
>


We should keep existing model for this patch. Later you can submit
optimization patch with specific use cases and performance
improvement. So that we can evaluate code complexity and benefits.

>>
>> Setting key->eth.type to satisfy conditions in the output path without
>> updating the rest of the key looks very hacky and fragile to me. There
>> might be other conditions and dependencies that are not obvious.
>> I don't think the code was written with such code path in mind.
>>
>> I'd like to hear what Pravin thinks about this.
>>
>>  Jiri


Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-13 Thread Pravin Shelar
On Wed, Sep 13, 2017 at 4:15 AM, 严海双 <yanhaishu...@cmss.chinamobile.com> wrote:
>
>
>> On 2017年9月13日, at 上午7:43, Pravin Shelar <pshe...@ovn.org> wrote:
>>
>> On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan
>> <yanhaishu...@cmss.chinamobile.com> wrote:
>>> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata
>>> mode, tos should also fallback to ip{4,6}_dst_hoplimit.
>>>
>>> Signed-off-by: Haishuang Yan <yanhaishu...@cmss.chinamobile.com>
>>>
>>> ---
>>> Changes since v2:
>>>  * Make the commit message more clearer.
>>> ---
>>> drivers/net/geneve.c | 6 ++
>>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
>>> index f640407..d52a65f 100644
>>> --- a/drivers/net/geneve.c
>>> +++ b/drivers/net/geneve.c
>>> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, 
>>> struct net_device *dev,
>>>sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
>>>if (geneve->collect_md) {
>>>tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
>>> -   ttl = key->ttl;
>>>} else {
>>>tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
>>> -   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
>>>}
>>> +   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
>>>df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>>>
>> This changes user API of Geneve collect-metadata mode. I do not see
>> good reason for this. Why user can not set right TTL for the flow?
>>
>
> For example, I test this case with script samples/bpf/test_tunnel_bpf.sh,
> and modify samples/bpf/tcbpf2_kern.c with following patch:
>
But user is suppose to specify the TTL in collect-md mode for Geneve
tunnel. That is the API.


Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-12 Thread Pravin Shelar
On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan
 wrote:
> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata
> mode, tos should also fallback to ip{4,6}_dst_hoplimit.
>
> Signed-off-by: Haishuang Yan 
>
> ---
> Changes since v2:
>   * Make the commit message more clearer.
> ---
>  drivers/net/geneve.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index f640407..d52a65f 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
> net_device *dev,
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->collect_md) {
> tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
> -   ttl = key->ttl;
> } else {
> tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
> -   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
> }
> +   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
> df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>
This changes user API of Geneve collect-metadata mode. I do not see
good reason for this. Why user can not set right TTL for the flow?


Re: [PATCH v4 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-09-12 Thread Pravin Shelar
On Tue, Sep 12, 2017 at 2:47 AM, Haishuang Yan
 wrote:
> In collect_md mode, if the tun dev is down, it still can call
> ip_tunnel_rcv to receive on packets, and the rx statistics increase
> improperly.
>
> When the md tunnel is down, it's not neccessary to increase RX drops
> for the tunnel device, packets would be recieved on fallback tunnel,
> and the RX drops on fallback device will be increased as expected.
>
> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
> Cc: Pravin B Shelar 
> Signed-off-by: Haishuang Yan 
Acked-by: Pravin B Shelar 


Re: [PATCH v2] openvswitch: Fix an error handling path in 'ovs_nla_init_match_and_action()'

2017-09-11 Thread Pravin Shelar
On Mon, Sep 11, 2017 at 12:56 PM, Christophe JAILLET
 wrote:
> All other error handling paths in this function go through the 'error'
> label. This one should do the same.
>
> Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
> Signed-off-by: Christophe JAILLET 
> ---
> I think that the comment above the function could be improved. It looks
> like the commit log which has introduced this function.
>
> I'm also not sure that commit 9cc9a5cb176c is of any help. It is
> supposed to remove a warning, and I guess it does. But 
> 'ovs_nla_init_match_and_action()'
> is called unconditionnaly from 'ovs_flow_cmd_set()'. So even if the stack
> used by each function is reduced, the overall stack should be the same, if
> not larger.
>
It depends on which function stack depth are are looking at. for some
function it remains same. For nested function it goes down.

> So this commit sounds like adding a bug where the code was fine and states
> to fix an issue but, at the best, only hides it.
>
> Instead of fixing the code with the proposed patch, reverting the initial
> commit could also be considered.
>
> V2: update Subject line
> ---
>  net/openvswitch/datapath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 76cf273a56c7..c3aec6227c91 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -1112,7 +1112,8 @@ static int ovs_nla_init_match_and_action(struct net 
> *net,
> if (!a[OVS_FLOW_ATTR_KEY]) {
> OVS_NLERR(log,
>   "Flow key attribute not present in set 
> flow.");
> -   return -EINVAL;
> +   error = -EINVAL;
> +   goto error;
> }
>
Patch looks good to me.

Acked-by: Pravin B Shelar 


Re: [PATCH] geneve: Fix setting ttl value in collect metadata mode

2017-09-05 Thread Pravin Shelar
On Sun, Sep 3, 2017 at 5:49 AM, Haishuang Yan
 wrote:
> If key->tos is zero in collect metadata mode, tos should fallback to
> ip{4,6}_dst_hoplimit, same as normal mode.
>
> Signed-off-by: Haishuang Yan 
> ---
>  drivers/net/geneve.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index f640407..d52a65f 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
> net_device *dev,
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->collect_md) {
> tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
> -   ttl = key->ttl;
> } else {
> tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
> -   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
> }
> +   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
In collect md mode, geneve has to set TTL value from tunnel metadata.
That is the API exposed to userspace. is there reason for this change?


> df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>
> err = geneve_build_skb(>dst, skb, info, xnet, sizeof(struct 
> iphdr));
> @@ -873,12 +872,11 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
> net_device *dev,
> sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
> if (geneve->collect_md) {
> prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
> -   ttl = key->ttl;
> } else {
> prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
>ip_hdr(skb), skb);
> -   ttl = key->ttl ? : ip6_dst_hoplimit(dst);
> }
> +   ttl = key->ttl ? : ip6_dst_hoplimit(dst);
> err = geneve_build_skb(dst, skb, info, xnet, sizeof(struct ipv6hdr));
> if (unlikely(err))
> return err;
> --
> 1.8.3.1
>
>
>


Re: [PATCH net V3] openvswitch: fix skb_panic due to the incorrect actions attrlen

2017-08-16 Thread Pravin Shelar
On Tue, Aug 15, 2017 at 10:30 PM, Liping Zhang  wrote:
> From: Liping Zhang 
>
> For sw_flow_actions, the actions_len only represents the kernel part's
> size, and when we dump the actions to the userspace, we will do the
> convertions, so it's true size may become bigger than the actions_len.
>
> But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
> to alloc the skbuff, so the user_skb's size may become insufficient and
> oops will happen like this:
>   skbuff: skb_over_panic: text:8148fabf len:1749 put:157 head:
>   881300f39000 data:881300f39000 tail:0x6d5 end:0x6c0 dev:
>   [ cut here ]
>   kernel BUG at net/core/skbuff.c:129!
>   [...]
>   Call Trace:
>
>[] skb_put+0x43/0x44
>[] skb_zerocopy+0x6c/0x1f4
>[] queue_userspace_packet+0x3a3/0x448 [openvswitch]
>[] ovs_dp_upcall+0x30/0x5c [openvswitch]
>[] output_userspace+0x132/0x158 [openvswitch]
>[] ? ip6_rcv_finish+0x74/0x77 [ipv6]
>[] do_execute_actions+0xcc1/0xdc8 [openvswitch]
>[] ovs_execute_actions+0x74/0x106 [openvswitch]
>[] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
>[] ? key_extract+0x63c/0x8d5 [openvswitch]
>[] ovs_vport_receive+0xa1/0xc3 [openvswitch]
>   [...]
>
> Also we can find that the actions_len is much little than the orig_len:
>   crash> struct sw_flow_actions 0x8812f539d000
>   struct sw_flow_actions {
> rcu = {
>   next = 0x8812f5398800,
>   func = 0xe3b00035db32
> },
> orig_len = 1384,
> actions_len = 592,
> actions = 0x8812f539d01c
>   }
>
> So as a quick fix, use the orig_len instead of the actions_len to alloc
> the user_skb.
>
> Last, this oops happened on our system running a relative old kernel, but
> the same risk still exists on the mainline, since we use the wrong
> actions_len from the beginning.
>
> Fixes: ccea74457bbd ("openvswitch: include datapath actions with 
> sampled-packet upcall to userspace")
> Cc: Neil McKee 
> Signed-off-by: Liping Zhang 

Looks good.
Acked-by: Pravin B Shelar 


  1   2   3   4   5   >