from:"Yi\-Hung Wei"

Re: [PATCH net] openvswitch: Fix upcall OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS

2020-11-10 Thread Yi-Hung Wei

On Sat, Nov 7, 2020 at 11:46 AM Jakub Kicinski  wrote:
>
> On Tue,  3 Nov 2020 16:11:34 -0800 Yi-Hung Wei wrote:
> > TUNNEL_GENEVE_OPT is set on tun_flags in struct sw_flow_key when
> > a packet is coming from a geneve tunnel no matter the size of geneve
> > option is zero or not.  On the other hand, TUNNEL_VXLAN_OPT or
> > TUNNEL_ERSPAN_OPT is set when the VXLAN or ERSPAN option is available.
> > Currently, ovs kernel module only generates
> > OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS when the tun_opts_len is non-zero.
> > As a result, for a geneve packet without tun_metadata, when the packet
> > is reinserted from userspace after upcall, we miss the TUNNEL_GENEVE_OPT
> > in the tun_flags on struct sw_flow_key, and that will further cause
> > megaflow matching issue.
> >
> > This patch changes the way that we deal with the upcall netlink message
> > generation to make sure the geneve tun_flags is set consistently
> > as the packet is firstly received from the geneve tunnel in order to
> > avoid megaflow matching issue demonstrated by the following flows.
> > This issue is only observed on ovs kernel datapath.
> >
> > Consider the following two flows, and the two cases.
> > * flow1: icmp traffic from gnv0 to gnv1, without any tun_metadata
> > * flow2: icmp traffic form gnv0 to gnv1 with tun_metadata0
> >
> > Case 1)
> > Send flow2 first, and then send flow1.  When both flows are running,
> > both the following two flows are hit, which is expected.
> >
> > table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
> > action=output:gnv1
> > table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x1/0xff 
> > action=output:gnv1
> >
> > Case 2)
> > Send flow1 first, then send flow2.  When both flows are running,
> > only the following flow is hit.
> > table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
> > action=output:gnv1
> >
> > Example flows)
> >
> > table=0, arp, actions=NORMAL
> > table=0, in_port=gnv1, icmp, action=ct(table=1)
> > table=0, in_port=gnv0, icmp  
> > action=move:NXM_NX_TUN_METADATA0[0..7]->NXM_NX_REG9[0..7], resubmit(,1)
> > table=1, in_port=gnv1, icmp, action=output:gnv0
> > table=1, in_port=gnv0, icmp  action=ct(table=2)
> > table=2, priority=300, in_port=gnv0, icmp, ct_state=+trk+new, 
> > action=ct(commit),output:gnv1
> > table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
> > action=output:gnv1
> > table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x1/0xff 
> > action=output:gnv1
> >
> > Fixes: fc4099f17240 ("openvswitch: Fix egress tunnel info.")
> > Signed-off-by: Yi-Hung Wei 
>
> Wouldn't it make more sense to make GENEVE behave like the other
> tunnels rather than hack this logic into consumer of the flag?

Thanks Jakub for the review.  Yes, it makes sense to fix on the tunnel
side. I submit another patch to fix it:

https://lore.kernel.org/netdev/1605053800-74072-1-git-send-email-yihung@gmail.com/T/#u

> Please make sure that you CC authors of the patch you're fixing
> and the maintainers of the code you're changing.
>
> ./scripts/get_maintainer.pl is your friend.

Thanks for reminding me.  Will do that from now on.


-Yi-Hung

Re: [PATCH] ip_tunnels: Set tunnel option flag when tunnel metadata is present

2020-11-10 Thread Yi-Hung Wei

On Tue, Nov 10, 2020 at 4:17 PM Yi-Hung Wei  wrote:
>
> Currently, we may set the tunnel option flag when the size of metadata
> is zero.  For example, we set TUNNEL_GENEVE_OPT in the receive function
> no matter the geneve option is present or not.  As this may result in
> issues on the tunnel flags consumers, this patch fixes the issue.
>
> Related discussion:
> * 
> https://lore.kernel.org/netdev/1604448694-19351-1-git-send-email-yihung@gmail.com/T/#u
>
> Fixes: 256c87c17c53 ("net: check tunnel option type in tunnel flags")
> Signed-off-by: Yi-Hung Wei 
> ---

Sorry that I did not indicate in the subject line of the previous
email.  It should be "[PATCH net]".

-Yi-Hung

[PATCH] ip_tunnels: Set tunnel option flag when tunnel metadata is present

2020-11-10 Thread Yi-Hung Wei

Currently, we may set the tunnel option flag when the size of metadata
is zero.  For example, we set TUNNEL_GENEVE_OPT in the receive function
no matter the geneve option is present or not.  As this may result in
issues on the tunnel flags consumers, this patch fixes the issue.

Related discussion:
* 
https://lore.kernel.org/netdev/1604448694-19351-1-git-send-email-yihung@gmail.com/T/#u

Fixes: 256c87c17c53 ("net: check tunnel option type in tunnel flags")
Signed-off-by: Yi-Hung Wei 
---
 drivers/net/geneve.c | 3 +--
 include/net/ip_tunnels.h | 7 ---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index d07008a818df..1426bfc009bc 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -224,8 +224,7 @@ static void geneve_rx(struct geneve_dev *geneve, struct 
geneve_sock *gs,
if (ip_tunnel_collect_metadata() || gs->collect_md) {
__be16 flags;
 
-   flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT |
-   (gnvh->oam ? TUNNEL_OAM : 0) |
+   flags = TUNNEL_KEY | (gnvh->oam ? TUNNEL_OAM : 0) |
(gnvh->critical ? TUNNEL_CRIT_OPT : 0);
 
tun_dst = udp_tun_rx_dst(skb, geneve_get_sk_family(gs), flags,
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 02ccd32542d0..61620677b034 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -478,9 +478,11 @@ static inline void ip_tunnel_info_opts_set(struct 
ip_tunnel_info *info,
   const void *from, int len,
   __be16 flags)
 {
-   memcpy(ip_tunnel_info_opts(info), from, len);
info->options_len = len;
-   info->key.tun_flags |= flags;
+   if (len > 0) {
+   memcpy(ip_tunnel_info_opts(info), from, len);
+   info->key.tun_flags |= flags;
+   }
 }
 
 static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state 
*lwtstate)
@@ -526,7 +528,6 @@ static inline void ip_tunnel_info_opts_set(struct 
ip_tunnel_info *info,
   __be16 flags)
 {
info->options_len = 0;
-   info->key.tun_flags |= flags;
 }
 
 #endif /* CONFIG_INET */
-- 
2.7.4

[PATCH net] openvswitch: Fix upcall OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS

2020-11-03 Thread Yi-Hung Wei

TUNNEL_GENEVE_OPT is set on tun_flags in struct sw_flow_key when
a packet is coming from a geneve tunnel no matter the size of geneve
option is zero or not.  On the other hand, TUNNEL_VXLAN_OPT or
TUNNEL_ERSPAN_OPT is set when the VXLAN or ERSPAN option is available.
Currently, ovs kernel module only generates
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS when the tun_opts_len is non-zero.
As a result, for a geneve packet without tun_metadata, when the packet
is reinserted from userspace after upcall, we miss the TUNNEL_GENEVE_OPT
in the tun_flags on struct sw_flow_key, and that will further cause
megaflow matching issue.

This patch changes the way that we deal with the upcall netlink message
generation to make sure the geneve tun_flags is set consistently
as the packet is firstly received from the geneve tunnel in order to
avoid megaflow matching issue demonstrated by the following flows.
This issue is only observed on ovs kernel datapath.

Consider the following two flows, and the two cases.
* flow1: icmp traffic from gnv0 to gnv1, without any tun_metadata
* flow2: icmp traffic form gnv0 to gnv1 with tun_metadata0

Case 1)
Send flow2 first, and then send flow1.  When both flows are running,
both the following two flows are hit, which is expected.

table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
action=output:gnv1
table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x1/0xff 
action=output:gnv1

Case 2)
Send flow1 first, then send flow2.  When both flows are running,
only the following flow is hit.
table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
action=output:gnv1

Example flows)

table=0, arp, actions=NORMAL
table=0, in_port=gnv1, icmp, action=ct(table=1)
table=0, in_port=gnv0, icmp  
action=move:NXM_NX_TUN_METADATA0[0..7]->NXM_NX_REG9[0..7], resubmit(,1)
table=1, in_port=gnv1, icmp, action=output:gnv0
table=1, in_port=gnv0, icmp  action=ct(table=2)
table=2, priority=300, in_port=gnv0, icmp, ct_state=+trk+new, 
action=ct(commit),output:gnv1
table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x0/0xff 
action=output:gnv1
table=2, priority=200, in_port=gnv0, icmp, ct_state=+trk+est, reg9=0x1/0xff 
action=output:gnv1

Fixes: fc4099f17240 ("openvswitch: Fix egress tunnel info.")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow_netlink.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 9d3e50c4d29f..b03ec6a1a1fa 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -912,13 +912,13 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
if ((output->tun_flags & TUNNEL_OAM) &&
nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
return -EMSGSIZE;
-   if (swkey_tun_opts_len) {
-   if (output->tun_flags & TUNNEL_GENEVE_OPT &&
-   nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
+   if (output->tun_flags & TUNNEL_GENEVE_OPT) {
+   if (nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
swkey_tun_opts_len, tun_opts))
return -EMSGSIZE;
-   else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
-vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
+   } else if (swkey_tun_opts_len) {
+   if (output->tun_flags & TUNNEL_VXLAN_OPT &&
+   vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
return -EMSGSIZE;
else if (output->tun_flags & TUNNEL_ERSPAN_OPT &&
 nla_put(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
-- 
2.7.4

[PATCH net-next v2] openvswitch: Allow attaching helper in later commit

2019-10-04 Thread Yi-Hung Wei

This patch allows to attach conntrack helper to a confirmed conntrack
entry.  Currently, we can only attach alg helper to a conntrack entry
when it is in the unconfirmed state.  This patch enables an use case
that we can firstly commit a conntrack entry after it passed some
initial conditions.  After that the processing pipeline will further
check a couple of packets to determine if the connection belongs to
a particular application, and attach alg helper to the connection
in a later stage.

Signed-off-by: Yi-Hung Wei 
---
v1->v2, Use logical OR instead of bitwise OR as Dave suggested.

---
 net/openvswitch/conntrack.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 05249eb45082..df9c80bf621d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -971,6 +971,8 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
 
ct = nf_ct_get(skb, &ctinfo);
if (ct) {
+   bool add_helper = false;
+
/* Packets starting a new connection must be NATted before the
 * helper, so that the helper knows about the NAT.  We enforce
 * this by delaying both NAT and helper calls for unconfirmed
@@ -988,16 +990,17 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
}
 
/* Userspace may decide to perform a ct lookup without a helper
-* specified followed by a (recirculate and) commit with one.
-* Therefore, for unconfirmed connections which we will commit,
-* we need to attach the helper here.
+* specified followed by a (recirculate and) commit with one,
+* or attach a helper in a later commit.  Therefore, for
+* connections which we will commit, we may need to attach
+* the helper here.
 */
-   if (!nf_ct_is_confirmed(ct) && info->commit &&
-   info->helper && !nfct_help(ct)) {
+   if (info->commit && info->helper && !nfct_help(ct)) {
int err = __nf_ct_try_assign_helper(ct, info->ct,
GFP_ATOMIC);
if (err)
return err;
+   add_helper = true;
 
/* helper installed, add seqadj if NAT is required */
if (info->nat && !nfct_seqadj(ct)) {
@@ -1007,11 +1010,13 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
}
 
/* Call the helper only if:
-* - nf_conntrack_in() was executed above ("!cached") for a
-*   confirmed connection, or
+* - nf_conntrack_in() was executed above ("!cached") or a
+*   helper was just attached ("add_helper") for a confirmed
+*   connection, or
 * - When committing an unconfirmed connection.
 */
-   if ((nf_ct_is_confirmed(ct) ? !cached : info->commit) &&
+   if ((nf_ct_is_confirmed(ct) ? !cached || add_helper :
+ info->commit) &&
ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
return -EINVAL;
}
-- 
2.7.4

Re: [PATCH] openvswitch: Allow attaching helper in later commit

2019-10-03 Thread Yi-Hung Wei

On Thu, Oct 3, 2019 at 8:31 AM David Miller  wrote:
>
> From: Yi-Hung Wei 
> Date: Mon, 30 Sep 2019 12:39:04 -0700
>
> > - if ((nf_ct_is_confirmed(ct) ? !cached : info->commit) &&
> > + if ((nf_ct_is_confirmed(ct) ? !cached | add_helper :
>
> I would suggest using "||" instea of "|" here since you are computing
> a boolean.

Thanks for review.  It makes sense to use "||" rather than "|".  I
will wait a bit to gather more feedback before I send v2.

Thanks,

-Yi-Hung

[PATCH] openvswitch: Allow attaching helper in later commit

2019-09-30 Thread Yi-Hung Wei

This patch allows to attach conntrack helper to a confirmed conntrack
entry.  Currently, we can only attach alg helper to a conntrack entry
when it is in the unconfirmed state.  This patch enables an use case
that we can firstly commit a conntrack entry after it passed some
initial conditions.  After that the processing pipeline will further
check a couple of packets to determine if the connection belongs to
a particular application, and attach alg helper to the connection
in a later stage.

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/conntrack.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 05249eb45082..4eb3d2748b65 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -971,6 +971,8 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
 
ct = nf_ct_get(skb, &ctinfo);
if (ct) {
+   bool add_helper = false;
+
/* Packets starting a new connection must be NATted before the
 * helper, so that the helper knows about the NAT.  We enforce
 * this by delaying both NAT and helper calls for unconfirmed
@@ -988,16 +990,17 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
}
 
/* Userspace may decide to perform a ct lookup without a helper
-* specified followed by a (recirculate and) commit with one.
-* Therefore, for unconfirmed connections which we will commit,
-* we need to attach the helper here.
+* specified followed by a (recirculate and) commit with one,
+* or attach a helper in a later commit.  Therefore, for
+* connections which we will commit, we may need to attach
+* the helper here.
 */
-   if (!nf_ct_is_confirmed(ct) && info->commit &&
-   info->helper && !nfct_help(ct)) {
+   if (info->commit && info->helper && !nfct_help(ct)) {
int err = __nf_ct_try_assign_helper(ct, info->ct,
GFP_ATOMIC);
if (err)
return err;
+   add_helper = true;
 
/* helper installed, add seqadj if NAT is required */
if (info->nat && !nfct_seqadj(ct)) {
@@ -1007,11 +1010,13 @@ static int __ovs_ct_lookup(struct net *net, struct 
sw_flow_key *key,
}
 
/* Call the helper only if:
-* - nf_conntrack_in() was executed above ("!cached") for a
-*   confirmed connection, or
+* - nf_conntrack_in() was executed above ("!cached") or a
+*   helper was just attached ("add_helper") for a confirmed
+*   connection, or
 * - When committing an unconfirmed connection.
 */
-   if ((nf_ct_is_confirmed(ct) ? !cached : info->commit) &&
+   if ((nf_ct_is_confirmed(ct) ? !cached | add_helper :
+ info->commit) &&
ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
return -EINVAL;
}
-- 
2.7.4

Re: [PATCH net v2] openvswitch: Fix conntrack cache with timeout

2019-08-23 Thread Yi-Hung Wei

On Thu, Aug 22, 2019 at 11:51 PM Pravin Shelar  wrote:
>
> On Thu, Aug 22, 2019 at 1:28 PM Yi-Hung Wei  wrote:
> >
> > This patch addresses a conntrack cache issue with timeout policy.
> > Currently, we do not check if the timeout extension is set properly in the
> > cached conntrack entry.  Thus, after packet recirculate from conntrack
> > action, the timeout policy is not applied properly.  This patch fixes the
> > aforementioned issue.
> >
> > Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
> > Reported-by: kbuild test robot 
> > Signed-off-by: Yi-Hung Wei 
> > ---
> > v1->v2: Fix rcu dereference issue reported by kbuild test robot.
> > ---
> >  net/openvswitch/conntrack.c | 13 +
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> > index 848c6eb55064..4d7896135e73 100644
> > --- a/net/openvswitch/conntrack.c
> > +++ b/net/openvswitch/conntrack.c
> > @@ -1657,6 +1666,10 @@ int ovs_ct_copy_action(struct net *net, const struct 
> > nlattr *attr,
> >   ct_info.timeout))
> > pr_info_ratelimited("Failed to associated timeout "
> > "policy `%s'\n", 
> > ct_info.timeout);
> > +   else
> > +   ct_info.nf_ct_timeout = rcu_dereference(
> > +   nf_ct_timeout_find(ct_info.ct)->timeout);
> Is this dereference safe from NULL pointer?

Hi Pravin,

Thanks for your review.  I am not sure if
nf_ct_timeout_find(ct_info.ct) will return NULL in this case.

We only run into this statement when ct_info.timeout[0] is set, and it
is only set in parse_ct() when CONFIG_NF_CONNTRACK_TIMEOUT is
configured.  Also, in this else condition the timeout extension is
supposed to be set properly by nf_ct_set_timeout().

Am I missing something?

Thanks,

-Yi-Hung

[PATCH net v2] openvswitch: Fix conntrack cache with timeout

2019-08-22 Thread Yi-Hung Wei

This patch addresses a conntrack cache issue with timeout policy.
Currently, we do not check if the timeout extension is set properly in the
cached conntrack entry.  Thus, after packet recirculate from conntrack
action, the timeout policy is not applied properly.  This patch fixes the
aforementioned issue.

Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
Reported-by: kbuild test robot 
Signed-off-by: Yi-Hung Wei 
---
v1->v2: Fix rcu dereference issue reported by kbuild test robot.
---
 net/openvswitch/conntrack.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 848c6eb55064..4d7896135e73 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -67,6 +67,7 @@ struct ovs_conntrack_info {
struct md_mark mark;
struct md_labels labels;
char timeout[CTNL_TIMEOUT_NAME_MAX];
+   struct nf_ct_timeout *nf_ct_timeout;
 #if IS_ENABLED(CONFIG_NF_NAT)
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -697,6 +698,14 @@ static bool skb_nfct_cached(struct net *net,
if (help && rcu_access_pointer(help->helper) != info->helper)
return false;
}
+   if (info->nf_ct_timeout) {
+   struct nf_conn_timeout *timeout_ext;
+
+   timeout_ext = nf_ct_timeout_find(ct);
+   if (!timeout_ext || info->nf_ct_timeout !=
+   rcu_dereference(timeout_ext->timeout))
+   return false;
+   }
/* Force conntrack entry direction to the current packet? */
if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) {
/* Delete the conntrack entry if confirmed, else just release
@@ -1657,6 +1666,10 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
  ct_info.timeout))
pr_info_ratelimited("Failed to associated timeout "
"policy `%s'\n", ct_info.timeout);
+   else
+   ct_info.nf_ct_timeout = rcu_dereference(
+   nf_ct_timeout_find(ct_info.ct)->timeout);
+
}
 
if (helper) {
-- 
2.7.4

Re: [PATCH net] openvswitch: Fix conntrack cache with timeout

2019-08-22 Thread Yi-Hung Wei

On Thu, Aug 22, 2019 at 11:12 AM kbuild test robot  wrote:
>
> Hi Yi-Hung,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on net/master]
>
> url:
> https://github.com/0day-ci/linux/commits/Yi-Hung-Wei/openvswitch-Fix-conntrack-cache-with-timeout/20190822-212539
> reproduce:
> # apt-get install sparse
> # sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
> make ARCH=x86_64 allmodconfig
> make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
>
>
> sparse warnings: (new ones prefixed by >>)
>
>include/linux/sched.h:609:43: sparse: sparse: bad integer constant 
> expression
>include/linux/sched.h:609:73: sparse: sparse: invalid named zero-width 
> bitfield `value'
>include/linux/sched.h:610:43: sparse: sparse: bad integer constant 
> expression
>include/linux/sched.h:610:67: sparse: sparse: invalid named zero-width 
> bitfield `bucket_id'
> >> net/openvswitch/conntrack.c:706:41: sparse: sparse: incompatible types in 
> >> comparison expression (different address spaces):
> >> net/openvswitch/conntrack.c:706:41: sparse:struct nf_ct_timeout *
> >> net/openvswitch/conntrack.c:706:41: sparse:struct nf_ct_timeout 
> >> [noderef]  *

My v1 does not take care of the rcu pointer properly.  I will fix the
reported issue and send v2.

Thanks,

-Yi-Hung

[PATCH net] openvswitch: Fix log message in ovs conntrack

2019-08-21 Thread Yi-Hung Wei

Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 45498fcf540d..0d5ab4957ec0 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1574,7 +1574,7 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
case OVS_CT_ATTR_TIMEOUT:
memcpy(info->timeout, nla_data(a), nla_len(a));
if (!memchr(info->timeout, '\0', nla_len(a))) {
-   OVS_NLERR(log, "Invalid conntrack helper");
+   OVS_NLERR(log, "Invalid conntrack timeout");
return -EINVAL;
}
break;
-- 
2.7.4

[PATCH net] openvswitch: Fix conntrack cache with timeout

2019-08-21 Thread Yi-Hung Wei

This patch addresses a conntrack cache issue with timeout policy.
Currently, we do not check if the timeout extension is set properly in the
cached conntrack entry.  Thus, after packet recirculate from conntrack
action, the timeout policy is not applied properly.  This patch fixes the
aforementioned issue.

Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/conntrack.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 848c6eb55064..45498fcf540d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -67,6 +67,7 @@ struct ovs_conntrack_info {
struct md_mark mark;
struct md_labels labels;
char timeout[CTNL_TIMEOUT_NAME_MAX];
+   struct nf_ct_timeout *nf_ct_timeout;
 #if IS_ENABLED(CONFIG_NF_NAT)
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -697,6 +698,14 @@ static bool skb_nfct_cached(struct net *net,
if (help && rcu_access_pointer(help->helper) != info->helper)
return false;
}
+   if (info->nf_ct_timeout) {
+   struct nf_conn_timeout *timeout_ext;
+
+   timeout_ext = nf_ct_timeout_find(ct);
+   if (!timeout_ext ||
+   info->nf_ct_timeout != timeout_ext->timeout)
+   return false;
+   }
/* Force conntrack entry direction to the current packet? */
if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) {
/* Delete the conntrack entry if confirmed, else just release
@@ -1657,6 +1666,10 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
  ct_info.timeout))
pr_info_ratelimited("Failed to associated timeout "
"policy `%s'\n", ct_info.timeout);
+   else
+   ct_info.nf_ct_timeout =
+   nf_ct_timeout_find(ct_info.ct)->timeout;
+
}
 
if (helper) {
-- 
2.7.4

Re: [PATCH net-next] openvswitch: use after free in __ovs_ct_free_action()

2019-04-02 Thread Yi-Hung Wei

On Mon, Apr 1, 2019 at 11:53 PM Dan Carpenter  wrote:
>
> We free "ct_info->ct" and then use it on the next line when we pass it
> to nf_ct_destroy_timeout().  This patch swaps the order to avoid the use
> after free.
>
> Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
> Signed-off-by: Dan Carpenter 
> ---
Thanks for the fix.

Acked-by: Yi-Hung Wei

[PATCH net-next v4 2/2] openvswitch: Add timeout support to ct action

2019-03-26 Thread Yi-Hung Wei

Add support for fine-grain timeout support to conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)

CC: Pravin Shelar 
CC: Pablo Neira Ayuso 
Signed-off-by: Yi-Hung Wei 
---
v1-> v2: Utilize nf_ct_set_timeout().
v2-> v4: No change in this patch, the build issue is resolved by patch 1.
---
 include/uapi/linux/openvswitch.h |  3 +++
 net/openvswitch/conntrack.c  | 30 +-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dbe0cbe4f1b7..00ec98836cf3 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -734,6 +734,7 @@ struct ovs_action_hash {
  * be received on NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups,
  * respectively.  Remaining bits control the changes for which an event is
  * delivered on the NFNLGRP_CONNTRACK_UPDATE group.
+ * @OVS_CT_ATTR_TIMEOUT: Variable length string defining conntrack timeout.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -746,6 +747,8 @@ enum ovs_ct_attr {
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
+   OVS_CT_ATTR_TIMEOUT,/* Associate timeout with this connection for
+* fine-grain timeout tuning. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 51080004677e..c4adbaeadda4 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -73,6 +74,7 @@ struct ovs_conntrack_info {
u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
+   char timeout[CTNL_TIMEOUT_NAME_MAX];
 #ifdef CONFIG_NF_NAT_NEEDED
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -1465,6 +1467,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
 #endif
[OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
+   [OVS_CT_ATTR_TIMEOUT] = { .minlen = 1,
+ .maxlen = CTNL_TIMEOUT_NAME_MAX },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1550,6 +1554,15 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
info->have_eventmask = true;
info->eventmask = nla_get_u32(a);
break;
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   case OVS_CT_ATTR_TIMEOUT:
+   memcpy(info->timeout, nla_data(a), nla_len(a));
+   if (!memchr(info->timeout, '\0', nla_len(a))) {
+   OVS_NLERR(log, "Invalid conntrack helper");
+   return -EINVAL;
+   }
+   break;
+#endif
 
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
@@ -1631,6 +1644,14 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return -ENOMEM;
}
+
+   if (ct_info.timeout[0]) {
+   if (nf_ct_set_timeout(net, ct_info.ct, family, key->ip.proto,
+ ct_info.timeout))
+   pr_info_ratelimited("Failed to associated timeout "
+   "policy `%s'\n", ct_info.timeout);
+   }
+
if (helper) {
err = ovs_ct_add_helper(&ct_info, helper, key, log);
if (err)
@@ -1751,6 +1772,10 @@ int ovs_ct_action_to_attr(const struct 
ovs_conntrack_info *ct_info,
if (ct_info->have_eventmask &&
nla_put_u32(skb, OVS_CT_ATTR_EVENTMASK, ct_info->eventmask))
return -EMSGSIZE;
+   if (ct_info->timeout[0]) {
+   if (nla_put_string(skb, OVS_CT_ATTR_TIMEOUT, ct_info->timeout))
+   return -EMSGSIZE;
+   }
 
 #ifdef CONFIG_NF_NAT_NEEDED
if (ct_info->nat && !ovs_ct_nat_to_attr(ct_info, skb))
@@ -1772,8 +1797,11 @@ static void __ovs_ct_free_action(struct 
ovs_conntrack_info *ct_info)
 {

[PATCH net-next v4 1/2] netfilter: Export nf_ct_{set,destroy}_timeout()

2019-03-26 Thread Yi-Hung Wei

This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso 
CC: Pravin Shelar 
Signed-off-by: Yi-Hung Wei 
---
v1-> v2: Export nf_ct_set_timeout().
v2-> v3: Fix build issue when CONFIG_NF_CONNTRACK_TIMEOUT is not set.
v3-> v4: Remove unnessary #ifdef CONFIG_NF_CONNTRACK_TIMEOUT
---
 include/net/netfilter/nf_conntrack_timeout.h | 15 +
 net/netfilter/nf_conntrack_timeout.c | 89 ++
 net/netfilter/xt_CT.c| 93 ++--
 3 files changed, 110 insertions(+), 87 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_timeout.h 
b/include/net/netfilter/nf_conntrack_timeout.h
index 3394d75e1c80..00a8fbb2d735 100644
--- a/include/net/netfilter/nf_conntrack_timeout.h
+++ b/include/net/netfilter/nf_conntrack_timeout.h
@@ -88,6 +88,9 @@ static inline unsigned int *nf_ct_timeout_lookup(const struct 
nf_conn *ct)
 int nf_conntrack_timeout_init(void);
 void nf_conntrack_timeout_fini(void);
 void nf_ct_untimeout(struct net *net, struct nf_ct_timeout *timeout);
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct, u8 l3num, u8 l4num,
+ const char *timeout_name);
+void nf_ct_destroy_timeout(struct nf_conn *ct);
 #else
 static inline int nf_conntrack_timeout_init(void)
 {
@@ -98,6 +101,18 @@ static inline void nf_conntrack_timeout_fini(void)
 {
 return;
 }
+
+static inline int nf_ct_set_timeout(struct net *net, struct nf_conn *ct,
+   u8 l3num, u8 l4num,
+   const char *timeout_name)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void nf_ct_destroy_timeout(struct nf_conn *ct)
+{
+   return;
+}
 #endif /* CONFIG_NF_CONNTRACK_TIMEOUT */
 
 #ifdef CONFIG_NF_CONNTRACK_TIMEOUT
diff --git a/net/netfilter/nf_conntrack_timeout.c 
b/net/netfilter/nf_conntrack_timeout.c
index 91fbd183da2d..edac8ea4436d 100644
--- a/net/netfilter/nf_conntrack_timeout.c
+++ b/net/netfilter/nf_conntrack_timeout.c
@@ -48,6 +48,95 @@ void nf_ct_untimeout(struct net *net, struct nf_ct_timeout 
*timeout)
 }
 EXPORT_SYMBOL_GPL(nf_ct_untimeout);
 
+static void __nf_ct_timeout_put(struct nf_ct_timeout *timeout)
+{
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+   if (timeout_put)
+   timeout_put(timeout);
+}
+
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct,
+ u8 l3num, u8 l4num, const char *timeout_name)
+{
+   typeof(nf_ct_timeout_find_get_hook) timeout_find_get;
+   struct nf_ct_timeout *timeout;
+   struct nf_conn_timeout *timeout_ext;
+   const char *errmsg = NULL;
+   int ret = 0;
+
+   rcu_read_lock();
+   timeout_find_get = rcu_dereference(nf_ct_timeout_find_get_hook);
+   if (!timeout_find_get) {
+   ret = -ENOENT;
+   errmsg = "Timeout policy base is empty";
+   goto out;
+   }
+
+   timeout = timeout_find_get(net, timeout_name);
+   if (!timeout) {
+   ret = -ENOENT;
+   pr_info_ratelimited("No such timeout policy \"%s\"\n",
+   timeout_name);
+   goto out;
+   }
+
+   if (timeout->l3num != l3num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 3, timeout->l3num);
+   goto err_put_timeout;
+   }
+   /* Make sure the timeout policy matches any existing protocol tracker,
+* otherwise default to generic.
+*/
+   if (timeout->l4proto->l4proto != l4num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 4, timeout->l4proto->l4proto);
+   goto err_put_timeout;
+   }
+   timeout_ext = nf_ct_timeout_ext_add(ct, timeout, GFP_ATOMIC);
+   if (!timeout_ext) {
+   ret = -ENOMEM;
+   goto err_put_timeout;
+   }
+
+   rcu_read_unlock();
+   return ret;
+
+err_put_timeout:
+   __nf_ct_timeout_put(timeout);
+out:
+   rcu_read_unlock();
+   if (errmsg)
+   pr_info_ratelimited("%s\n", errmsg);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(nf_ct_set_timeout);
+
+void nf_ct_destroy_timeout(struct nf_conn *ct)
+{
+   struct nf_conn_tim

[PATCH net-next v3 2/2] openvswitch: Add timeout support to ct action

2019-03-25 Thread Yi-Hung Wei

Add support for fine-grain timeout support to conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)

CC: Pravin Shelar 
CC: Pablo Neira Ayuso 
Signed-off-by: Yi-Hung Wei 
---
v1-> v2: Utilize nf_ct_set_timeout().
v2-> v3: No change in this patch, the build issue is resolved by patch 1.
---
 include/uapi/linux/openvswitch.h |  3 +++
 net/openvswitch/conntrack.c  | 30 +-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dbe0cbe4f1b7..00ec98836cf3 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -734,6 +734,7 @@ struct ovs_action_hash {
  * be received on NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups,
  * respectively.  Remaining bits control the changes for which an event is
  * delivered on the NFNLGRP_CONNTRACK_UPDATE group.
+ * @OVS_CT_ATTR_TIMEOUT: Variable length string defining conntrack timeout.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -746,6 +747,8 @@ enum ovs_ct_attr {
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
+   OVS_CT_ATTR_TIMEOUT,/* Associate timeout with this connection for
+* fine-grain timeout tuning. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 51080004677e..c4adbaeadda4 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -73,6 +74,7 @@ struct ovs_conntrack_info {
u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
+   char timeout[CTNL_TIMEOUT_NAME_MAX];
 #ifdef CONFIG_NF_NAT_NEEDED
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -1465,6 +1467,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
 #endif
[OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
+   [OVS_CT_ATTR_TIMEOUT] = { .minlen = 1,
+ .maxlen = CTNL_TIMEOUT_NAME_MAX },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1550,6 +1554,15 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
info->have_eventmask = true;
info->eventmask = nla_get_u32(a);
break;
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   case OVS_CT_ATTR_TIMEOUT:
+   memcpy(info->timeout, nla_data(a), nla_len(a));
+   if (!memchr(info->timeout, '\0', nla_len(a))) {
+   OVS_NLERR(log, "Invalid conntrack helper");
+   return -EINVAL;
+   }
+   break;
+#endif
 
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
@@ -1631,6 +1644,14 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return -ENOMEM;
}
+
+   if (ct_info.timeout[0]) {
+   if (nf_ct_set_timeout(net, ct_info.ct, family, key->ip.proto,
+ ct_info.timeout))
+   pr_info_ratelimited("Failed to associated timeout "
+   "policy `%s'\n", ct_info.timeout);
+   }
+
if (helper) {
err = ovs_ct_add_helper(&ct_info, helper, key, log);
if (err)
@@ -1751,6 +1772,10 @@ int ovs_ct_action_to_attr(const struct 
ovs_conntrack_info *ct_info,
if (ct_info->have_eventmask &&
nla_put_u32(skb, OVS_CT_ATTR_EVENTMASK, ct_info->eventmask))
return -EMSGSIZE;
+   if (ct_info->timeout[0]) {
+   if (nla_put_string(skb, OVS_CT_ATTR_TIMEOUT, ct_info->timeout))
+   return -EMSGSIZE;
+   }
 
 #ifdef CONFIG_NF_NAT_NEEDED
if (ct_info->nat && !ovs_ct_nat_to_attr(ct_info, skb))
@@ -1772,8 +1797,11 @@ static void __ovs_ct_free_action(struct 
ovs_conntrack_info *ct_info)
 {

[PATCH net-next v3 1/2] netfilter: Export nf_ct_{set,destroy}_timeout()

2019-03-25 Thread Yi-Hung Wei

This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso 
CC: Pravin Shelar 
Signed-off-by: Yi-Hung Wei 
---
v1-> v2: Export nf_ct_set_timeout().
v2-> v3: Fix build issue when CONFIG_NF_CONNTRACK_TIMEOUT is not set.
---
 include/net/netfilter/nf_conntrack_timeout.h | 15 +
 net/netfilter/nf_conntrack_timeout.c | 97 
 net/netfilter/xt_CT.c| 93 ++
 3 files changed, 118 insertions(+), 87 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_timeout.h 
b/include/net/netfilter/nf_conntrack_timeout.h
index 3394d75e1c80..00a8fbb2d735 100644
--- a/include/net/netfilter/nf_conntrack_timeout.h
+++ b/include/net/netfilter/nf_conntrack_timeout.h
@@ -88,6 +88,9 @@ static inline unsigned int *nf_ct_timeout_lookup(const struct 
nf_conn *ct)
 int nf_conntrack_timeout_init(void);
 void nf_conntrack_timeout_fini(void);
 void nf_ct_untimeout(struct net *net, struct nf_ct_timeout *timeout);
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct, u8 l3num, u8 l4num,
+ const char *timeout_name);
+void nf_ct_destroy_timeout(struct nf_conn *ct);
 #else
 static inline int nf_conntrack_timeout_init(void)
 {
@@ -98,6 +101,18 @@ static inline void nf_conntrack_timeout_fini(void)
 {
 return;
 }
+
+static inline int nf_ct_set_timeout(struct net *net, struct nf_conn *ct,
+   u8 l3num, u8 l4num,
+   const char *timeout_name)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void nf_ct_destroy_timeout(struct nf_conn *ct)
+{
+   return;
+}
 #endif /* CONFIG_NF_CONNTRACK_TIMEOUT */
 
 #ifdef CONFIG_NF_CONNTRACK_TIMEOUT
diff --git a/net/netfilter/nf_conntrack_timeout.c 
b/net/netfilter/nf_conntrack_timeout.c
index 91fbd183da2d..c2dee577548e 100644
--- a/net/netfilter/nf_conntrack_timeout.c
+++ b/net/netfilter/nf_conntrack_timeout.c
@@ -48,6 +48,103 @@ void nf_ct_untimeout(struct net *net, struct nf_ct_timeout 
*timeout)
 }
 EXPORT_SYMBOL_GPL(nf_ct_untimeout);
 
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+static void __nf_ct_timeout_put(struct nf_ct_timeout *timeout)
+{
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+   if (timeout_put)
+   timeout_put(timeout);
+}
+#endif
+
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct,
+ u8 l3num, u8 l4num, const char *timeout_name)
+{
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   typeof(nf_ct_timeout_find_get_hook) timeout_find_get;
+   struct nf_ct_timeout *timeout;
+   struct nf_conn_timeout *timeout_ext;
+   const char *errmsg = NULL;
+   int ret = 0;
+
+   rcu_read_lock();
+   timeout_find_get = rcu_dereference(nf_ct_timeout_find_get_hook);
+   if (!timeout_find_get) {
+   ret = -ENOENT;
+   errmsg = "Timeout policy base is empty";
+   goto out;
+   }
+
+   timeout = timeout_find_get(net, timeout_name);
+   if (!timeout) {
+   ret = -ENOENT;
+   pr_info_ratelimited("No such timeout policy \"%s\"\n",
+   timeout_name);
+   goto out;
+   }
+
+   if (timeout->l3num != l3num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 3, timeout->l3num);
+   goto err_put_timeout;
+   }
+   /* Make sure the timeout policy matches any existing protocol tracker,
+* otherwise default to generic.
+*/
+   if (timeout->l4proto->l4proto != l4num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 4, timeout->l4proto->l4proto);
+   goto err_put_timeout;
+   }
+   timeout_ext = nf_ct_timeout_ext_add(ct, timeout, GFP_ATOMIC);
+   if (!timeout_ext) {
+   ret = -ENOMEM;
+   goto err_put_timeout;
+   }
+
+   rcu_read_unlock();
+   return ret;
+
+err_put_timeout:
+   __nf_ct_timeout_put(timeout);
+out:
+   rcu_read_unlock();
+   if (errmsg)
+   pr_info_ratelimited("%s\n", errmsg);
+   return ret;
+#else
+   return -EOPNOTSUPP;
+#endif
+}
+EXPORT_SYMBOL_GPL(nf_ct_set_timeout);
+
+void nf_ct_dest

[PATCH 2/2] openvswitch: Add timeout support to ct action

2019-03-22 Thread Yi-Hung Wei

Add support for fine-grain timeout support to conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)

CC: Pravin Shelar 
Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h |  3 +++
 net/openvswitch/conntrack.c  | 30 +-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dbe0cbe4f1b7..00ec98836cf3 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -734,6 +734,7 @@ struct ovs_action_hash {
  * be received on NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups,
  * respectively.  Remaining bits control the changes for which an event is
  * delivered on the NFNLGRP_CONNTRACK_UPDATE group.
+ * @OVS_CT_ATTR_TIMEOUT: Variable length string defining conntrack timeout.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -746,6 +747,8 @@ enum ovs_ct_attr {
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
+   OVS_CT_ATTR_TIMEOUT,/* Associate timeout with this connection for
+* fine-grain timeout tuning. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1b6896896fff..ce2e148711de 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -73,6 +74,7 @@ struct ovs_conntrack_info {
u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
+   char timeout[CTNL_TIMEOUT_NAME_MAX];
 #ifdef CONFIG_NF_NAT_NEEDED
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -1465,6 +1467,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
 #endif
[OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
+   [OVS_CT_ATTR_TIMEOUT] = { .minlen = 1,
+ .maxlen = CTNL_TIMEOUT_NAME_MAX },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1550,6 +1554,15 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
info->have_eventmask = true;
info->eventmask = nla_get_u32(a);
break;
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   case OVS_CT_ATTR_TIMEOUT:
+   memcpy(info->timeout, nla_data(a), nla_len(a));
+   if (!memchr(info->timeout, '\0', nla_len(a))) {
+   OVS_NLERR(log, "Invalid conntrack helper");
+   return -EINVAL;
+   }
+   break;
+#endif
 
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
@@ -1631,6 +1644,14 @@ int ovs_ct_copy_action(struct net *net, const struct 
nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return -ENOMEM;
}
+
+   if (ct_info.timeout[0]) {
+   if (nf_ct_set_timeout(net, ct_info.ct, family, key->ip.proto,
+ ct_info.timeout))
+   pr_info_ratelimited("Failed to associated timeout "
+   "policy `%s'\n", ct_info.timeout);
+   }
+
if (helper) {
err = ovs_ct_add_helper(&ct_info, helper, key, log);
if (err)
@@ -1751,6 +1772,10 @@ int ovs_ct_action_to_attr(const struct 
ovs_conntrack_info *ct_info,
if (ct_info->have_eventmask &&
nla_put_u32(skb, OVS_CT_ATTR_EVENTMASK, ct_info->eventmask))
return -EMSGSIZE;
+   if (ct_info->timeout[0]) {
+   if (nla_put_string(skb, OVS_CT_ATTR_TIMEOUT, ct_info->timeout))
+   return -EMSGSIZE;
+   }
 
 #ifdef CONFIG_NF_NAT_NEEDED
if (ct_info->nat && !ovs_ct_nat_to_attr(ct_info, skb))
@@ -1772,8 +1797,11 @@ static void __ovs_ct_free_action(struct 
ovs_conntrack_info *ct_info)
 {
if (ct_info->helper)
nf_conntrack_helper_put(ct_info->helper);
-   if (ct_info->ct)
+   if (ct_info->ct) {

[PATCH 1/2] netfilter: Export nf_ct_{set,destroy}_timeout()

2019-03-22 Thread Yi-Hung Wei

This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_timeout.h |  3 +
 net/netfilter/nf_conntrack_timeout.c | 97 
 net/netfilter/xt_CT.c| 93 ++
 3 files changed, 106 insertions(+), 87 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_timeout.h 
b/include/net/netfilter/nf_conntrack_timeout.h
index 3394d75e1c80..738f96348a7a 100644
--- a/include/net/netfilter/nf_conntrack_timeout.h
+++ b/include/net/netfilter/nf_conntrack_timeout.h
@@ -105,4 +105,7 @@ extern struct nf_ct_timeout 
*(*nf_ct_timeout_find_get_hook)(struct net *net, con
 extern void (*nf_ct_timeout_put_hook)(struct nf_ct_timeout *timeout);
 #endif
 
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct, u8 l3num, u8 l4num,
+ const char *timeout_name);
+void nf_ct_destroy_timeout(struct nf_conn *ct);
 #endif /* _NF_CONNTRACK_TIMEOUT_H */
diff --git a/net/netfilter/nf_conntrack_timeout.c 
b/net/netfilter/nf_conntrack_timeout.c
index 91fbd183da2d..c2dee577548e 100644
--- a/net/netfilter/nf_conntrack_timeout.c
+++ b/net/netfilter/nf_conntrack_timeout.c
@@ -48,6 +48,103 @@ void nf_ct_untimeout(struct net *net, struct nf_ct_timeout 
*timeout)
 }
 EXPORT_SYMBOL_GPL(nf_ct_untimeout);
 
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+static void __nf_ct_timeout_put(struct nf_ct_timeout *timeout)
+{
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+   if (timeout_put)
+   timeout_put(timeout);
+}
+#endif
+
+int nf_ct_set_timeout(struct net *net, struct nf_conn *ct,
+ u8 l3num, u8 l4num, const char *timeout_name)
+{
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   typeof(nf_ct_timeout_find_get_hook) timeout_find_get;
+   struct nf_ct_timeout *timeout;
+   struct nf_conn_timeout *timeout_ext;
+   const char *errmsg = NULL;
+   int ret = 0;
+
+   rcu_read_lock();
+   timeout_find_get = rcu_dereference(nf_ct_timeout_find_get_hook);
+   if (!timeout_find_get) {
+   ret = -ENOENT;
+   errmsg = "Timeout policy base is empty";
+   goto out;
+   }
+
+   timeout = timeout_find_get(net, timeout_name);
+   if (!timeout) {
+   ret = -ENOENT;
+   pr_info_ratelimited("No such timeout policy \"%s\"\n",
+   timeout_name);
+   goto out;
+   }
+
+   if (timeout->l3num != l3num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 3, timeout->l3num);
+   goto err_put_timeout;
+   }
+   /* Make sure the timeout policy matches any existing protocol tracker,
+* otherwise default to generic.
+*/
+   if (timeout->l4proto->l4proto != l4num) {
+   ret = -EINVAL;
+   pr_info_ratelimited("Timeout policy `%s' can only be used by "
+   "L%d protocol number %d\n",
+   timeout_name, 4, timeout->l4proto->l4proto);
+   goto err_put_timeout;
+   }
+   timeout_ext = nf_ct_timeout_ext_add(ct, timeout, GFP_ATOMIC);
+   if (!timeout_ext) {
+   ret = -ENOMEM;
+   goto err_put_timeout;
+   }
+
+   rcu_read_unlock();
+   return ret;
+
+err_put_timeout:
+   __nf_ct_timeout_put(timeout);
+out:
+   rcu_read_unlock();
+   if (errmsg)
+   pr_info_ratelimited("%s\n", errmsg);
+   return ret;
+#else
+   return -EOPNOTSUPP;
+#endif
+}
+EXPORT_SYMBOL_GPL(nf_ct_set_timeout);
+
+void nf_ct_destroy_timeout(struct nf_conn *ct)
+{
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   struct nf_conn_timeout *timeout_ext;
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+
+   rcu_read_lock();
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+
+   if (timeout_put) {
+   timeout_ext = nf_ct_timeout_find(ct);
+   if (timeout_ext) {
+   timeout_put(timeout_ext->timeout);
+   RCU_INIT_POINTER(timeout_ext->timeout, NULL);
+   }
+   }
+   rcu_read_unlock();
+#endif
+}
+EXPORT_SYMBOL_GPL(nf_ct_destroy_timeout);
+
 static const struct nf_ct_ext_type timeout_extend = {
.len= sizeof(struct nf_conn_t

Re: [PATCH net-next 2/2] openvswitch: Add timeout support to ct action

2019-03-21 Thread Yi-Hung Wei

> > +static void ovs_ct_add_timeout(struct net *net, struct nf_conn *ct,
> > +  const char *timeout_name, u16 l3num, u8 
> > l4num)
> > +{
> This code looks very similar to xt_ct_set_timeout(), can you refactor
> it to avoid code duplication?

Thanks Prvain's feedback. I will export the set timeout function and
use that to avoid duplication.

> >  static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info 
> > *info,
> > @@ -1550,6 +1607,15 @@ static int parse_ct(const struct nlattr *attr, 
> > struct ovs_conntrack_info *info,
> > info->have_eventmask = true;
> > info->eventmask = nla_get_u32(a);
> > break;
> > +#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
> > +   case OVS_CT_ATTR_TIMEOUT:
> > +   memcpy(info->timeout, nla_data(a), nla_len(a));
> Before copying timeout, we need to check sizeof source string.
> 'nla_len(a)' needs to be less than CTNL_TIMEOUT_NAME_MAX.

There are some exiting checking in parse_ct() as below. That should
check the sizeof source string already.

nla_for_each_nested(a, attr, rem) {
int type = nla_type(a);
int maxlen;
int minlen;
..
maxlen = ovs_ct_attr_lens[type].maxlen;
minlen = ovs_ct_attr_lens[type].minlen;
if (nla_len(a) < minlen || nla_len(a) > maxlen) {
OVS_NLERR(log,
  "Conntrack attr type has unexpected
length (type=%d, length=%d, expected=%d)",
  type, nla_len(a), maxlen);
return -EINVAL;
}

I will respin v2 soon.

Thanks,

-Yi-Hung

[PATCH net-next 1/2] netfilter: Export nf_ct_destroy_timeout()

2019-03-20 Thread Yi-Hung Wei

This patch moves xt_ct_destroy_timeout from xt_CT.c to
nf_conntrack_timeout.c, renames it to nf_ct_destroy_timeout(), and
exports it.  It does not contain any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_timeout.h |  1 +
 net/netfilter/nf_conntrack_timeout.c | 21 +
 net/netfilter/xt_CT.c| 22 +-
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_timeout.h 
b/include/net/netfilter/nf_conntrack_timeout.h
index 3394d75e1c80..5ce62fa1e1e2 100644
--- a/include/net/netfilter/nf_conntrack_timeout.h
+++ b/include/net/netfilter/nf_conntrack_timeout.h
@@ -105,4 +105,5 @@ extern struct nf_ct_timeout 
*(*nf_ct_timeout_find_get_hook)(struct net *net, con
 extern void (*nf_ct_timeout_put_hook)(struct nf_ct_timeout *timeout);
 #endif
 
+void nf_ct_destroy_timeout(struct nf_conn *ct);
 #endif /* _NF_CONNTRACK_TIMEOUT_H */
diff --git a/net/netfilter/nf_conntrack_timeout.c 
b/net/netfilter/nf_conntrack_timeout.c
index 91fbd183da2d..11b7f58e7f88 100644
--- a/net/netfilter/nf_conntrack_timeout.c
+++ b/net/netfilter/nf_conntrack_timeout.c
@@ -48,6 +48,27 @@ void nf_ct_untimeout(struct net *net, struct nf_ct_timeout 
*timeout)
 }
 EXPORT_SYMBOL_GPL(nf_ct_untimeout);
 
+void nf_ct_destroy_timeout(struct nf_conn *ct)
+{
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   struct nf_conn_timeout *timeout_ext;
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+
+   rcu_read_lock();
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+
+   if (timeout_put) {
+   timeout_ext = nf_ct_timeout_find(ct);
+   if (timeout_ext) {
+   timeout_put(timeout_ext->timeout);
+   RCU_INIT_POINTER(timeout_ext->timeout, NULL);
+   }
+   }
+   rcu_read_unlock();
+#endif
+}
+EXPORT_SYMBOL_GPL(nf_ct_destroy_timeout);
+
 static const struct nf_ct_ext_type timeout_extend = {
.len= sizeof(struct nf_conn_timeout),
.align  = __alignof__(struct nf_conn_timeout),
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 0fa863f57575..14600ad731ce 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -328,26 +328,6 @@ static int xt_ct_tg_check_v2(const struct xt_tgchk_param 
*par)
return xt_ct_tg_check(par, par->targinfo);
 }
 
-static void xt_ct_destroy_timeout(struct nf_conn *ct)
-{
-#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
-   struct nf_conn_timeout *timeout_ext;
-   typeof(nf_ct_timeout_put_hook) timeout_put;
-
-   rcu_read_lock();
-   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
-
-   if (timeout_put) {
-   timeout_ext = nf_ct_timeout_find(ct);
-   if (timeout_ext) {
-   timeout_put(timeout_ext->timeout);
-   RCU_INIT_POINTER(timeout_ext->timeout, NULL);
-   }
-   }
-   rcu_read_unlock();
-#endif
-}
-
 static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par,
 struct xt_ct_target_info_v1 *info)
 {
@@ -361,7 +341,7 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param 
*par,
 
nf_ct_netns_put(par->net, par->family);
 
-   xt_ct_destroy_timeout(ct);
+   nf_ct_destroy_timeout(ct);
nf_ct_put(info->ct);
}
 }
-- 
2.7.4

[PATCH net-next 2/2] openvswitch: Add timeout support to ct action

2019-03-20 Thread Yi-Hung Wei

Add support for fine-grain timeout support to conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)

CC: Pravin Shelar 
Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h |  3 ++
 net/openvswitch/conntrack.c  | 81 +++-
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index dbe0cbe4f1b7..9bccc6b9ed3d 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -734,6 +734,7 @@ struct ovs_action_hash {
  * be received on NFNLGRP_CONNTRACK_NEW and NFNLGRP_CONNTRACK_DESTROY groups,
  * respectively.  Remaining bits control the changes for which an event is
  * delivered on the NFNLGRP_CONNTRACK_UPDATE group.
+ * @OVS_CT_ATTR_TIMEOUT: Variable length string defining conntrack timeout.
  */
 enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
@@ -746,6 +747,8 @@ enum ovs_ct_attr {
OVS_CT_ATTR_NAT,/* Nested OVS_NAT_ATTR_* */
OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
+   OVS_CT_ATTR_TIMEOUT,/* Associate timeout with this connection for
+  fine-grain timeout tuning. */
__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1b6896896fff..10a2c73f22f2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -73,6 +74,7 @@ struct ovs_conntrack_info {
u32 eventmask;  /* Mask of 1 << IPCT_*. */
struct md_mark mark;
struct md_labels labels;
+   char timeout[CTNL_TIMEOUT_NAME_MAX];
 #ifdef CONFIG_NF_NAT_NEEDED
struct nf_nat_range2 range;  /* Only present for SRC NAT and DST NAT. */
 #endif
@@ -1139,6 +1141,59 @@ static int ovs_ct_check_limit(struct net *net,
 }
 #endif
 
+static void ovs_ct_add_timeout(struct net *net, struct nf_conn *ct,
+  const char *timeout_name, u16 l3num, u8 l4num)
+{
+#ifdef CONFIG_NF_CONNTRACK_TIMEOUT
+   typeof(nf_ct_timeout_find_get_hook) timeout_find_get;
+   typeof(nf_ct_timeout_put_hook) timeout_put;
+   struct nf_ct_timeout *timeout;
+   struct nf_conn_timeout *timeout_ext;
+
+   rcu_read_lock();
+   timeout_find_get = rcu_dereference(nf_ct_timeout_find_get_hook);
+   if (!timeout_find_get) {
+   net_info_ratelimited("Timeout policy base is empty");
+   goto out;
+   }
+
+   timeout = timeout_find_get(net, timeout_name);
+   if (!timeout) {
+   net_info_ratelimited("No such timeout policy \"%s\"\n",
+timeout_name);
+   goto out;
+   }
+
+   if (timeout->l3num != l3num) {
+   net_info_ratelimited("Timeout policy `%s' can only be used by "
+"L3 protocol number %d\n", timeout_name,
+timeout->l3num);
+   goto err_put_timeout;
+   }
+
+   if (timeout->l4proto->l4proto != l4num) {
+   net_info_ratelimited("Timeout policy `%s' can only be used by "
+"L4 protocol number %d\n", timeout_name,
+timeout->l4proto->l4proto);
+   goto err_put_timeout;
+   }
+
+   timeout_ext = nf_ct_timeout_ext_add(ct, timeout, GFP_ATOMIC);
+   if (!timeout_ext)
+   goto err_put_timeout;
+
+   goto out;
+
+err_put_timeout:
+   timeout_put = rcu_dereference(nf_ct_timeout_put_hook);
+   if (timeout_put)
+   timeout_put(timeout);
+out:
+   rcu_read_unlock();
+   return;
+#endif
+}
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 const struct ovs_conntrack_info *info,
@@ -1465,6 +1520,8 @@ static const struct ovs_ct_len_tbl 
ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
 #endif
[OVS_CT_ATTR_EVENTMASK] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
+   [OVS_CT_ATTR_TIMEOUT] = { .minlen = 1,
+ .maxlen = CTNL_TIMEOUT_NAME_MAX },
 };
 
 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
@@ -1550,6 +1607,15 @@ static int parse_ct(const st

Re: [ovs-dev] [PATCH net-next V2 1/1] openvswitch: Declare ovs key structures using macros

2019-02-04 Thread Yi-Hung Wei

On Sun, Feb 3, 2019 at 1:13 AM Eli Britstein  wrote:
>
> Declare ovs key structures using macros as a pre-step towards to
> enable retrieving fields information, as a work done in proposed
> commit in the OVS tree https://patchwork.ozlabs.org/patch/1023406/
> ("odp-util: Do not rewrite fields with the same values as matched"),
> with no functional change.
>
> Signed-off-by: Eli Britstein 
> Reviewed-by: Roi Dayan 
> ---
>  include/uapi/linux/openvswitch.h | 102 
> ++-
>  1 file changed, 69 insertions(+), 33 deletions(-)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index dbe0cbe4f1b7..dc8246f871fd 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -387,73 +387,109 @@ enum ovs_frag_type {
>
>  #define OVS_FRAG_TYPE_MAX (__OVS_FRAG_TYPE_MAX - 1)
>
> +#define OVS_KEY_FIELD_ARR(type, name, elements) type name[elements];
> +#define OVS_KEY_FIELD(type, name) type name;
..
> +#define OVS_KEY_IPV6_FIELDS \
> +OVS_KEY_FIELD_ARR(__be32, ipv6_src, 4) \
> +OVS_KEY_FIELD_ARR(__be32, ipv6_dst, 4) \
> +OVS_KEY_FIELD(__be32, ipv6_label /* 20-bits in least-significant bits. 
> */) \
> +OVS_KEY_FIELD(__u8, ipv6_proto) \
> +OVS_KEY_FIELD(__u8, ipv6_tclass) \
> +OVS_KEY_FIELD(__u8, ipv6_hlimit) \
> +OVS_KEY_FIELD(__u8, ipv6_frag /* One of OVS_FRAG_TYPE_*. */)
> +
>  struct ovs_key_ipv6 {
> -   __be32 ipv6_src[4];
> -   __be32 ipv6_dst[4];
> -   __be32 ipv6_label;  /* 20-bits in least-significant bits. */
> -   __u8   ipv6_proto;
> -   __u8   ipv6_tclass;
> -   __u8   ipv6_hlimit;
> -   __u8   ipv6_frag;   /* One of OVS_FRAG_TYPE_*. */
> +OVS_KEY_IPV6_FIELDS
>  };

Hi Eli,

Thanks for the patch.  In my personal opinion, I feel this patch makes
the header file harder to read.

For example, to see how 'struct ovs_key_ipv6' is defined, now we need
to trace how OVS_KEY_IPV6_FIELDS is defined, and how OVS_KEY_FIELD_ARR
and OVS_KEY_FIELD defined.  I think it makes the header file to be
more complicated.

There are also some discussion on ovs-dev mailing list about this
patch: https://patchwork.ozlabs.org/cover/1023404/

Thanks,

-Yi-Hung

[PATCH net] openvswitch: Fix IPv6 later frags parsing

2019-01-03 Thread Yi-Hung Wei

The previous commit fa642f08839b
("openvswitch: Derive IP protocol number for IPv6 later frags")
introduces IP protocol number parsing for IPv6 later frags that can mess
up the network header length calculation logic, i.e. nh_len < 0.
However, the network header length calculation is mainly for deriving
the transport layer header in the key extraction process which the later
fragment does not apply.

Therefore, this commit skips the network header length calculation to
fix the issue.

Reported-by: Chris Mi 
Reported-by: Greg Rose 
Fixes: fa642f08839b ("openvswitch: Derive IP protocol number for IPv6 later 
frags")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 57e07768c9d1..f54cf17ef7a8 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -276,10 +276,12 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
sw_flow_key *key)
 
nexthdr = ipv6_find_hdr(skb, &payload_ofs, -1, &frag_off, &flags);
if (flags & IP6_FH_F_FRAG) {
-   if (frag_off)
+   if (frag_off) {
key->ip.frag = OVS_FRAG_TYPE_LATER;
-   else
-   key->ip.frag = OVS_FRAG_TYPE_FIRST;
+   key->ip.proto = nexthdr;
+   return 0;
+   }
+   key->ip.frag = OVS_FRAG_TYPE_FIRST;
} else {
key->ip.frag = OVS_FRAG_TYPE_NONE;
}
-- 
2.7.4

[PATCH net-next v2] openvswitch: Derive IP protocol number for IPv6 later frags

2018-09-04 Thread Yi-Hung Wei

Currently, OVS only parses the IP protocol number for the first
IPv6 fragment, but sets the IP protocol number for the later fragments
to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
number for the IPV6 later frags so that we can match that.

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 56b8e7167790..35966da84769 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -254,21 +254,18 @@ static bool icmphdr_ok(struct sk_buff *skb)
 
 static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key)
 {
+   unsigned short frag_off;
+   unsigned int payload_ofs = 0;
unsigned int nh_ofs = skb_network_offset(skb);
unsigned int nh_len;
-   int payload_ofs;
struct ipv6hdr *nh;
-   uint8_t nexthdr;
-   __be16 frag_off;
-   int err;
+   int err, nexthdr, flags = 0;
 
err = check_header(skb, nh_ofs + sizeof(*nh));
if (unlikely(err))
return err;
 
nh = ipv6_hdr(skb);
-   nexthdr = nh->nexthdr;
-   payload_ofs = (u8 *)(nh + 1) - skb->data;
 
key->ip.proto = NEXTHDR_NONE;
key->ip.tos = ipv6_get_dsfield(nh);
@@ -277,10 +274,9 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
sw_flow_key *key)
key->ipv6.addr.src = nh->saddr;
key->ipv6.addr.dst = nh->daddr;
 
-   payload_ofs = ipv6_skip_exthdr(skb, payload_ofs, &nexthdr, &frag_off);
-
-   if (frag_off) {
-   if (frag_off & htons(~0x7))
+   nexthdr = ipv6_find_hdr(skb, &payload_ofs, -1, &frag_off, &flags);
+   if (flags & IP6_FH_F_FRAG) {
+   if (frag_off)
key->ip.frag = OVS_FRAG_TYPE_LATER;
else
key->ip.frag = OVS_FRAG_TYPE_FIRST;
@@ -288,11 +284,11 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
sw_flow_key *key)
key->ip.frag = OVS_FRAG_TYPE_NONE;
}
 
-   /* Delayed handling of error in ipv6_skip_exthdr() as it
-* always sets frag_off to a valid value which may be
+   /* Delayed handling of error in ipv6_find_hdr() as it
+* always sets flags and frag_off to a valid value which may be
 * used to set key->ip.frag above.
 */
-   if (unlikely(payload_ofs < 0))
+   if (unlikely(nexthdr < 0))
return -EPROTO;
 
nh_len = payload_ofs - nh_ofs;
-- 
2.7.4

Re: [PATCH net-next] openvswitch: Derive IP protocol number for IPv6 later frags

2018-08-13 Thread Yi-Hung Wei

On Mon, Aug 13, 2018 at 10:48 AM William Tu  wrote:
> > > --- a/net/openvswitch/flow.c
> > > +++ b/net/openvswitch/flow.c
> > > @@ -297,7 +297,13 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
> > > sw_flow_key *key)
> > >
> > > nh_len = payload_ofs - nh_ofs;
> > > skb_set_transport_header(skb, nh_ofs + nh_len);
> > > -   key->ip.proto = nexthdr;
> > > +   if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
> > > +   unsigned int offset = 0;
>
> How about we start the 2nd time parsing from
> unsigned int offset = payload_ofs;
>
> > > +
> > > +   key->ip.proto = ipv6_find_hdr(skb, &offset, -1, NULL, 
> > > NULL);
>
> Then we only find the last header from previous parsed offset.
>
> William
>
> > > +   } else {
> > > +   key->ip.proto = nexthdr;
> > > +   }
> > parsing ipv6 ipv6_skip_exthdr() is called to find fragment hdr and
> > then this patch calls ipv6_find_hdr() to find next protocol. I think
> > we could call ipv6_find_hdr() to get fragment type and next hdr, that
> > would save parsing same packet twice in some cases.
> >
> > Other option would be calling ipv6_find_hdr() after setting 
> > OVS_FRAG_TYPE_LATER.

Thanks Pravin and William's feedback.

After looking into ipv6_find_hdr() more closely, I think we can just
call ipv6_find_hdr() once and derive everything we need.

I will submit the new patch once net-next is open.

Thanks,

-Yi-Hung

[PATCH net-next] openvswitch: Derive IP protocol number for IPv6 later frags

2018-08-10 Thread Yi-Hung Wei

Currently, OVS only parses the IP protocol number for the first
IPv6 fragment, but sets the IP protocol number for the later fragments
to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
number for the IPV6 later frags so that we can match that.

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 56b8e7167790..3d654c4f71be 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -297,7 +297,13 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct 
sw_flow_key *key)
 
nh_len = payload_ofs - nh_ofs;
skb_set_transport_header(skb, nh_ofs + nh_len);
-   key->ip.proto = nexthdr;
+   if (key->ip.frag == OVS_FRAG_TYPE_LATER) {
+   unsigned int offset = 0;
+
+   key->ip.proto = ipv6_find_hdr(skb, &offset, -1, NULL, NULL);
+   } else {
+   key->ip.proto = nexthdr;
+   }
return nh_len;
 }
 
-- 
2.7.4

[PATCH net-next v5 0/2] openvswitch: Support conntrack zone limit

2018-05-24 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v4->v5:
  - Addresses comments from Parvin that include log error msg in
ovs_ct_limit_init(), handle deletion for default limit, and
add a common helper for get zone limit.
  - Rebases to master.

v3->v4:
  - Addresses comments from Parvin that include simplify netlink API,
and remove unncessary RCU lockings.
  - Rebases to master.

v2->v3:
  - Addresses comments from Parvin that include using static keys to check
if ovs_ct_limit features is used, only check ct_limit when a ct entry
is unconfirmed, and reports rate limited warning messages when the ct
limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  28 ++
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 551 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   3 +
 6 files changed, 595 insertions(+), 6 deletions(-)

-- 
2.7.4

[PATCH net-next v5 1/2] openvswitch: Add conntrack limit netlink definition

2018-05-24 Thread Yi-Hung Wei

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 28 
 1 file changed, 28 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..863aabaa5cc9 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,32 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_ZONE_LIMIT,   /* Nested struct ovs_zone_limit. */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+#define OVS_ZONE_LIMIT_DEFAULT_ZONE -1
+
+struct ovs_zone_limit {
+   int zone_id;
+   __u32 limit;
+   __u32 count;
+};
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

[PATCH net-next v5 2/2] openvswitch: Support conntrack zone limit

2018-05-24 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 551 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   3 +
 5 files changed, 567 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..284aca2a252d 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data;
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct 
ovs_key_ct_labels *labels)
return false;
 }
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+struct ovs_ct_limit *new_ct_limit)
+{
+   struct ovs_ct_limit *ct_limit;
+   struct hlist_head *head;
+
+   head = ct_limit_hash_bucket(info, new_ct_limit->zone

[PATCH net-next v4 0/2] openvswitch: Support conntrack zone limit

2018-05-21 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v3->v4:
  - Addresses comments from Parvin that include simplify netlink API,
and remove unncessary RCU lockings.
  - Rebases to master.

v2->v3:
  - Addresses comments from Parvin that include using static keys to check
if ovs_ct_limit features is used, only check ct_limit when a ct entry
is unconfirmed, and reports rate limited warning messages when the ct
limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.


Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  26 ++
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 541 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   3 +
 6 files changed, 583 insertions(+), 6 deletions(-)

-- 
2.7.4

[PATCH net-next v4 1/2] openvswitch: Add conntrack limit netlink definition

2018-05-21 Thread Yi-Hung Wei

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..d8da2b7591f5 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,30 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_ZONE_LIMIT,   /* Nested struct ovs_zone_limit. */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+struct ovs_zone_limit {
+   int zone_id;
+   __u32 limit;
+   __u32 count;
+};
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

[PATCH net-next v4 2/2] openvswitch: Support conntrack zone limit

2018-05-21 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 541 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   3 +
 5 files changed, 557 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..e8bb91420ca9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct 
ovs_key_ct_labels *labels)
return false;
 }
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+struct ovs_ct_limit *new_ct_limit)
+{
+   struct ovs_ct_limit *ct_limit;
+   struct hlist_head *head;
+
+   head = ct_limit_hash_bucket(info, new_ct_limit->zone

[PATCH net-next v3 2/2] openvswitch: Support conntrack zone limit

2018-04-30 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 508 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 522 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..8234964889d9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +79,39 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+   ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+   [OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1072,94 @@ static bool labels_nonzero(const struct 
ovs_key_ct_labels *labels)
return false;
 }
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info

[PATCH net-next v3 0/2] openvswitch: Support conntrack zone limit

2018-04-30 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v2->v3:
  - Addresses comments from Parvin that include using static keys to check
if ovs_ct_limit features is used, only check ct_limit when a ct entry
is unconfirmed, and reports rate limited warning messages when the ct
limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.


Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 508 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   1 +
 6 files changed, 584 insertions(+), 6 deletions(-)

-- 
2.7.4

[PATCH net-next v3 1/2] openvswitch: Add conntrack limit netlink definition

2018-04-30 Thread Yi-Hung Wei

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 62 
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   __OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
+* for all zones. */
+   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
+   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
+* entries allowed in the
+* corresponding zone. */
+   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
+* entries in the corresponding
+* zone. */
+   __OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

Re: [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit

2018-04-25 Thread Yi-Hung Wei

>> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
>> +#define OVS_CT_LIMIT_UNLIMITED 0
>> +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
>> +#define CT_LIMIT_HASH_BUCKETS 512
>> +
> Can you use static key when the limit is not set.
> This would avoid overhead in datapath when these limits are not used.
>
Thanks Parvin for the review.  I'm not sure about the 'static key'
part, are you suggesting that say if we can have a flag to check if no
one has ever set the ct_limit?   If the ct_limit feature is not used
so far, then instead of look up the hash table for the zone limit, we
just return the default unlimited value.  So that we can avoid the
overhead of checking ct_limit.

>> +struct ovs_ct_limit {
>> +   /* Elements in ovs_ct_limit_info->limits hash table */
>> +   struct hlist_node hlist_node;
>> +   struct rcu_head rcu;
>> +   u16 zone;
>> +   u32 limit;
>> +};
>> +
> ...


>>  /* Lookup connection and confirm if unconfirmed. */
>>  static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>>  const struct ovs_conntrack_info *info,
>> @@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct 
>> sw_flow_key *key,
>> if (!ct)
>> return 0;
>>
>> +#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
>> +   err = ovs_ct_check_limit(net, info,
>> +&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
>> +   if (err)
>> +   return err;
>> +#endif
>> +
>
> This could be checked during flow install time, so that only permitted
> flows would have 'ct commit' action, we can avoid per packet cost
> checking the limit.
It seems to me that it would be hard to check the # of connections of
a flow in the flow installation stage.  For example, if we have a
datapath flow that performs “ct commit” action on all UDP traffic from
in_port 1, then it could be various combinations of 5-tuple that form
various # of connections.  Therefore, it would be hard to do the
admission control there.


> returning error code form ovs_ct_commit() is lost in datapath and it
> would be hard to debug packet lost in case of the limit is reached. So
> another advantage of checking the limit in flow install be better
> traceability. datapath would return error to usespace and it can log
> the error code.
Thanks for pointing out the error code from ovs_ct_commit() will be
lost in the datapath.  In this case, shall we consider to report the
packet drop by some rate_limit logging such as net_warn_ratelimited()
or net_info_ratelimited()?

Thanks,

-Yi-Hung

Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-24 Thread Yi-Hung Wei

On Tue, Apr 24, 2018 at 10:42 AM, David Miller  wrote:
> From: Pravin Shelar 
> Date: Mon, 23 Apr 2018 23:34:48 -0700
>
>> OK. Thanks for the info.
>
> So, ACK, Reviewed-by, etc.? :-)
>

Parvin provides feedback in a previous email.  I will address them and
send out v3.

Thanks,

-Yi-Hung

Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-23 Thread Yi-Hung Wei

On Mon, Apr 23, 2018 at 1:10 PM, Pravin Shelar  wrote:
> On Mon, Apr 23, 2018 at 6:39 AM, David Miller  wrote:
>> From: Yi-Hung Wei 
>> Date: Tue, 17 Apr 2018 17:30:27 -0700
>>
>>> Currently, nf_conntrack_max is used to limit the maximum number of
>>> conntrack entries in the conntrack table for every network namespace.
>>> For the VMs and containers that reside in the same namespace,
>>> they share the same conntrack table, and the total # of conntrack entries
>>> for all the VMs and containers are limited by nf_conntrack_max.  In this
>>> case, if one of the VM/container abuses the usage the conntrack entries,
>>> it blocks the others from committing valid conntrack entries into the
>>> conntrack table.  Even if we can possibly put the VM in different network
>>> namespace, the current nf_conntrack_max configuration is kind of rigid
>>> that we cannot limit different VM/container to have different # conntrack
>>> entries.
>>>
>
> Hi
> This looks like general problem related to nf zone usage limit, Did
> you considered changing nf-conntrack to have a per zone limit, so that
> all users of nf-filter can use it. I prefer this to adding a wrapper
> in OVS nf-filter layer.
>
> Thanks,
> Pravin.
>

Hi Prvain,

Thanks for your comment.  Originally, I was thinking to add this
feature in nf_conntrack and had some discussion with Florian.  It
turns out that iptables and nft have their own way to keep track of
the connection limits, and it sounds reasonable to share the backend
that counts the number of connections, but each module can enforce the
connection limit in their own way.  Therefore, Florian helped to pull
out the common backend to nf_conncount in the following commit. The
nf_conncount then can be used by xtables, nft, and ovs.

commit 625c556118f3c2fd28bb8ef6da18c53bd4037be4
Author: Florian Westphal 
Date:   Sat Dec 9 21:01:08 2017 +0100

netfilter: connlimit: split xt_connlimit into front and backend

This allows to reuse xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key, this allows limiting by any nft-supported key, including
concatenations.  For xt_connlimit, pass in the zone and the ip/ipv6 addres.

Basically, to achieve conntrack zone limit in OVS.  We need the
following 3 parts.
1. Count the number of connections (this is provided by netfilter's
nf_conncount backend)
2. Keep track of the connection limits of zones, and check if it
exceeds the limit.
3. An API for userspace to set/delete/get the conntrack zone limit.

This patch series implements item 2 and 3, and it reuses the
nf_conncount from netfiler for the first part.

Thanks,

-Yi-Hung

Re: [PATCH net-next 2/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei

> s/to commit/from committing/
> s/entry/entries/

Thanks, will fix that in both patches in v2.


> I think this is a great idea but I suggest porting to the iproute2 package
> so everyone can use it.  Then git rid of the OVS specific prefixes.
> Presuming of course that the conntrack connection
> limit backend works there as well I guess.  If it doesn't, then I'd suggest
> extending
> it.  This is a nice feature for all users in my opinion and then OVS
> can take advantage of it as well.

Thanks for the comment.  And yes, I think currently, iptables’s
connlimit extension does support limiting the # of connections.  Users
need to configure the zone properly, and the iptable’s connlimit
extension is using netfilter's nf_conncount backend already.

The main goal for this patch is to utilize netfilter backend
(nf_conncount) to count and limit the number of connections. OVS needs
the proposed OVS_CT_LIMIT netlink API and the corresponding booking
data structure because the current nf_conncount backend only counts
the # of connections, but it does not keep track of the connection
limit in nf_conncount.

Thanks,

-Yi-Hung

[PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 498 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   1 +
 6 files changed, 574 insertions(+), 6 deletions(-)

-- 
2.7.4

[PATCH net-next v2 1/2] openvswitch: Add conntrack limit netlink definition

2018-04-17 Thread Yi-Hung Wei

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 62 
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   __OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
+* for all zones. */
+   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
+   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
+* entries allowed in the
+* corresponding zone. */
+   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
+* entries in the corresponding
+* zone. */
+   __OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

[PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit

2018-04-17 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 498 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 512 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..d09b572f72b4 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -17,7 +17,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +78,38 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+   ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+   [OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1070,94 @@ static bool labels_nonzero(const struct 
ovs_key_ct_labels *labels)
return false;
 }
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return &info->limits[zone & (CT_LIMIT_HASH_

[PATCH net-next 1/2] openvswitch: Add conntrack limit netlink definition

2018-04-16 Thread Yi-Hung Wei

Define netlink messages and attributes to support user kernel
communication that using conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 62 
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   __OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
+* for all zones. */
+   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
+   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
+* entries allowed in the
+* corresponding zone. */
+   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
+* entries in the corresponding
+* zone. */
+   __OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

[PATCH net-next 2/2] openvswitch: Support conntrack zone limit

2018-04-16 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others to commit valid conntrack entry into the conntrack
table.  Even if we can possibly put the VM in different network namespace,
the current nf_conntrack_max configuration is kind of rigid that we cannot
limit different VM/container to have different # conntrack entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 497 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 511 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || 
NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..2f51da91d056 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -17,7 +17,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +78,38 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+   ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+   [OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1070,94 @@ static bool labels_nonzero(const struct 
ovs_key_ct_labels *labels)
return false;
 }
 
+#ifIS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)]

[PATCH net-next 0/2] openvswitch: Support conntrack zone limit

2018-04-16 Thread Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others to commit valid conntrack entry into the conntrack
table.  Even if we can possibly put the VM in different network namespace,
the current nf_conntrack_max configuration is kind of rigid that we cannot
limit different VM/container to have different # conntrack entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
scecond patch provides the implementation.


Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 497 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   1 +
 6 files changed, 573 insertions(+), 6 deletions(-)

-- 
2.7.4

[PATCH net v2] openvswitch: Fix ovs_flow_key_update()

2017-03-30 Thread Yi-Hung Wei

ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f19
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.

Fixes: 329f45bc4f19 ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 9d4bb8eb63f2..3f76cb765e5b 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -527,7 +527,7 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
 
/* Link layer. */
clear_vlan(key);
-   if (key->mac_proto == MAC_PROTO_NONE) {
+   if (ovs_key_mac_proto(key) == MAC_PROTO_NONE) {
if (unlikely(eth_type_vlan(skb->protocol)))
return -EINVAL;
 
@@ -745,7 +745,13 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
 
 int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
 {
-   return key_extract(skb, key);
+   int res;
+
+   res = key_extract(skb, key);
+   if (!res)
+   key->mac_proto &= ~SW_FLOW_KEY_INVALID;
+
+   return res;
 }
 
 static int key_extract_mac_proto(struct sk_buff *skb)
-- 
2.7.4

Re: [PATCH net] openvswitch: Fix ovs_flow_key_update()

2017-03-30 Thread Yi-Hung Wei

On Thu, Mar 30, 2017 at 6:22 AM, Jiri Benc  wrote:
> On Wed, 29 Mar 2017 17:14:10 -0700, Yi-Hung Wei wrote:
>> ovs_flow_key_update() is called when the flow key is invalid, and it is
>> used to update and revalidate the flow key. Commit 329f45bc4f19
>> ("openvswitch: add mac_proto field to the flow key") introduces mac_proto
>> field to flow key and use it to determine whether the flow key is valid.
>> However, the commit does not update the code path in ovs_flow_key_update()
>> to revalidate the flow key which may cause BUG_ON() on execute_recirc().
>> This patch addresses the aforementioned issue.
>
> Good catch.
>
>>  int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
>>  {
>> + int res;
>> +
>> + res = key_extract_mac_proto(skb);
>> + if (res < 0)
>> + return res;
>> + key->mac_proto = res;
>> +
>>   return key_extract(skb, key);
>>  }
>
> But this should just reset the SW_FLOW_KEY_INVALID flag, there's no
> need to recompute mac_proto.
>
> Something like this:
>
>  int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
>  {
> -   return key_extract(skb, key);
> +   int res;
> +
> +   res = key_extract(skb, key);
> +   if (!res)
> +   key->mac_proto &= ~SW_FLOW_KEY_INVALID;
> +   return res;
>  }
Hi Jiri,

One case that I worry is that key_extract() currently relies on mac_proto to
decide whether to process the link layer.  So if we update key->mac_proto
after key_extract(), wouldn't we run into a problem like the following?

If we invalidate a flow key of a L3 packet, the flow's mac_proto is like this
(MAC_PROTO_NONE | SW_FLOW_KEY_INVALID), then key_extract() will
process the link layer of this L3 packet since mac_proto !=MAC_PROTO_NONE?

In this case, shall we update key_extract() like this
static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)

/* Link layer. */
clear_vlan(key);
-   if (key->mac_proto == MAC_PROTO_NONE) {
+   if (key->mac_proto & MAC_PROTO_NONE) {
if (unlikely(eth_type_vlan(skb->protocol)))
return -EINVAL;

Thanks,

-Yi-Hung

>
> Thanks,
>
>  Jiri

[PATCH net] openvswitch: Fix ovs_flow_key_update()

2017-03-29 Thread Yi-Hung Wei

ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f19
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.

Fixes: 329f45bc4f19 ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/flow.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 9d4bb8eb63f2..bb33f5341294 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -53,6 +53,8 @@
 #include "flow_netlink.h"
 #include "vport.h"
 
+static int key_extract_mac_proto(struct sk_buff *skb);
+
 u64 ovs_flow_used_time(unsigned long flow_jiffies)
 {
struct timespec cur_ts;
@@ -745,6 +747,13 @@ static int key_extract(struct sk_buff *skb, struct 
sw_flow_key *key)
 
 int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key)
 {
+   int res;
+
+   res = key_extract_mac_proto(skb);
+   if (res < 0)
+   return res;
+   key->mac_proto = res;
+
return key_extract(skb, key);
 }
 
-- 
2.7.4

49 matches

Mail list logo