Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread Roman Kapl

On 02/21/2018 09:42 AM, Phil Sutter wrote:

Hi Roman,

On Wed, Feb 21, 2018 at 09:38:52AM +0100, Roman Kapl wrote:

On 02/21/2018 08:45 AM, Phil Sutter wrote:

On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

I'm curious, what does it lead to? :)

Thanks, Phil

tc will dump all filters up to the big one, then it will stop silently.
So it will seem as if you have fewer filters.

The new behavior is the same, but tc will at least print the EMSGSIZE
error. It does not handle it in any other way.

I got that, yes. Though your commit message stops mid-sentence and I
wondered what the dynamic buffer allocation "in turn leads to".

Thanks, Phil


Aaah... It leads to "smaller SKBs in kernel". I've lost that line in v2.

Should I send a v3 or something, David?

Sorry, Roman Kapl
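The truncation semantics discussed in this thread can be sketched with a small userspace model in C. The buffer capacity, item sizes, and function names below are made up for illustration; only the "no progress means the error is real" rule mirrors the patch.

```c
#include <assert.h>
#include <errno.h>

/* Toy userspace model of the netlink dump loop. Each dump pass fills a
 * fixed-size buffer; an item that does not fit is retried in the next
 * pass with a fresh buffer, unless it is the very first item of the
 * pass (then no buffer will ever fit it and the error is real). */

#define BUF_CAP 100

/* Returns the number of bytes placed into this pass (storing the item
 * count in *consumed), or -EMSGSIZE if the first item already exceeds
 * the buffer capacity. */
int dump_pass(const int *sizes, int n, int start, int *consumed)
{
	int used = 0, i;

	for (i = start; i < n; i++) {
		if (used + sizes[i] > BUF_CAP) {
			if (used == 0)	/* no progress: the error is real */
				return -EMSGSIZE;
			break;		/* retry this item in the next pass */
		}
		used += sizes[i];
	}
	*consumed = i - start;
	return used;
}
```

With this model, a 150-byte item in a 100-byte buffer yields -EMSGSIZE on the first pass, while two 60-byte items simply get split across two passes, which is the behavior the patch makes visible.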



Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-21 Thread Roman Kapl

On 02/21/2018 08:45 AM, Phil Sutter wrote:

Hi Roman,

On Mon, Feb 19, 2018 at 09:32:51PM +0100, Roman Kapl wrote:

So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -EMSGSIZE when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get another chance
with more room.

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

I'm curious, what does it lead to? :)

Thanks, Phil


tc will dump all filters up to the big one, then it will stop silently.
So it will seem as if you have fewer filters.


The new behavior is the same, but tc will at least print the EMSGSIZE
error. It does not handle it in any other way.




[PATCH v2] sched: report if filter is too large to dump

2018-02-19 Thread Roman Kapl
So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -EMSGSIZE when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get another chance
with more room.

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
v1 -> v2: added the "progress" comment, fixed error name in commit message.

I've looked at other dumpers in rtnetlink; there are various ways to handle
this. For example, rtnl_stats_dump has:
WARN_ON((err == -EMSGSIZE) && (skb->len == 0));
rtnl_dump_ifinfo has the same logic I am proposing:
if (err < 0) {
        if (skb->len)
                goto out;
        goto out_err;
}
Other functions handle the error in the "wrong" way (= what we currently do).
Although it might be OK if there is no array in what they return. I have not
tested the behavior, since the only way I found was to have a device with many
VFs.

 net/sched/cls_api.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index f21610c5da1a..92e9308bb920 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1399,13 +1399,18 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
nla_get_u32(tca[TCA_CHAIN]) != chain->index)
continue;
if (!tcf_chain_dump(chain, q, parent, skb, cb,
-   index_start, &index))
+   index_start, &index)) {
+   err = -EMSGSIZE;
break;
+   }
}
 
cb->args[0] = index;
 
 out:
+   /* If we made no progress, the error (EMSGSIZE) is real */
+   if (skb->len == 0 && err)
+   return err;
return skb->len;
 }
 
-- 
2.16.1



[PATCH] net: sched: report if filter is too large to dump

2018-02-18 Thread Roman Kapl
So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -ENOMSG when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get another chance
with more room.

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to
smaller SKBs in kernel.

Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
 net/sched/cls_api.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index f21610c5da1a..b5771a586c2d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1399,13 +1399,17 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
nla_get_u32(tca[TCA_CHAIN]) != chain->index)
continue;
if (!tcf_chain_dump(chain, q, parent, skb, cb,
-   index_start, &index))
+   index_start, &index)) {
+   err = -EMSGSIZE;
break;
+   }
}
 
cb->args[0] = index;
 
 out:
+   if (skb->len == 0 && err)
+   return err;
return skb->len;
 }
 
-- 
2.16.1



Re: [PATCH] net: make sure skb_dst is valid before using

2018-01-24 Thread Roman Kapl

On 01/24/2018 10:16 AM, Xin Long wrote:

On Wed, Jan 24, 2018 at 6:42 AM, Roman Kapl <c...@rkapl.cz> wrote:

Tunnel devices often use skb_dst(skb)->ops, but ops are not implemented
for metadata tunnel destinations. Use skb_valid_dst to check if skb_dst
is a real (non-metadata) destination.

Such packets can easily be crafted using tc tunnel_key actions or BPF
and will crash the tunnels, as observed at least with sit, geneve and
vxlan. Some of these tunnels are also intended to be used with metadata
destinations, so this represents a loss of functionality.

There might be more places with similar patterns, but sometimes it is
hard to tell if metadata dst packets can reach them, so this patch is
not exhaustive.

This patch is trying to fix a lot of places, and there may be more.
But all because of the lack of .update_pmtu or .neigh_lookup.
(dst_link_failure is safe, btw)

Not sure if it will be better to add .update_pmtu, .neigh_lookup and
even .redirect for md_dst_ops, but just leave them empty ?
like fake_dst_ops in br_nf.

That's what I was suggesting in the original mail. However, it might
have a slight impact on performance. Also, it was not done when metadata
destinations were originally added; was there a good reason?

Let's see other's opinions.

Exactly.
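The "empty ops" alternative discussed above (as br_netfilter does with fake_dst_ops) can be contrasted with a NULL table in a simplified userspace model. All types and names below are illustrative stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* A metadata dst whose ops table leaves update_pmtu NULL crashes when a
 * tunnel calls it blindly; installing a harmless no-op stub (the
 * fake_dst_ops approach) makes the blind call safe. */

struct dst_ops {
	void (*update_pmtu)(int *pmtu, int new_pmtu);
};

static void noop_update_pmtu(int *pmtu, int new_pmtu)
{
	/* metadata dsts cache no route state: nothing to update */
	(void)pmtu;
	(void)new_pmtu;
}

const struct dst_ops md_dst_ops_null = { .update_pmtu = NULL };
const struct dst_ops md_dst_ops_stub = { .update_pmtu = noop_update_pmtu };

/* A tunnel xmit path that guards the call; with the stub variant the
 * guard would be unnecessary, which is the point of that option. */
int xmit_update_pmtu(const struct dst_ops *ops, int *pmtu, int new_pmtu)
{
	if (!ops->update_pmtu)
		return -1;	/* would be a NULL dereference if called blindly */
	ops->update_pmtu(pmtu, new_pmtu);
	return 0;
}
```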



Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
  drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 ++-
  drivers/net/geneve.c| 5 +++--
  drivers/net/vxlan.c | 5 +++--
  drivers/s390/net/qeth_l3_main.c | 7 +--
  include/net/ip6_tunnel.h| 1 +
  net/ipv4/ip_tunnel.c| 2 +-
  net/ipv4/ip_vti.c   | 3 ++-
  net/ipv6/sit.c  | 7 ---
  8 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 2c13123bfd69..65eacafc898f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -40,6 +40,7 @@
  #include 
  #include 
  #include 
+#include <net/dst_metadata.h>

  #include "ipoib.h"

@@ -1456,7 +1457,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 struct ipoib_dev_priv *priv = ipoib_priv(dev);
 int e = skb_queue_empty(&priv->cm.skb_queue);

-   if (skb_dst(skb))
+   if (skb_valid_dst(skb))
 skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);

 skb_queue_tail(&priv->cm.skb_queue, skb);
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 195e0d0add8d..2e5d2bc8d851 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include <net/dst_metadata.h>

  #define GENEVE_NETDEV_VER  "0.6"

@@ -825,7 +826,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 if (IS_ERR(rt))
 return PTR_ERR(rt);

-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
 int mtu = dst_mtu(&rt->dst) - sizeof(struct iphdr) -
   GENEVE_BASE_HLEN - info->options_len - 14;

@@ -871,7 +872,7 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 if (IS_ERR(dst))
 return PTR_ERR(dst);

-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
 int mtu = dst_mtu(dst) - sizeof(struct ipv6hdr) -
   GENEVE_BASE_HLEN - info->options_len - 14;

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 82090ae7ced1..b0e96746bc2e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include <net/dst_metadata.h>

  #if IS_ENABLED(CONFIG_IPV6)
  #include 
@@ -2155,7 +2156,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 }

 ndst = &rt->dst;
-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
 int mtu = dst_mtu(ndst) - VXLAN_HEADROOM;

 skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL,
@@ -2197,7 +2198,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 goto out_unlock;
 }

-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
 int mtu = dst_mtu(ndst) - VXLAN6_HEADROOM;

 skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL,
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index b0c888e86cd4..694efab9cccd 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include <net/dst_metadata.h>
  #include 

  #include "qeth_l3

[PATCH] net: make sure skb_dst is valid before using

2018-01-23 Thread Roman Kapl
Tunnel devices often use skb_dst(skb)->ops, but ops are not implemented
for metadata tunnel destinations. Use skb_valid_dst to check if skb_dst
is a real (non-metadata) destination.

Such packets can easily be crafted using tc tunnel_key actions or BPF
and will crash the tunnels, as observed at least with sit, geneve and
vxlan. Some of these tunnels are also intended to be used with metadata
destinations, so this represents a loss of functionality.

There might be more places with similar patterns, but sometimes it is
hard to tell if metadata dst packets can reach them, so this patch is
not exhaustive.

Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 3 ++-
 drivers/net/geneve.c| 5 +++--
 drivers/net/vxlan.c | 5 +++--
 drivers/s390/net/qeth_l3_main.c | 7 +--
 include/net/ip6_tunnel.h| 1 +
 net/ipv4/ip_tunnel.c| 2 +-
 net/ipv4/ip_vti.c   | 3 ++-
 net/ipv6/sit.c  | 7 ---
 8 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 2c13123bfd69..65eacafc898f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include <net/dst_metadata.h>
 
 #include "ipoib.h"
 
@@ -1456,7 +1457,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
struct ipoib_dev_priv *priv = ipoib_priv(dev);
int e = skb_queue_empty(&priv->cm.skb_queue);
 
-   if (skb_dst(skb))
+   if (skb_valid_dst(skb))
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
 
skb_queue_tail(&priv->cm.skb_queue, skb);
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 195e0d0add8d..2e5d2bc8d851 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <net/dst_metadata.h>
 
 #define GENEVE_NETDEV_VER  "0.6"
 
@@ -825,7 +826,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
if (IS_ERR(rt))
return PTR_ERR(rt);
 
-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
int mtu = dst_mtu(&rt->dst) - sizeof(struct iphdr) -
  GENEVE_BASE_HLEN - info->options_len - 14;
 
@@ -871,7 +872,7 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
if (IS_ERR(dst))
return PTR_ERR(dst);
 
-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
int mtu = dst_mtu(dst) - sizeof(struct ipv6hdr) -
  GENEVE_BASE_HLEN - info->options_len - 14;
 
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 82090ae7ced1..b0e96746bc2e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <net/dst_metadata.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
@@ -2155,7 +2156,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
}
 
ndst = &rt->dst;
-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
int mtu = dst_mtu(ndst) - VXLAN_HEADROOM;
 
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL,
@@ -2197,7 +2198,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
goto out_unlock;
}
 
-   if (skb_dst(skb)) {
+   if (skb_valid_dst(skb)) {
int mtu = dst_mtu(ndst) - VXLAN6_HEADROOM;
 
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL,
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index b0c888e86cd4..694efab9cccd 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <net/dst_metadata.h>
 #include 
 
 #include "qeth_l3.h"
@@ -2259,9 +2260,11 @@ static int qeth_l3_get_cast_type(struct sk_buff *skb)
struct dst_entry *dst;
 
rcu_read_lock();
-   dst = skb_dst(skb);
-   if (dst)
+   if (skb_valid_dst(skb)) {
+   dst = skb_dst(skb);
n = dst_neigh_lookup_skb(dst, skb);
+   }
+
if (n) {
int cast_type = n->type;
 
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 236e40ba06bf..1a704dfe5e1a 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -148,6 +148,7 @@ struct net *ip6_tnl_get_link_net(const struct net_device *dev);
 int ip6_tnl_get_iflink(const struct net_device *dev);
 int ip6_tnl_change_mtu(st

Re: [PATCH net] vxlan: update skb dst pmtu on tx path

2018-01-22 Thread Roman Kapl

On 12/18/2017 07:20 AM, Xin Long wrote:

Unlike ip tunnels, now vxlan doesn't do any pmtu update for
upper dst pmtu, even if it doesn't match the lower dst pmtu
any more.

The problem can be reproduced when reducing the vxlan lower
dev's pmtu when running netperf. In jianlin's testing, the
performance went to 1/7 of the previous.

This patch is to update the upper dst pmtu to match the lower
dst pmtu on tx path so that packets can be sent out even when
lower dev's pmtu has been changed.

It also works for metadata dst.

Sorry for resurrecting the conversation, but what exactly was meant by
the "works for metadata dst" remark? I am currently having kernel crashes
related to ops->update_pmtu == NULL on VXLAN xmit and I am using tunnel
metadata set by act_tunnel_key. Before this patch, it was OK.


My first thought was to add an "if (ops->update_pmtu != NULL)" check, but
it appears some thought was given to this. Or maybe md_dst_ops should
have an update_pmtu?


Thanks, Roman Kapl
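A rough userspace model of the fix Roman later posted ("net: make sure skb_dst is valid before using"): instead of NULL-checking every ops pointer at every call site, test whether the dst is a real routing dst at all, the way skb_valid_dst() checks the DST_METADATA flag. The structures below are simplified stand-ins for the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

#define DST_METADATA 0x1u

struct dst_entry {
	unsigned int flags;
	/* NULL for metadata dsts, which is what crashed the tunnels */
	void (*update_pmtu)(struct dst_entry *dst, int mtu);
};

/* Mirrors skb_valid_dst(): a dst exists and is not metadata-only. */
bool dst_is_valid(const struct dst_entry *dst)
{
	return dst != NULL && !(dst->flags & DST_METADATA);
}

void real_update_pmtu(struct dst_entry *dst, int mtu)
{
	(void)dst;
	(void)mtu;
}

/* Tunnel tx sketch: only touch ops on real (non-metadata) dsts. */
int tunnel_xmit(struct dst_entry *dst, int mtu)
{
	if (dst_is_valid(dst)) {
		dst->update_pmtu(dst, mtu); /* safe: real dsts implement it */
		return 1;
	}
	return 0; /* metadata dst (or none): skip the pmtu update */
}
```

The design choice over a per-call NULL check is that one predicate covers every ops member (update_pmtu, neigh_lookup, redirect) at once.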


[PATCH v3] net: sched: crash on blocks with goto chain action

2017-11-29 Thread Roman Kapl
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and so can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment). The assumption was broken by the patch series
c7e460ce5572..623859ae06b8 ("Merge branch 'net-sched-race-fix'").

Prevent the use-after-free by holding all chains (except 0, that one is
already held) as it was done before
822e86d997e4 ("net_sched: remove tcf_block_put_deferred()").

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action
goto chain 2

Fixes: 623859ae06b8 ("Merge branch 'net-sched-race-fix'")
Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
v1 -> v2: Hold all chains instead of just the currently iterated one,
  the code should be more clear this way.
v2 -> v3: Point out where the chains will be released.
  Blame the correct patch series (the one that broke the assumption).
---
 net/sched/cls_api.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7d97f612c9b9..24a3593ed88a 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -336,7 +336,8 @@ static void tcf_block_put_final(struct work_struct *work)
struct tcf_chain *chain, *tmp;
 
rtnl_lock();
-   /* Only chain 0 should be still here. */
+
+   /* At this point, all the chains should have refcnt == 1. */
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
tcf_chain_put(chain);
rtnl_unlock();
@@ -344,15 +345,22 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   /* Hold a refcnt for all chains, except 0 (which is held anyway), so
+* that they don't disappear while we are iterating. We will release
+* them in tcf_block_put_final, including finally releasing chain 0.
+*/
+   list_for_each_entry(chain, &block->chain_list, list)
+   if (chain->index)
+   tcf_chain_hold(chain);
+
+   list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
tcf_block_offload_unbind(block, q, ei);
-- 
2.15.0



[PATCH v2] net: sched: crash on blocks with goto chain action

2017-11-24 Thread Roman Kapl
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding all chains (except 0, that one is
already held). foreach_safe is not enough in this case.

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action
goto chain 2

Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
v1 -> v2: Hold all chains instead of just the currently iterated one,
  the code should be more clear this way.
---
 net/sched/cls_api.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7d97f612c9b9..ddcf04b4ab43 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -336,7 +336,8 @@ static void tcf_block_put_final(struct work_struct *work)
struct tcf_chain *chain, *tmp;
 
rtnl_lock();
-   /* Only chain 0 should be still here. */
+
+   /* At this point, all the chains should have refcnt == 1. */
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
tcf_chain_put(chain);
rtnl_unlock();
@@ -344,15 +345,21 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   /* Hold a refcnt for all chains, except 0, so that they don't disappear
+* while we are iterating.
+*/
+   list_for_each_entry(chain, &block->chain_list, list)
+   if (chain->index)
+   tcf_chain_hold(chain);
+
+   list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
tcf_block_offload_unbind(block, q, ei);
-- 
2.15.0



Re: [PATCH] net: sched: crash on blocks with goto chain action

2017-11-21 Thread Roman Kapl

On 11/21/2017 08:31 PM, Cong Wang wrote:

On Mon, Nov 20, 2017 at 1:41 PM, Roman Kapl <c...@rkapl.cz> wrote:

On 11/20/2017 06:54 PM, Cong Wang wrote:

On Sun, Nov 19, 2017 at 8:17 AM, Roman Kapl <c...@rkapl.cz> wrote:

tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding the current list element we are
iterating over (foreach_safe is not enough).

Hmm...

Looks like we need to restore the trick we used previously, that is
holding refcnt for all list entries before this list iteration.


Was there a reason to hold all list entries in that trick? I thought that
holding just the current element would be enough, but maybe not.


Yes, let me quote Jiri's explanation:

"
The reason for the hold above was to avoid use after free in this loop.
Consider following example:

chain1
   1 filter with action goto_chain 2
chain2
   empty
I believe the exact same example is part of the 'how to reproduce' part 
of commit and the patch helped me get rid of that crash.


Now in your list_for_each_entry_safe loop,
Note that list_for_each_entry_safe was replaced by pure 
list_for_each_entry in my proposed patch.

chain1 is flushed, action is
removed and chain is put:
tcf_action_goto_chain_fini->tcf_chain_put(2)

Given the fact chain2 is empty, this put would lead to tcf_chain_destroy(2)

Then in another iteration of list_for_each_entry_safe you are using
already freed chain.
"


No, I believe that the last iteration would simply stop, because at the 
point you reach second iteration, chain->next == head.


But maybe the "hold all chains" approach from 822e86d997 (net_sched: 
remove tcf_block_put_deferred())  is simpler to understand?
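Jiri's scenario can be modeled in a few lines of userspace C. Flushing chain i drops the goto-chain reference its filter holds on chain i+1, so without extra references a later iteration can touch freed memory; holding every chain up front (as 822e86d997e4 did, and as the v2/v3 patches restore) keeps each node alive across the whole walk. Arrays stand in for the chain list; nothing here is kernel code.

```c
#define NCHAINS 3

int refcnt[NCHAINS];
int freed[NCHAINS];

static void chain_put(int i)
{
	if (--refcnt[i] == 0)
		freed[i] = 1;	/* stands in for tcf_chain_destroy() */
}

/* Flushing chain i releases the goto-chain ref on chain i+1. */
static void chain_flush(int i)
{
	if (i + 1 < NCHAINS)
		chain_put(i + 1);
}

/* The broken walk: no extra references held during iteration. */
int unsafe_walk(void)
{
	int i;

	for (i = 0; i < NCHAINS; i++) {
		if (freed[i])
			return -1;	/* use-after-free in the kernel */
		chain_flush(i);
	}
	return 0;
}

/* The fixed walk: hold everything first, release afterwards. */
int safe_walk(void)
{
	int i;

	for (i = 0; i < NCHAINS; i++)
		refcnt[i]++;		/* hold all chains up front */
	for (i = 0; i < NCHAINS; i++) {
		if (freed[i])
			return -1;
		chain_flush(i);
	}
	for (i = 0; i < NCHAINS; i++)
		chain_put(i);		/* as tcf_block_put_final does */
	return 0;
}
```

With each chain starting at refcount 1, the unholding walk frees chain 1 while flushing chain 0 and then reads it; the holding walk completes and only then lets chains 1 and 2 be destroyed.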




Re: [PATCH] net: sched: crash on blocks with goto chain action

2017-11-20 Thread Roman Kapl

On 11/20/2017 06:54 PM, Cong Wang wrote:

On Sun, Nov 19, 2017 at 8:17 AM, Roman Kapl <c...@rkapl.cz> wrote:

tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding the current list element we are
iterating over (foreach_safe is not enough).

Hmm...

Looks like we need to restore the trick we used previously, that is
holding refcnt for all list entries before this list iteration.


Was there a reason to hold all list entries in that trick? I thought
that holding just the current element would be enough, but maybe not.




[PATCH] net: sched: crash on blocks with goto chain action

2017-11-20 Thread Roman Kapl
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding the current list element we are
iterating over (foreach_safe is not enough).

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action
goto chain 2

Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
The mail was originally rejected by vger; this is a re-send to netdev@vger only
(with the same message ID). Sorry for any confusion.
---
 net/sched/cls_api.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7d97f612c9b9..58fed2ea3379 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -344,16 +344,26 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain, *last = NULL;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   list_for_each_entry(chain, &block->chain_list, list) {
+   if (last)
+   tcf_chain_put(last);
+   /* Flushing a chain may release any other chain in this block,
+* so we have to hold on to the chain across the flush to know
+* which one comes next.
+*/
+   tcf_chain_hold(chain);
tcf_chain_flush(chain);
+   last = chain;
+   }
+   if (last)
+   tcf_chain_put(last);
 
tcf_block_offload_unbind(block, q, ei);
 
-- 
2.15.0



[PATCH v2] net: sched: fix crash when deleting secondary chains

2017-11-20 Thread Roman Kapl
If you flush (delete) a filter chain other than chain 0 (such as when
deleting the device), the kernel may run into a use-after-free. The
chain refcount must not be decremented unless we are sure we are done
with the chain.

To reproduce the bug, run:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: flower
ip link del dtest

Introduced in commit f93e1cdcf42c ("net/sched: fix filter flushing"),
but unless you have KASAN or luck, you won't notice it until
commit 0dadc117ac8b ("cls_flower: use tcf_exts_get_net() before call_rcu()")

Fixes: f93e1cdcf42c ("net/sched: fix filter flushing")
Acked-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: Roman Kapl <c...@rkapl.cz>
---
v1 -> v2: Added Fixes and Acked-by tags

The mail was originally rejected by vger; this is a re-send to netdev@vger only
(with the same message ID). Sorry for any confusion.
---

 net/sched/cls_api.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ab255b421781..7d97f612c9b9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -205,13 +205,14 @@ static void tcf_chain_head_change(struct tcf_chain *chain,
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-   struct tcf_proto *tp;
+   struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
 
tcf_chain_head_change(chain, NULL);
-   while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
+   while (tp) {
RCU_INIT_POINTER(chain->filter_chain, tp->next);
-   tcf_chain_put(chain);
tcf_proto_destroy(tp);
+   tp = rtnl_dereference(chain->filter_chain);
+   tcf_chain_put(chain);
}
 }
 
-- 
2.15.0



[PATCH v2] net: move somaxconn init from sysctl code

2017-05-24 Thread Roman Kapl
The default value for somaxconn is set in sysctl_core_net_init(), but this
function is not called when kernel is configured without CONFIG_SYSCTL.

This results in the kernel not being able to accept TCP connections,
because the backlog has zero size. Usually, the user ends up with:
"TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request.  
Check SNMP counters."
If SYN cookies are not enabled the connection is rejected.

Before ef547f2ac16 (tcp: remove max_qlen_log), the effects were less
severe, because the backlog was always at least eight slots long.

Signed-off-by: Roman Kapl <roman.k...@sysgo.com>
---
 net/core/net_namespace.c   | 19 +++
 net/core/sysctl_net_core.c |  2 --
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1934efd..26bbfab 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -315,6 +315,25 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
goto out;
 }
 
+static int __net_init net_defaults_init_net(struct net *net)
+{
+   net->core.sysctl_somaxconn = SOMAXCONN;
+   return 0;
+}
+
+static struct pernet_operations net_defaults_ops = {
+   .init = net_defaults_init_net,
+};
+
+static __init int net_defaults_init(void)
+{
+   if (register_pernet_subsys(&net_defaults_ops))
+   panic("Cannot initialize net default settings");
+
+   return 0;
+}
+
+core_initcall(net_defaults_init);
 
 #ifdef CONFIG_NET_NS
 static struct ucounts *inc_net_namespaces(struct user_namespace *ns)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index ea23254..b7cd9aa 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -479,8 +479,6 @@ static __net_init int sysctl_core_net_init(struct net *net)
 {
struct ctl_table *tbl;
 
-   net->core.sysctl_somaxconn = SOMAXCONN;
-
tbl = netns_core_table;
if (!net_eq(net, &init_net)) {
tbl = kmemdup(tbl, sizeof(netns_core_table), GFP_KERNEL);
-- 
2.10.1



Re: [PATCH] net: set default value for somaxconn

2017-05-23 Thread Roman Kapl

On 05/22/2017 08:18 PM, David Miller wrote:

From: Roman Kapl <roman.k...@sysgo.com>
Date: Mon, 22 May 2017 14:22:41 +0200


The default value for somaxconn is set in sysctl_core_net_init(), but this
function is not called when kernel is configured without CONFIG_SYSCTL.

This results in the kernel not being able to accept TCP connections,
because the backlog has zero size. Usually, the user ends up with:
"TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request.  Check 
SNMP counters."

Before ef547f2ac16 (tcp: remove max_qlen_log), the effects were less
severe, because the backlog was always at least eight slots long.

Signed-off-by: Roman Kapl <roman.k...@sysgo.com>

I see the problem, but this changes behavior.

Existing code will set somaxconn to the default value for every namespace
that is created.

But with your changes, any modification made to init_net's value will
get inherited by any namespace created as a child thereafter.

Hmm, I cannot find where somaxconn is inherited. But I see this needs
to be per-net.


You really need to make this happen via pernet_operations, or even
inside of setup_net().

I will create a new pernet_operations in net_namespace.c, move the
somaxconn initialization there, and send a new patch.


Thanks for comments, Roman Kapl


[PATCH] net: set default value for somaxconn

2017-05-22 Thread Roman Kapl
The default value for somaxconn is set in sysctl_core_net_init(), but this
function is not called when kernel is configured without CONFIG_SYSCTL.

This results in the kernel not being able to accept TCP connections,
because the backlog has zero size. Usually, the user ends up with:
"TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request.  
Check SNMP counters."

Before ef547f2ac16 (tcp: remove max_qlen_log), the effects were less
severe, because the backlog was always at least eight slots long.

Signed-off-by: Roman Kapl <roman.k...@sysgo.com>
---
 net/core/net_namespace.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1934efd..4f3bbff 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -37,6 +37,9 @@ EXPORT_SYMBOL_GPL(net_namespace_list);
 struct net init_net = {
.count  = ATOMIC_INIT(1),
.dev_base_head  = LIST_HEAD_INIT(init_net.dev_base_head),
+   .core = {
+   .sysctl_somaxconn = SOMAXCONN,
+   },
 };
 EXPORT_SYMBOL(init_net);
 
-- 
2.10.1