Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Shahar Klein



On 12/21/2016 7:04 PM, Daniel Borkmann wrote:

Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus the kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.
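
For illustration, the walk in tc_classify() that ends up spinning is roughly the
following (condensed sketch of the loop only; the protocol check and the
TC_ACT_RECLASSIFY handling are omitted here):

	for (; tp; tp = rcu_dereference_bh(tp->next)) {
		int err;

		/* If a broken update linked the chain such that tp->next == tp,
		 * this walk never advances past tp and never terminates, which
		 * is the soft lockup observed once packets hit the data path.
		 */
		err = tp->classify(skb, tp, res);
		if (err >= 0)
			return err;
	}
	return -1;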

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein <shah...@mellanox.com>
Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
Cc: Cong Wang <xiyou.wangc...@gmail.com>
---
 Shahar, you mentioned you wanted to run again later without the debug
 printk's. Once you do that and come to the same result again, please
 feel free to reply with a Tested-by or such. Thanks everyone!

 net/sched/cls_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..1ecdf80 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 	unsigned long cl;
 	unsigned long fh;
 	int err;
-	int tp_created = 0;
+	int tp_created;
 
 	if ((n->nlmsg_type != RTM_GETTFILTER) &&
 	    !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 replay:
+	tp_created = 0;
+
 	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
 	if (err < 0)
 		return err;



I've tested this specific patch (without Cong's patch and without the 
many prints) many times and in many test permutations today, and it 
didn't lock up.


Thanks again Daniel!

Shahar


Re: Soft lockup in tc_classify

2016-12-21 Thread Shahar Klein



On 12/21/2016 12:15 PM, Daniel Borkmann wrote:

On 12/21/2016 08:03 AM, Cong Wang wrote:

On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shah...@mellanox.com>
wrote:

[...]

Looks like you added a debug printk inside tcf_destroy() too,
which seems racy with filter creation, it should not happen since
in both cases we take RTNL lock.

Don't know if changing all RCU_INIT_POINTER in that file to
rcu_assign_pointer could help anything or not. Mind to try?


I don't think at this point that it's RCU related at all.

I have a theory on what is happening. Quoting the piece in question from
Shahar's log:

 1: thread-2845[cpu-1] setting tp_created to 1 tp=94b5b0280780
back=94b9ea932060
 2: thread-2856[cpu-1] setting tp_created to 1 tp=94b9ea9322a0
back=94b9ea932060
 3: thread-2843[cpu-1] setting tp_created to 1 tp=94b5b402c960
back=94b9ea932060
 4: destroy 94b5b669fea0 tcf_destroy:1905
 5: thread-2853[cpu-1] setting tp_created to 1 tp=94b5b02805a0
back=94b9ea932060
 6: thread-2853[cpu-1] add/change filter by: fl_get [cls_flower]
tp=94b5b02805a0 tp->next=94b9ea932060
 7: destroy 94b5b0280780 tcf_destroy:1905
 8: thread-2845[cpu-1] add/change filter by: fl_get [cls_flower]
tp=94b5b02805a0 tp->next=94b5b02805a0

The interesting thing is that all this happens on CPU1, so as you say
we're under rtnl.
In 1), thread-2845 creates tp=94b5b0280780, which is destroyed in
7), presumably also
by thread-2845, and the weird part is why suddenly in 8) thread-2845
adds a created filter
without actually creating it. Plus, thread-2845 got interrupted, which
means it must have dropped rtnl in the middle. We drop it in tc_ctl_tfilter()
when we do tcf_proto_lookup_ops() and need to pull in a module, but here this
doesn't make sense at all since 1) at this point we haven't created the tp yet
and 2) flower was already there.
Thus the only explanation
where this must have happened is where we called tp->ops->change(). So
here the return
code must have been -EAGAIN, which makes sense because in 7) we
destroyed that specific
tp instance. Which means we goto replay but *do not* clear tp_created. I
think that is
the bug in question. So, while we dropped rtnl in the meantime, some
other tp instance
was added (tp=94b5b02805a0) that we had a match on in round 2, but
we still think it
was newly created which wasn't the actual case. So we'd need to deal
with the fact that
->change() callback could return -EAGAIN as well. Now looking at flower,
I think the call
chain must have been fl_change() -> fl_set_parms() ->
tcf_exts_validate() -> tcf_action_init()
-> tcf_action_init_1(). And here one possibility I see is that
tc_lookup_action_n()
failed, therefore we shortly dropped rtnl for the request_module() where
the module
got loaded successfully and thus error code from there is -EAGAIN that
got propagated
all the way through ->change() from tc_ctl_tfilter(). So it looks like a
generic issue
not specifically tied to flower.
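
To make the -EAGAIN path concrete, this is roughly the relevant piece of
tcf_action_init_1() (condensed paraphrase; error labels and the CONFIG_MODULES
guard are elided):

	a_o = tc_lookup_action_n(act_name);
	if (a_o == NULL) {
		rtnl_unlock();
		request_module("act_%s", act_name);
		rtnl_lock();

		a_o = tc_lookup_action_n(act_name);
		if (a_o != NULL) {
			/* rtnl was dropped for the module load, so even on
			 * success the caller is told to redo the request.
			 */
			err = -EAGAIN;
			goto err_mod;
		}
	}

and on the tc_ctl_tfilter() side that error then runs into (again condensed):

errout:
	if (cl)
		cops->put(q, cl);
	if (err == -EAGAIN)
		/* Replay the request. */
		goto replay;

with tp_created still carrying its value from the first round.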

Shahar, can you test the following? Thanks!

 net/sched/cls_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..1ecdf80 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 	unsigned long cl;
 	unsigned long fh;
 	int err;
-	int tp_created = 0;
+	int tp_created;
 
 	if ((n->nlmsg_type != RTM_GETTFILTER) &&
 	    !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 replay:
+	tp_created = 0;
+
 	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
 	if (err < 0)
 		return err;


The test ran successfully!
I'll remove all the other debug "fixes" and run again later.

Thanks Daniel!



Re: Soft lockup in tc_classify

2016-12-21 Thread Shahar Klein



On 12/21/2016 9:03 AM, Cong Wang wrote:

On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shah...@mellanox.com> wrote:


Tried it with same results


This piece is pretty interesting:

[  408.554689] DEBUGG:SK thread-2853[cpu-1] setting tp_created to 1
tp=94b5b02805a0 back=94b9ea932060
[  408.574258] DEBUGG:SK thread-2853[cpu-1] add/change filter by:
fl_get [cls_flower] tp=94b5b02805a0 tp->next=94b9ea932060
[  408.587849] DEBUGG:SK destroy 94b5b0280780 tcf_destroy:1905
[  408.595862] DEBUGG:SK thread-2845[cpu-1] add/change filter by:
fl_get [cls_flower] tp=94b5b02805a0 tp->next=94b5b02805a0

Looks like you added a debug printk inside tcf_destroy() too,
which seems racy with filter creation, it should not happen since
in both cases we take RTNL lock.

Don't know if changing all RCU_INIT_POINTER in that file to
rcu_assign_pointer could help anything or not. Mind to try?
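
(For anyone following along, the difference between the two primitives is only
on the publish side: rcu_assign_pointer() has release semantics, so all prior
initialization of the object is ordered before the pointer becomes visible to
readers, while RCU_INIT_POINTER() is a plain store that is only safe when the
pointer is NULL or the object is not (yet, or no longer) reachable by readers.
A minimal sketch in the spirit of the linking code in tc_ctl_tfilter(), not the
actual code:

	struct tcf_proto *tp = kzalloc(sizeof(*tp), GFP_KERNEL);

	tp->ops      = tp_ops;			/* fully initialize first ... */
	tp->classify = tp_ops->classify;
	RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));	/* tp not yet visible */

	rcu_assign_pointer(*back, tp);		/* ... then publish; readers doing
						 * rcu_dereference() never see a
						 * half-initialized tp */

As the rest of the thread shows, though, the lockup here turned out not to be
an RCU ordering problem in the first place.)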



Tried it with same results



Thanks for debugging!

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..b8a66d8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -305,7 +305,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			kfree(tp);
 			goto errout;
 		}
-
+		printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] setting tp_created to 1 tp=%p back=%p\n", current->pid, current->on_cpu, tp, rtnl_dereference(*back));
 		tp_created = 1;
 
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
@@ -317,11 +317,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
 			struct tcf_proto *next = rtnl_dereference(tp->next);
 
-			RCU_INIT_POINTER(*back, next);
+			printk(KERN_ERR "DEBUGG:SK delete filter by: %pf\n", tp->ops->get);
+
+			rcu_assign_pointer(*back, next);
 
 			tfilter_notify(net, skb, n, tp, fh,
 				       RTM_DELTFILTER, false);
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 			err = 0;
 			goto errout;
 		}
@@ -331,25 +333,30 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		    !(n->nlmsg_flags & NLM_F_CREATE))
 			goto errout;
 	} else {
+		bool last;
+
 		switch (n->nlmsg_type) {
 		case RTM_NEWTFILTER:
 			err = -EEXIST;
 			if (n->nlmsg_flags & NLM_F_EXCL) {
 				if (tp_created)
-					tcf_destroy(tp, true);
+					tcf_destroy(tp);
 				goto errout;
 			}
 			break;
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {
-				struct tcf_proto *next = rtnl_dereference(tp->next);
-
 				tfilter_notify(net, skb, n, tp,
 					       t->tcm_handle,
 					       RTM_DELTFILTER, false);
-				if (tcf_destroy(tp, false))
-					RCU_INIT_POINTER(*back, next);
+				if (last) {
+					struct tcf_proto *next = rtnl_dereference(tp->next);
+
+					rcu_assign_pointer(*back, next);
+					tcf_destroy(tp);
+				}
 			}
 			goto errout;
 		case RTM_GETTFILTER:
@@ -366,13 +373,14 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE);
 	if (err == 0) {
 		if (tp_created) {
-			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
+			rcu_assign_pointer(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
+			printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] add/change filter by: %pf tp=%p tp->next=%p\n", current->pid, current->on_cpu, tp->ops->get, tp, tp->next);
 		}
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
 	} else {
 		if (tp_created)
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 	}
 
 errout:


Re: Soft lockup in tc_classify

2016-12-21 Thread Shahar Klein



On 12/20/2016 1:47 PM, Daniel Borkmann wrote:

Hi Shahar,

On 12/20/2016 07:22 AM, Shahar Klein wrote:

On 12/19/2016 7:58 PM, Cong Wang wrote:

On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shah...@mellanox.com>
wrote:

On 12/13/2016 12:51 AM, Cong Wang wrote:


On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz...@gmail.com>
wrote:


On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann
<dan...@iogearbox.net>
wrote:


Note that there's still the RCU fix missing for the deletion race
that
Cong will still send out, but you say that the only thing you do
is to
add a single rule, but no other operation in involved during that
test?


What's missing to have the deletion race fixed? making a patch or
testing to a patch which was sent?


If you think it would help for this problem, here is my patch rebased
on the latest net-next.

Again, I don't see how it could help this case yet, especially I don't
see how we could have a loop in this singly linked list.


I've applied cong's patch and hit a different lockup(full log
attached):


Are you sure this is really different? For me, it is still inside the
loop
in tc_classify(), with only a slightly different offset.



Daniel suggested I add a print:
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {

and I couldn't see this print in the output.


Hmm, that is odd, if this never prints, then my patch should not make
any
difference.

There are still two other cases where we could change tp->next, so do
you
mind to add two more printk's for debugging?

Attached is the delta patch.

Thanks!


I've added a slightly different debug print:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (tp_created) {
 			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
+			printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
 		}
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);


I'm curious, could you be a bit more verbose why you didn't go with Cong's
debug patch?

In particular, why you removed the hunk from the condition
'n->nlmsg_type ==
RTM_DELTFILTER && t->tcm_handle == 0' where we delete the whole tp
instance?


I didn't remove this hunk, just added more. Anyway, I'm attaching the 
diff to make sure I didn't miss anything.

I've added some more prints in the destroy path and added some info from 
current (pid/cpu); the run log is attached as well.



Is it because if you have that printk() there, then the issue doesn't
trigger
for you anymore? Or any other reason?

How many CPUs does your test machine have, I suspect more than 1, right?

So iff RTM_DELTFILTER with tcm_handle of 0 really played a role in this,
I'm
wondering whether there was a subtle deletion + add race where the newly
added
filter on the other CPU still saw a stale pointer in the list. But just
a wild
guess at this point.

Hmm, could you try this below to see whether the issue still appears?

Thanks,
Daniel

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..4eee1cb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -317,7 +317,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 	if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
 		struct tcf_proto *next = rtnl_dereference(tp->next);
 
-		RCU_INIT_POINTER(*back, next);
+		rcu_assign_pointer(*back, next);
 
 		tfilter_notify(net, skb, n, tp, fh,
 			       RTM_DELTFILTER, false);




Tried it with same results


full output attached:

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9432d704df60 tp->next=  (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d240 tp->next=  (null)
[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d3c0 tp->next=9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower]
tp=9436e718d3c0 tp->next=9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
[swapper/0:0]

Thanks
Shahar

Re: Soft lockup in tc_classify

2016-12-20 Thread Shahar Klein



On 12/19/2016 7:58 PM, Cong Wang wrote:

Hello,

On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shah...@mellanox.com> wrote:



On 12/13/2016 12:51 AM, Cong Wang wrote:


On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz...@gmail.com> wrote:


On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <dan...@iogearbox.net>
wrote:


Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?



What's missing to have the deletion race fixed? making a patch or
testing to a patch which was sent?



If you think it would help for this problem, here is my patch rebased
on the latest net-next.

Again, I don't see how it could help this case yet, especially I don't
see how we could have a loop in this singly linked list.



I've applied cong's patch and hit a different lockup(full log attached):



Are you sure this is really different? For me, it is still inside the loop
in tc_classify(), with only a slightly different offset.




Daniel suggested I add a print:
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {

and I couldn't see this print in the output.


Hmm, that is odd, if this never prints, then my patch should not make any
difference.

There are still two other cases where we could change tp->next, so do you
mind to add two more printk's for debugging?

Attached is the delta patch.

Thanks!



I've added a slightly different debug print:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (tp_created) {
 			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
+			printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
 		}
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);

full output attached:

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9432d704df60 tp->next=  (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d240 tp->next=  (null)

[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d3c0 tp->next=9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d3c0 tp->next=9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=9436e718d3c0 tp->next=9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! 
[swapper/0:0]



Thanks
Shahar




Re: Soft lockup in tc_classify

2016-12-19 Thread Shahar Klein



On 12/13/2016 12:51 AM, Cong Wang wrote:

On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz  wrote:

On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann  wrote:


Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?


What's missing to have the deletion race fixed? making a patch or
testing to a patch which was sent?


If you think it would help for this problem, here is my patch rebased
on the latest net-next.

Again, I don't see how it could help this case yet, especially I don't
see how we could have a loop in this singly linked list.



I've applied cong's patch and hit a different lockup(full log attached):

[  264.725414] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]

(gdb) list *(fl_classify+0x1f1)
0x1591 is in fl_classify (net/sched/cls_flower.c:187).
182 if (f && !tc_skip_sw(f->flags)) {
183 *res = f->res;
184 return tcf_exts_exec(skb, &f->exts, res);
185 }
186 return -1;
187 }
188 
189 static int fl_init(struct tcf_proto *tp)
190 {
191 struct cls_fl_head *head;
(gdb)

Daniel suggested I add a print:
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {

and I couldn't see this print in the output.




[  238.507670] GACT probability on
[  264.573942] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
[modprobe:4867]
[  264.582875] Modules linked in: act_gact act_mirred openvswitch 
nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd 
vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 
nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 
xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter 
ip6_tables netconsole bridge stp llc rpcrdma ib_isert iscsi_target_mod ib_iser 
libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac 
edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm ib_core irqbypass 
crct10dif_pclmul crc32_pclmul igb ptp iTCO_wdt pps_core crc32c_intel joydev 
i2c_algo_bit ghash_clmulni_intel iTCO_vendor_support pcspkr i2c_i801 mei_me 
ipmi_ssif ioatdma shpchp tpm_tis i2c_smbus tpm_tis_core dca tpm mei lpc_ich 
ipmi_si ipmi_msghandler wmi nfsd target_core_mod auth_rpcgss nfs_acl lockd 
grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  264.703539] CPU: 0 PID: 4867 Comm: modprobe Not tainted 4.9.0+ #27
[  264.710766] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  264.718475] task: 89916ad09d80 task.stack: a36f4451
[  264.725414] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]
[  264.731863] RSP: 0018:89916fa03ad8 EFLAGS: 0246 ORIG_RAX: 
ff10
[  264.740971] RAX:  RBX: 89913f522100 RCX: 
[  264.749265] RDX: 89916fa03c98 RSI: 89915bd743c0 RDI: 89913f522100
[  264.757558] RBP: 89916fa03c28 R08: 270f R09: 
[  264.765852] R10:  R11: 0004 R12: 89916fa03c98
[  264.774144] R13: 89913ecc7c00 R14: 89915bd743c0 R15: 0001
[  264.782440] FS:  7f74f2f5e700() GS:89916fa0() 
knlGS:
[  264.792066] CS:  0010 DS:  ES:  CR0: 80050033
[  264.798809] CR2: 7f74f2491480 CR3: 00043f3f8000 CR4: 000426f0
[  264.807102] Call Trace:
[  264.810151]  
[  264.812717]  ? handle_edge_irq+0x87/0x130
[  264.817504]  ? handle_irq+0xab/0x130
[  264.821808]  ? irq_exit+0x77/0xe0
[  264.825828]  ? irq_exit+0x77/0xe0
[  264.829849]  ? do_IRQ+0x51/0xd0
[  264.833664]  ? common_interrupt+0x93/0x93
[  264.838462]  tc_classify+0x78/0x120
[  264.842679]  __netif_receive_skb_core+0x623/0xa00
[  264.848264]  ? udp4_gro_receive+0x10b/0x2d0
[  264.853253]  __netif_receive_skb+0x18/0x60
[  264.858150]  netif_receive_skb_internal+0x40/0xb0
[  264.863723]  napi_gro_receive+0xcd/0x120
[  264.868437]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  264.875105]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  264.881462]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  264.887333]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  264.893102]  net_rx_action+0x260/0x3c0
[  264.897610]  __do_softirq+0xc9/0x28c
[  264.901922]  irq_exit+0xd7/0xe0
[  264.905748]  do_IRQ+0x51/0xd0
[  264.909380]  common_interrupt+0x93/0x93
[  264.913984] RIP: 0010:mpihelp_submul_1+0x91/0xe0
[  264.919460] RSP: 0018:a36f44513940 EFLAGS: 0206 

Re: Soft lockup in tc_classify

2016-12-13 Thread Shahar Klein



On 12/12/2016 9:07 PM, Cong Wang wrote:

On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein <shah...@mellanox.com> wrote:



On 12/12/2016 3:28 PM, Daniel Borkmann wrote:


Hi Shahar,

On 12/12/2016 10:43 AM, Shahar Klein wrote:


Hi All,

sorry for the spam, the first time was sent with html part and was
rejected.

We observed an issue where a classifier instance's next member is
pointing back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding
a new flower rule using tc.

We added a quick workaround to verify it:

In tc_classify:

 for (; tp; tp = rcu_dereference_bh(tp->next)) {
 int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl
and rcu locks.
We'll appreciate your help solving this issue.



Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?


Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.



Do you have a script and kernel .config for reproducing this?



I'm using a user space socket app (https://github.com/shahar-klein/noodle) on
a vm to push udp packets from ~2000 different udp src ports, ramping up at
~100 per second, towards another vm on the same Hypervisor. Once the traffic
starts I'm pushing ingress flower tc udp rules (even_udp_src_port->mirred,
odd->drop) on the relevant representor in the Hypervisor.


Do you mind to share your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?



The bridge between the two vms is based on ovs and representors.
We have a dpif in the ovs creating tc rules from ovs rules.
We set up 5000 open flow rules that look like this:
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7000 actions=drop
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7002 actions=drop
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7004 actions=drop
.
.
.

and then fire up 2000 udp flows starting at udp src 7000, ramping up at 
100 flows per second, so after 20 seconds we are supposed to have 2000 
active udp flows with half of them dropped at the tc level.
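
For reference, the tc rules that the dpif ends up installing are roughly of
this shape (illustrative only; the representor names, priorities and the exact
port/action pairing below are made up, not taken from the real setup):

	tc qdisc add dev rep0 ingress

	# drop rule, matching the open flow rules quoted above
	tc filter add dev rep0 parent ffff: protocol ip prio 1 flower \
		ip_proto udp dst_mac 24:8a:07:38:a2:b2 src_port 7000 \
		action drop

	# mirred variant for the ports that are redirected instead of dropped
	tc filter add dev rep0 parent ffff: protocol ip prio 2 flower \
		ip_proto udp dst_mac 24:8a:07:38:a2:b2 src_port 7001 \
		action mirred egress redirect dev rep1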


The first packet of any such match hits the miss rule in the ovs 
datapath and is pushed up to the user space ovs, which consults the open 
flow rules above, translates the ovs rule to a tc rule and pushes the 
rule back to the kernel via netlink.
I'm not sure I understand what happens to the second packet of the same 
match, or to all the following packets of that match, until the tc 
datapath is 'ready' for them.


The soft lockup is easily reproducible using this scenario, but it won't 
happen if we start with a much lighter traffic scheme, say 100 udp 
flows at 3 per second.


I added a print and a panic when hitting the loop (output attached)

Also attached our .config




Thanks!



.config.gz
Description: application/gzip


tc_classify_panic.gz
Description: application/gzip


Soft lockup in tc_classify

2016-12-12 Thread Shahar Klein

Hi All,

sorry for the spam, the first time was sent with html part and was rejected.

We observed an issue where a classifier instance's next member is pointing 
back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding a 
new flower rule using tc.


We added a quick workaround to verify it:

In tc_classify:

for (; tp; tp = rcu_dereference_bh(tp->next)) {
int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this 
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl 
and rcu locks.

We'll appreciate your help solving this issue.

Thanks,
Shahar


Re: Soft lockup in tc_classify

2016-12-12 Thread Shahar Klein



On 12/12/2016 3:28 PM, Daniel Borkmann wrote:

Hi Shahar,

On 12/12/2016 10:43 AM, Shahar Klein wrote:

Hi All,

sorry for the spam, the first time was sent with html part and was
rejected.

We observed an issue where a classifier instance's next member is
pointing back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding
a new flower rule using tc.

We added a quick workaround to verify it:

In tc_classify:

 for (; tp; tp = rcu_dereference_bh(tp->next)) {
 int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl
and rcu locks.
We'll appreciate your help solving this issue.


Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?

Do you have a script and kernel .config for reproducing this?


I'm using a user space socket 
app (https://github.com/shahar-klein/noodle) on a vm to push udp packets 
from ~2000 different udp src ports, ramping up at ~100 per second, towards 
another vm on the same Hypervisor. Once the traffic starts I'm pushing 
ingress flower tc udp rules (even_udp_src_port->mirred, odd->drop) on the 
relevant representor in the Hypervisor.




Thanks,
Daniel