Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-26 Thread David Miller
From: Daniel Borkmann 
Date: Wed, 21 Dec 2016 18:04:11 +0100

> Shahar reported a soft lockup in tc_classify(), where we run into an
> endless loop when walking the classifier chain due to tp->next == tp
> which is a state we should never run into. The issue only seems to
> trigger under load in the tc control path.
> 
> What happens is that in tc_ctl_tfilter(), thread A allocates a new
> tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
> with it. In that classifier callback we had to unlock/lock the rtnl
> mutex and returned with -EAGAIN. One reason why we need to drop there
> is, for example, that we need to request an action module to be loaded.
> 
> This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
> after we loaded and found the requested action, we need to redo the
> whole request so we don't race against others. While we had to unlock
> rtnl in that time, thread B's request was processed next on that CPU.
> Thread B added a new tp instance successfully to the classifier chain.
> When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
> and destroying its tp instance which never got linked, we goto replay
> and redo A's request.
> 
> This time when walking the classifier chain in tc_ctl_tfilter() for
> checking for existing tp instances we had a priority match and found
> the tp instance that was created and linked by thread B. Now calling
> again into tp->ops->change() with that tp was successful and returned
> without error.
> 
> tp_created was never cleared in the second round, thus kernel thinks
> that we need to link it into the classifier chain (once again). tp and
> *back point to the same object due to the match we had earlier on. Thus
> for thread B's already public tp, we reset tp->next to tp itself and
> link it into the chain, which eventually causes the mentioned endless
> loop in tc_classify() once a packet hits the data path.
> 
> Fix is to clear tp_created at the beginning of each request, also when
> we replay it. On the paths that can cause -EAGAIN we already destroy
> the original tp instance we had and on replay we really need to start
> from scratch. It seems that this issue was first introduced in commit
> 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
> and avoid kernel panic when we use cls_cgroup").
> 
> Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps 
> chaining and avoid kernel panic when we use cls_cgroup")
> Reported-by: Shahar Klein 
> Signed-off-by: Daniel Borkmann 

Applied and queued up for -stable, thanks Daniel.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-24 Thread Daniel Borkmann

On 12/24/2016 08:34 AM, Cong Wang wrote:

On Thu, Dec 22, 2016 at 4:26 PM, Daniel Borkmann  wrote:

On 12/22/2016 08:05 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 1:07 PM, Daniel Borkmann 
wrote:


Ok, you mean for net. In that case I prefer the smaller sized fix to be
honest. It also covers everything from the point where we fetch the chain
via cops->tcf_chain() to the end of the function, which is where most of
the complexity resides, and only the two mentioned commits do the relock,


I really wish the problem is only about relocking, but look at the code,
the deeper reason why we have this bug is the complexity of the logic
inside tc_ctl_tfilter(): 1) the replay logic is hard, we have to make it
idempotent; 2) the request logic itself is hard, because of tc filter
design
and implementation.

This is why I worry more than just relocking.


But do you have a concrete 2nd issue/bug you're seeing? It rather sounds to
me your argument is more about fear of complexity on tc framework itself.
I agree it's complex, and tc_ctl_tfilter() is quite big in itself, where it
would be good to reduce it's complexity into smaller pieces. But it's not
really related to the fix itself, reducing complexity requires significantly
more and deeper work on the code. We can rework tc_ctl_tfilter() in net-next
to try to simplify it, sure, but I don't get why we have to discuss so much
on this matter in this context, really.


Thanks for ignoring my point 1) above... You are dragging the discussion
further.


I don't think so. The analysis and patch I proposed provides an explanation
of how we get into the seen endless loop, it provides a logical fix for it,
which has been reviewed by others and it has been tested extensively that it
resolves the issue, which was easily reproducible for the reporter and that
after the fix it never occurred again. The delta is absolutely simple and
really low risk. Given this function has not much changed over time, also
distros could pick it up that have a much older base kernel than current
stable ones. This initiated follow-up discussion we're having here in general
is dragging the focus away for everyone, and quite frankly I'm getting tired
of discussing it. I have stated my preferences, you have stated yours, and
we're only repeating ourselves in circles which isn't helpful in any way,
the discussion is not about some concrete bug in the logic to fix anymore
(otherwise please name it). Hence my proposal that everything else can wait
and be done in net-next.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-23 Thread Cong Wang
On Thu, Dec 22, 2016 at 4:26 PM, Daniel Borkmann  wrote:
> On 12/22/2016 08:05 PM, Cong Wang wrote:
>>
>> On Wed, Dec 21, 2016 at 1:07 PM, Daniel Borkmann 
>> wrote:
>>>
>>>
>>> Ok, you mean for net. In that case I prefer the smaller sized fix to be
>>> honest. It also covers everything from the point where we fetch the chain
>>> via cops->tcf_chain() to the end of the function, which is where most of
>>> the complexity resides, and only the two mentioned commits do the relock,
>>
>>
>> I really wish the problem is only about relocking, but look at the code,
>> the deeper reason why we have this bug is the complexity of the logic
>> inside tc_ctl_tfilter(): 1) the replay logic is hard, we have to make it
>> idempotent; 2) the request logic itself is hard, because of tc filter
>> design
>> and implementation.
>>
>> This is why I worry more than just relocking.
>
>
> But do you have a concrete 2nd issue/bug you're seeing? It rather sounds to
> me your argument is more about fear of complexity on tc framework itself.
> I agree it's complex, and tc_ctl_tfilter() is quite big in itself, where it
> would be good to reduce it's complexity into smaller pieces. But it's not
> really related to the fix itself, reducing complexity requires significantly
> more and deeper work on the code. We can rework tc_ctl_tfilter() in net-next
> to try to simplify it, sure, but I don't get why we have to discuss so much
> on this matter in this context, really.

Thanks for ignoring my point 1) above... You are dragging the discussion
further.


>
>>> so as a fix I think it's fine as-is. As mentioned, if there's need to
>>> refactor tc_ctl_tfilter() net-next would be better, imho.
>>
>>
>> Refactor is a too strong word, just moving the replay out is not a
>> refactor.
>> The end result will be still smaller than your commit d936377414fadbafb4,
>> which is already backported to stable.
>
>
> Commit d936377414fa is a whole different thing, and not related to the

Nope, you said "small", I show an example of non-small, perfectly fits
in the topic
and beats your argument w.r.t. patch size with your previous case.


> topic at all. The two-line changes with the initialization fix is quite
> straight forward and fixes a bug in the logic. It's also less complex in
> terms of lines but also in terms of complexity than to refactor or move the

Size doesn't tell everything, focus on the code:

1) With your current approach: we have to verify if 'tp_created' is really
correct in ALL cases including replay case; we also have to verify if any
other local variables are as correct as 'tp_created'.

2) With my proposed approach: replay case is much easier, compiler
does everything for us, the function itself is, and should be, good enough
for replaying, no need to track ANY local variables.

For me 2) is much better than 1). Don't look at size, look at the whole code
in a bigger picture.

> replay out. Moving it out contextually means that you also wouldn't rule
> out that things like nlmsg_parse(), or in-fact *any* of the
> __tc_ctl_tfilter()
> return paths could potentially return an error that suddenly would require
> a replay of the request. So, while moving it out fixes the bug, too, it also

The current replay loop already covers almost all of the function, so this
argument doesn't make sense.


> adds more potential points where you would go and replay the request where
> you didn't do so before and therefore also adds more complexity to the code
> that needs review/audit when fixing bugs since you don't have these
> hard/direct
> return paths anymore. Therefore I don't think it's better to go that way
> for the fix.

Please read the code again:

'goto replay' is located at line 383, tc_ctl_tfilter() ends at line 385

'replay' label is located at line 157, tc_ctl_tfilter() starts
(without local variables)
at line 153.

So, replay loop covers 227 lines of code, tc_ctl_tfilter() contains
233 lines of code,
therefore 97.4% of tc_ctl_tfilter() is the replay loop, moving it is
out is literately just
2.6%.

You call this refactor... Huh? Do your math please.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Daniel Borkmann

On 12/22/2016 08:05 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 1:07 PM, Daniel Borkmann  wrote:


Ok, you mean for net. In that case I prefer the smaller sized fix to be
honest. It also covers everything from the point where we fetch the chain
via cops->tcf_chain() to the end of the function, which is where most of
the complexity resides, and only the two mentioned commits do the relock,


I really wish the problem is only about relocking, but look at the code,
the deeper reason why we have this bug is the complexity of the logic
inside tc_ctl_tfilter(): 1) the replay logic is hard, we have to make it
idempotent; 2) the request logic itself is hard, because of tc filter design
and implementation.

This is why I worry more than just relocking.


But do you have a concrete 2nd issue/bug you're seeing? It rather sounds to
me your argument is more about fear of complexity on tc framework itself.
I agree it's complex, and tc_ctl_tfilter() is quite big in itself, where it
would be good to reduce it's complexity into smaller pieces. But it's not
really related to the fix itself, reducing complexity requires significantly
more and deeper work on the code. We can rework tc_ctl_tfilter() in net-next
to try to simplify it, sure, but I don't get why we have to discuss so much
on this matter in this context, really.


so as a fix I think it's fine as-is. As mentioned, if there's need to
refactor tc_ctl_tfilter() net-next would be better, imho.


Refactor is a too strong word, just moving the replay out is not a refactor.
The end result will be still smaller than your commit d936377414fadbafb4,
which is already backported to stable.


Commit d936377414fa is a whole different thing, and not related to the
topic at all. The two-line changes with the initialization fix is quite
straight forward and fixes a bug in the logic. It's also less complex in
terms of lines but also in terms of complexity than to refactor or move the
replay out. Moving it out contextually means that you also wouldn't rule
out that things like nlmsg_parse(), or in-fact *any* of the __tc_ctl_tfilter()
return paths could potentially return an error that suddenly would require
a replay of the request. So, while moving it out fixes the bug, too, it also
adds more potential points where you would go and replay the request where
you didn't do so before and therefore also adds more complexity to the code
that needs review/audit when fixing bugs since you don't have these hard/direct
return paths anymore. Therefore I don't think it's better to go that way
for the fix.

Thanks,
Daniel


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Daniel Borkmann

On 12/22/2016 06:50 PM, John Fastabend wrote:

On 16-12-22 08:53 AM, David Miller wrote:

From: Daniel Borkmann 
Date: Wed, 21 Dec 2016 22:07:48 +0100


Ok, you mean for net. In that case I prefer the smaller sized fix to
be honest. It also covers everything from the point where we fetch
the chain via cops->tcf_chain() to the end of the function, which is
where most of the complexity resides, and only the two mentioned
commits do the relock, so as a fix I think it's fine as-is. As
mentioned, if there's need to refactor tc_ctl_tfilter() net-next
would be better, imho.


Please, can you two work towards an agreement as to what fix should
go in at this time?

Thanks.


Thanks for fixing this!

I have a slight preference to push this patch into net as its been
tested already by Shahar and is not particularly invasive. Then for
net-next do a larger cleanup to get rid of some of the complexity per
Cong's suggestion.


My preference as well, thanks.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Daniel Borkmann

On 12/22/2016 02:16 PM, Shahar Klein wrote:

On 12/21/2016 7:04 PM, Daniel Borkmann wrote:

Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and 
avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein 
Signed-off-by: Daniel Borkmann 
Cc: Cong Wang 

[...]

I've tested this specific patch (without cong's patch and without the many 
prints) many times and in many test permutations today and it didn't lock

Thanks again Daniel!


Just catching up with email (traveling whole day), thanks a bunch for
your effort, Shahar! In that case lets then add:

Tested-by: Shahar Klein 


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Cong Wang
On Wed, Dec 21, 2016 at 1:07 PM, Daniel Borkmann  wrote:
>
> Ok, you mean for net. In that case I prefer the smaller sized fix to be
> honest. It also covers everything from the point where we fetch the chain
> via cops->tcf_chain() to the end of the function, which is where most of
> the complexity resides, and only the two mentioned commits do the relock,

I really wish the problem is only about relocking, but look at the code,
the deeper reason why we have this bug is the complexity of the logic
inside tc_ctl_tfilter(): 1) the replay logic is hard, we have to make it
idempotent; 2) the request logic itself is hard, because of tc filter design
and implementation.

This is why I worry more than just relocking.

> so as a fix I think it's fine as-is. As mentioned, if there's need to
> refactor tc_ctl_tfilter() net-next would be better, imho.

Refactor is a too strong word, just moving the replay out is not a refactor.
The end result will be still smaller than your commit d936377414fadbafb4,
which is already backported to stable.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread John Fastabend
On 16-12-22 08:53 AM, David Miller wrote:
> From: Daniel Borkmann 
> Date: Wed, 21 Dec 2016 22:07:48 +0100
> 
>> Ok, you mean for net. In that case I prefer the smaller sized fix to
>> be honest. It also covers everything from the point where we fetch
>> the chain via cops->tcf_chain() to the end of the function, which is
>> where most of the complexity resides, and only the two mentioned
>> commits do the relock, so as a fix I think it's fine as-is. As
>> mentioned, if there's need to refactor tc_ctl_tfilter() net-next
>> would be better, imho.
> 
> Please, can you two work towards an agreement as to what fix should
> go in at this time?
> 
> Thanks.
> 

Thanks for fixing this!

I have a slight preference to push this patch into net as its been
tested already by Shahar and is not particularly invasive. Then for
net-next do a larger cleanup to get rid of some of the complexity per
Cong's suggestion.

.John


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread David Miller
From: Daniel Borkmann 
Date: Wed, 21 Dec 2016 22:07:48 +0100

> Ok, you mean for net. In that case I prefer the smaller sized fix to
> be honest. It also covers everything from the point where we fetch
> the chain via cops->tcf_chain() to the end of the function, which is
> where most of the complexity resides, and only the two mentioned
> commits do the relock, so as a fix I think it's fine as-is. As
> mentioned, if there's need to refactor tc_ctl_tfilter() net-next
> would be better, imho.

Please, can you two work towards an agreement as to what fix should
go in at this time?

Thanks.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-22 Thread Shahar Klein



On 12/21/2016 7:04 PM, Daniel Borkmann wrote:

Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and 
avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein 
Signed-off-by: Daniel Borkmann 
Cc: Cong Wang 
---
 Shahar, you mentioned you wanted to run again later without the debug
 printk's. Once you do that and come to the same result again, please
 feel free to reply with a Tested-by or such. Thanks everyone!

 net/sched/cls_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..1ecdf80 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n)
unsigned long cl;
unsigned long fh;
int err;
-   int tp_created = 0;
+   int tp_created;

if ((n->nlmsg_type != RTM_GETTFILTER) &&
!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;

 replay:
+   tp_created = 0;
+
err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
if (err < 0)
return err;



I've tested this specific patch (without cong's patch and without the 
many prints) many times and in many test permutations today and it 
didn't lock


Thanks again Daniel!

Shahar


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Daniel Borkmann

On 12/21/2016 09:47 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 12:02 PM, Daniel Borkmann  wrote:

On 12/21/2016 08:10 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 10:51 AM, Cong Wang 
wrote:

On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann 
wrote:


What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into
tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.


Excellent catch!

But why do we have to replay the request here? Shouldn't we just return
EAGAIN to user-space and let user-space decide to retry or not?
Replaying is the root of the evil here.


Answer myself: probably due to historical reasons, but still replaying
inside such a big function is just error-prone, we should promote it
out:


Have no strong opinion, I guess it could be done as a simplification
for net-next, why not, along with moving out the netlink_ns_capable()
check or possibly other things after careful analysis that don't need
to be redone in that circumstance.


It is only slightly bigger than your current one so could fit for -stable too.
Also, it could fix all potential problems like this one. Let compiler do the
work, not human. ;)


Ok, you mean for net. In that case I prefer the smaller sized fix to be
honest. It also covers everything from the point where we fetch the chain
via cops->tcf_chain() to the end of the function, which is where most of
the complexity resides, and only the two mentioned commits do the relock,
so as a fix I think it's fine as-is. As mentioned, if there's need to
refactor tc_ctl_tfilter() net-next would be better, imho.

Thanks,
Daniel


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Cong Wang
On Wed, Dec 21, 2016 at 12:02 PM, Daniel Borkmann  wrote:
> On 12/21/2016 08:10 PM, Cong Wang wrote:
>>
>> On Wed, Dec 21, 2016 at 10:51 AM, Cong Wang 
>> wrote:
>>>
>>> On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann 
>>> wrote:

 What happens is that in tc_ctl_tfilter(), thread A allocates a new
 tp, initializes it, sets tp_created to 1, and calls into
 tp->ops->change()
 with it. In that classifier callback we had to unlock/lock the rtnl
 mutex and returned with -EAGAIN. One reason why we need to drop there
 is, for example, that we need to request an action module to be loaded.
>>>
>>>
>>> Excellent catch!
>>>
>>> But why do we have to replay the request here? Shouldn't we just return
>>> EAGAIN to user-space and let user-space decide to retry or not?
>>> Replaying is the root of the evil here.
>>
>>
>> Answer myself: probably due to historical reasons, but still replaying
>> inside such a big function is just error-prone, we should promote it
>> out:
>
>
> Have no strong opinion, I guess it could be done as a simplification
> for net-next, why not, along with moving out the netlink_ns_capable()
> check or possibly other things after careful analysis that don't need
> to be redone in that circumstance.

It is only slightly bigger than your current one so could fit for -stable too.
Also, it could fix all potential problems like this one. Let compiler do the
work, not human. ;)


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Daniel Borkmann

On 12/21/2016 08:10 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 10:51 AM, Cong Wang  wrote:

On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann  wrote:

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.


Excellent catch!

But why do we have to replay the request here? Shouldn't we just return
EAGAIN to user-space and let user-space decide to retry or not?
Replaying is the root of the evil here.


Answer myself: probably due to historical reasons, but still replaying
inside such a big function is just error-prone, we should promote it
out:


Have no strong opinion, I guess it could be done as a simplification
for net-next, why not, along with moving out the netlink_ns_capable()
check or possibly other things after careful analysis that don't need
to be redone in that circumstance.

Thanks,
Daniel


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Cong Wang
On Wed, Dec 21, 2016 at 10:51 AM, Cong Wang  wrote:
> On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann  wrote:
>> What happens is that in tc_ctl_tfilter(), thread A allocates a new
>> tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
>> with it. In that classifier callback we had to unlock/lock the rtnl
>> mutex and returned with -EAGAIN. One reason why we need to drop there
>> is, for example, that we need to request an action module to be loaded.
>
> Excellent catch!
>
> But why do we have to replay the request here? Shouldn't we just return
> EAGAIN to user-space and let user-space decide to retry or not?
> Replaying is the root of the evil here.

Answer myself: probably due to historical reasons, but still replaying
inside such a big function is just error-prone, we should promote it
out:

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..7d5b42b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -129,7 +129,7 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)

 /* Add/change/delete/get a filter node */

-static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
+static int __tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 {
struct net *net = sock_net(skb->sk);
struct nlattr *tca[TCA_MAX + 1];
@@ -154,7 +154,6 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;

-replay:
err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
if (err < 0)
return err;
@@ -378,12 +377,19 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
 errout:
if (cl)
cops->put(q, cl);
-   if (err == -EAGAIN)
-   /* Replay the request. */
-   goto replay;
return err;
 }

+static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
+{
+   int ret;
+replay:
+   ret = __tc_ctl_tfilter(skb, n);
+   if (ret == -EAGAIN)
+   goto replay;
+   return ret;
+}
+
 static int tcf_fill_node(struct net *net, struct sk_buff *skb,
 struct tcf_proto *tp, unsigned long fh, u32 portid,
 u32 seq, u16 flags, int event)


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Daniel Borkmann

On 12/21/2016 07:51 PM, Cong Wang wrote:

On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann  wrote:

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.


Excellent catch!

But why do we have to replay the request here? Shouldn't we just return
EAGAIN to user-space and let user-space decide to retry or not?
Replaying is the root of the evil here.


Right, this behavior is already pretty old (2005), see history
tree 8d7c694553dc ("[PKT_SCHED]: act_api.c: drop rtnl for loading
modules") and 437293de63d8 ("[PKT_SCHED]: cls_api.c: drop rtnl
for loading modules"), some binaries could rely on that behavior
in one way or another I'd presume.


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Cong Wang
On Wed, Dec 21, 2016 at 9:04 AM, Daniel Borkmann  wrote:
> What happens is that in tc_ctl_tfilter(), thread A allocates a new
> tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
> with it. In that classifier callback we had to unlock/lock the rtnl
> mutex and returned with -EAGAIN. One reason why we need to drop there
> is, for example, that we need to request an action module to be loaded.

Excellent catch!

But why do we have to replay the request here? Shouldn't we just return
EAGAIN to user-space and let user-space decide to retry or not?
Replaying is the root of the evil here.

I mean:

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..a21f7d66 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -154,7 +154,6 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;

-replay:
err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
if (err < 0)
return err;
@@ -278,8 +277,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
tp_ops = tcf_proto_lookup_ops(kind);
/* We dropped the RTNL semaphore in order to
 * perform the module load.  So, even if we
-* succeeded in loading the module we have to
-* replay the request.  We indicate this using
+* succeeded in loading the module we return
 * -EAGAIN.
 */
if (tp_ops != NULL) {
@@ -378,9 +376,6 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
struct nlmsghdr *n)
 errout:
if (cl)
cops->put(q, cl);
-   if (err == -EAGAIN)
-   /* Replay the request. */
-   goto replay;
return err;
 }


Re: [PATCH net] net, sched: fix soft lockup in tc_classify

2016-12-21 Thread Eric Dumazet
On Wed, 2016-12-21 at 18:04 +0100, Daniel Borkmann wrote:
> Shahar reported a soft lockup in tc_classify(), where we run into an
> endless loop when walking the classifier chain due to tp->next == tp
> which is a state we should never run into. The issue only seems to
> trigger under load in the tc control path.

Nice fix, thanks !

Acked-by: Eric Dumazet