Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-19 Thread Tobias Hommel
> After running for about 24 hours, I now encountered another panic. This time it
> is caused by an out of memory situation. Although the trace shows action in the
> filesystem code I'm posting it here because I cannot isolate the error and
> maybe it is caused by our NULL pointer bug or by the new fix.
> I do not have a serial console attached, so I could only attach a screenshot of
> the panic to this mail.
> 
> I am running v4.19-rc3 from git with the above mentioned patch applied.
> After 19 hours everything still looked fine, XfrmFwdHdrError value was at ~950.
> Overall memory usage shown by htop was at 1.2G/15.6G.
> I had htop running via ssh so I was able to see at least some status post
> mortem. Uptime: 23:50:57
> Overall memory usage was at 10.2G/15.6G and user processes were just
> using the usual amount of memory, so it looks like the kernel was eating up at
> least 9G of RAM.
> 
> Maybe this information is not very helpful for debugging, but it is at least a
> warning that something might still be wrong.
> 
> I'll try to gather some more information and keep you updated.

The machine has now been running stable under load for more than 5 days, and I
was not able to reproduce that OOM situation. I'll leave it at that; the fix for
the initial bug works fine for me.


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-12 Thread Tobias Hommel
On Wed, Sep 12, 2018 at 10:50:46AM +0200, Steffen Klassert wrote:
> On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote:
> > > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > > > clears the dst_entry.
> > > > 
> > > > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > > > skb_dst_force() might clear the dst_entry attached to the skb.
> > > > The xfrm code doesn't expect this to happen, so we crash with
> > > > a NULL pointer dereference in this case. Fix it by checking
> > > > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > > > in case the dst_entry was cleared.
> > > > 
> > > > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > > > Reported-by: Tobias Hommel 
> > > > Reported-by: Kristian Evensen 
> > > > Reported-by: Wolfgang Walter 
> > > > Signed-off-by: Steffen Klassert 
> > > > ---
> > > 
> > > This patch fixes the problem here.
> > > 
> > > XfrmFwdHdrError climbs to around 80 at the very beginning and then stays there.
> > > Probably this happens when some routes are changed/set at that time.
> > > 
> > > Regards and thanks,
> > 
> > Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220.
> > This is less than 1 lost packet per minute, which seems to be okay for now.
> 
> Thanks a lot for testing! This is now applied to the ipsec tree.

After running for about 24 hours, I now encountered another panic. This time it
is caused by an out of memory situation. Although the trace shows action in the
filesystem code I'm posting it here because I cannot isolate the error and
maybe it is caused by our NULL pointer bug or by the new fix.
I do not have a serial console attached, so I could only attach a screenshot of
the panic to this mail.

I am running v4.19-rc3 from git with the above mentioned patch applied.
After 19 hours everything still looked fine, XfrmFwdHdrError value was at ~950.
Overall memory usage shown by htop was at 1.2G/15.6G.
I had htop running via ssh so I was able to see at least some status post
mortem. Uptime: 23:50:57
Overall memory usage was at 10.2G/15.6G and user processes were just
using the usual amount of memory, so it looks like the kernel was eating up at
least 9G of RAM.

Maybe this information is not very helpful for debugging, but it is at least a
warning that something might still be wrong.

I'll try to gather some more information and keep you updated.


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-12 Thread Steffen Klassert
On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote:
> > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > > clears the dst_entry.
> > > 
> > > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > > skb_dst_force() might clear the dst_entry attached to the skb.
> > > The xfrm code doesn't expect this to happen, so we crash with
> > > a NULL pointer dereference in this case. Fix it by checking
> > > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > > in case the dst_entry was cleared.
> > > 
> > > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > > Reported-by: Tobias Hommel 
> > > Reported-by: Kristian Evensen 
> > > Reported-by: Wolfgang Walter 
> > > Signed-off-by: Steffen Klassert 
> > > ---
> > 
> > This patch fixes the problem here.
> > 
> > XfrmFwdHdrError climbs to around 80 at the very beginning and then stays there.
> > Probably this happens when some routes are changed/set at that time.
> > 
> > Regards and thanks,
> 
> Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220.
> This is less than 1 lost packet per minute, which seems to be okay for now.

Thanks a lot for testing! This is now applied to the ipsec tree.


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-11 Thread Tobias Hommel
> > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > clears the dst_entry.
> > 
> > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > skb_dst_force() might clear the dst_entry attached to the skb.
> > The xfrm code doesn't expect this to happen, so we crash with
> > a NULL pointer dereference in this case. Fix it by checking
> > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > in case the dst_entry was cleared.
> > 
> > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > Reported-by: Tobias Hommel 
> > Reported-by: Kristian Evensen 
> > Reported-by: Wolfgang Walter 
> > Signed-off-by: Steffen Klassert 
> > ---
> >  net/xfrm/xfrm_output.c | 4 ++++
> >  net/xfrm/xfrm_policy.c | 4 ++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > index 89b178a78dc7..36d15a38ce5e 100644
> > --- a/net/xfrm/xfrm_output.c
> > +++ b/net/xfrm/xfrm_output.c
> > @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
> >  		spin_unlock_bh(&x->lock);
> >  
> >  		skb_dst_force(skb);
> > +		if (!skb_dst(skb)) {
> > +			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> > +			goto error_nolock;
> > +		}
> >  
> >  		if (xfrm_offload(skb)) {
> >  			x->type_offload->encap(x, skb);
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 7c5e8978aeaa..626e0f4d1749 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
> >  	}
> >  
> >  	skb_dst_force(skb);
> > +	if (!skb_dst(skb)) {
> > +		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> > +		return 0;
> > +	}
> >  
> >  	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
> >  	if (IS_ERR(dst)) {
> 
> This patch fixes the problem here.
> 
> XfrmFwdHdrError climbs to around 80 at the very beginning and then stays there.
> Probably this happens when some routes are changed/set at that time.
> 
> Regards and thanks,

Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220.
This is less than 1 lost packet per minute, which seems to be okay for now.
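
As a rough cross-check of the rate quoted above, here is a minimal userspace sketch in C. It assumes the counter is read from text shaped like /proc/net/xfrm_stat (one "Name value" pair per line); the helper names xfrm_stat_lookup() and drops_per_minute() are invented for this illustration. Since XfrmFwdHdrError is also incremented when xfrm_decode_session() fails, the result is only an upper bound on the packets dropped by the new NULL check.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text shaped like /proc/net/xfrm_stat.
 * Assumed format: "Name<whitespace>value", one pair per line.
 * Returns the value, or -1 if the name is not present.
 */
static long xfrm_stat_lookup(const char *text, const char *name)
{
	const char *p = text;

	while (*p) {
		char key[64];
		long val;

		/* Parse the pair at the start of the current line. */
		if (sscanf(p, "%63s %ld", key, &val) == 2 &&
		    strcmp(key, name) == 0)
			return val;

		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (!p)
			break;
		p++;
	}
	return -1;
}

/* Average number of counted errors per minute over the given uptime. */
static double drops_per_minute(long count, double hours)
{
	return (double)count / (hours * 60.0);
}
```

With 220 errors after 6 hours, drops_per_minute(220, 6.0) gives roughly 0.61; the later reading of ~950 after 19 hours gives roughly 0.83. Both stay below one per minute.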


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-11 Thread Wolfgang Walter
On Tuesday, 11 September 2018, 12:33:34, Steffen Klassert wrote:
> On Mon, Sep 10, 2018 at 10:18:47AM +0200, Kristian Evensen wrote:
> > Hi,
> > 
> > Thanks everyone for all the effort in debugging this issue.
> > 
> > On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert wrote:
> > > The easy fix that could be backported to stable would be
> > > to check skb->dst for NULL and drop the packet in that case.
> > 
> > Thought I should just chime in and say that we deployed this
> > work-around when we started observing the error back in June. Since
> > then we have not seen any crashes. Also, we have instrumented some of
> > our kernels to count the number of times the error is hit (overall +
> > consecutive). Compared to the overall number of packets, the error
> > happens very rarely. With our workloads, we on average see the error
> > once every couple of days.
> 
> Thanks for letting us know!
> 
> I plan to fix this in the ipsec tree with:
> 
> Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> clears the dst_entry.
> 
> Since commit 222d7dbd258d ("net: prevent dst uses after free")
> skb_dst_force() might clear the dst_entry attached to the skb.
> The xfrm code doesn't expect this to happen, so we crash with
> a NULL pointer dereference in this case. Fix it by checking
> skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> in case the dst_entry was cleared.
> 
> Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> Reported-by: Tobias Hommel 
> Reported-by: Kristian Evensen 
> Reported-by: Wolfgang Walter 
> Signed-off-by: Steffen Klassert 
> ---
>  net/xfrm/xfrm_output.c | 4 ++++
>  net/xfrm/xfrm_policy.c | 4 ++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index 89b178a78dc7..36d15a38ce5e 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
>  		spin_unlock_bh(&x->lock);
>  
>  		skb_dst_force(skb);
> +		if (!skb_dst(skb)) {
> +			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> +			goto error_nolock;
> +		}
>  
>  		if (xfrm_offload(skb)) {
>  			x->type_offload->encap(x, skb);
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 7c5e8978aeaa..626e0f4d1749 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
>  	}
>  
>  	skb_dst_force(skb);
> +	if (!skb_dst(skb)) {
> +		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> +		return 0;
> +	}
>  
>  	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
>  	if (IS_ERR(dst)) {

This patch fixes the problem here.

XfrmFwdHdrError climbs to around 80 at the very beginning and then stays there.
Probably this happens when some routes are changed/set at that time.

Regards and thanks,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-11 Thread Steffen Klassert
On Mon, Sep 10, 2018 at 10:18:47AM +0200, Kristian Evensen wrote:
> Hi,
> 
> Thanks everyone for all the effort in debugging this issue.
> 
> On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert wrote:
> > The easy fix that could be backported to stable would be
> > to check skb->dst for NULL and drop the packet in that case.
> 
> Thought I should just chime in and say that we deployed this
> work-around when we started observing the error back in June. Since
> then we have not seen any crashes. Also, we have instrumented some of
> our kernels to count the number of times the error is hit (overall +
> consecutive). Compared to the overall number of packets, the error
> happens very rarely. With our workloads, we on average see the error
> once every couple of days.

Thanks for letting us know!

I plan to fix this in the ipsec tree with:

Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
clears the dst_entry.

Since commit 222d7dbd258d ("net: prevent dst uses after free")
skb_dst_force() might clear the dst_entry attached to the skb.
The xfrm code doesn't expect this to happen, so we crash with
a NULL pointer dereference in this case. Fix it by checking
skb_dst(skb) for NULL after skb_dst_force() and drop the packet
in case the dst_entry was cleared.

Fixes: 222d7dbd258d ("net: prevent dst uses after free")
Reported-by: Tobias Hommel 
Reported-by: Kristian Evensen 
Reported-by: Wolfgang Walter 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_output.c | 4 ++++
 net/xfrm/xfrm_policy.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..36d15a38ce5e 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 		spin_unlock_bh(&x->lock);
 
 		skb_dst_force(skb);
+		if (!skb_dst(skb)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+			goto error_nolock;
+		}
 
 		if (xfrm_offload(skb)) {
 			x->type_offload->encap(x, skb);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7c5e8978aeaa..626e0f4d1749 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
 	}
 
 	skb_dst_force(skb);
+	if (!skb_dst(skb)) {
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
+		return 0;
+	}
 
 	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
 	if (IS_ERR(dst)) {
-- 
2.17.1



Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-10 Thread Wolfgang Walter
On Monday, 10 September 2018, 10:18:47, Kristian Evensen wrote:
> Hi,
> 
> Thanks everyone for all the effort in debugging this issue.
> 
> On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert wrote:
> > The easy fix that could be backported to stable would be
> > to check skb->dst for NULL and drop the packet in that case.
> 
> Thought I should just chime in and say that we deployed this
> work-around when we started observing the error back in June. Since
> then we have not seen any crashes. Also, we have instrumented some of
> our kernels to count the number of times the error is hit (overall +
> consecutive). Compared to the overall number of packets, the error
> happens very rarely. With our workloads, we on average see the error
> once every couple of days.
> 

Would you mind sending us your patch (with the accounting) so that we can check
how often that happens here?

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-10 Thread Tobias Hommel
On Mon, Sep 10, 2018 at 08:37:39AM +0200, Steffen Klassert wrote:
...
> The other thing I wonder about is why Tobias bisected this to
> 
> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
> ipv4: mark DST_NOGC and remove the operation of dst_free()
> 
> from 'Jun 17 2017' and not to
> 
> commit 222d7dbd258dad4cd5241c43ef818141fad5a87a
> net: prevent dst uses after free
> 
> from 'Sep 21 2017'.
> 
> Maybe Tobias has seen two bugs. Before
> ("net: prevent dst uses after free"), it was the
> use after free, and after this fix it was a NULL
> pointer dereference of skb->dst.
> 
Uhm, yeah, I checked back, we actually had different bugs. My mistake, sorry
for the confusion.


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-10 Thread Kristian Evensen
Hi,

Thanks everyone for all the effort in debugging this issue.

On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert wrote:
> The easy fix that could be backported to stable would be
> to check skb->dst for NULL and drop the packet in that case.

Thought I should just chime in and say that we deployed this
work-around when we started observing the error back in June. Since
then we have not seen any crashes. Also, we have instrumented some of
our kernels to count the number of times the error is hit (overall +
consecutive). Compared to the overall number of packets, the error
happens very rarely. With our workloads, we on average see the error
once every couple of days.

BR,
Kristian
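
The instrumentation mentioned in this message is not shown anywhere in the thread. As a purely hypothetical sketch of what counting "overall + consecutive" hits could look like, here is a small userspace model in C. The struct and function names are invented for illustration, and a real kernel version would need per-CPU or atomic counters rather than plain fields.

```c
/* Hypothetical bookkeeping for how often the NULL dst case is hit. */
struct hit_stats {
	unsigned long total;		/* every packet that hit the NULL dst case */
	unsigned long cur_run;		/* length of the current consecutive run */
	unsigned long longest_run;	/* longest consecutive run seen so far */
};

/* Call once per forwarded packet; dst_was_null is non-zero on a hit. */
static void record_packet(struct hit_stats *s, int dst_was_null)
{
	if (!dst_was_null) {
		/* A good packet ends any run of consecutive hits. */
		s->cur_run = 0;
		return;
	}

	s->total++;
	s->cur_run++;
	if (s->cur_run > s->longest_run)
		s->longest_run = s->cur_run;
}
```

Tracking the longest run separately from the total helps distinguish isolated hits from bursts, for example around a route change.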


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-10 Thread Steffen Klassert
On Fri, Sep 07, 2018 at 11:10:55PM +0200, Wolfgang Walter wrote:
> Hello Steffen,
> 
> in one of your emails to Thomas you wrote:
> > xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we
> > find:
> > 
> > u16 family = dst_orig->ops->family;
> > 
> > ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like
> > dst_orig is NULL.
> > 
> > In the forwarding case, we get dst_orig from the skb and dst_orig
> > can't be NULL here unless the skb itself is already fishy.
> 
> Is this really true?
> 
> If xfrm_lookup is called from 
> 
> __xfrm_route_forward():
> 
> int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
> {
> 	struct net *net = dev_net(skb->dev);
> 	struct flowi fl;
> 	struct dst_entry *dst;
> 	int res = 1;
> 
> 	if (xfrm_decode_session(skb, &fl, family) < 0) {
> 		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> 		return 0;
> 	}
> 
> 	skb_dst_force(skb);
> 
> 	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
> 	if (IS_ERR(dst)) {
> 		res = 0;
> 		dst = NULL;
> 	}
> 	skb_dst_set(skb, dst);
> 	return res;
> }
> 
> couldn't it be possible that skb_dst_force(skb) actually sets dst to NULL if 
> it cannot safely lock it? If it is absolutely sure that skb_dst_force() never 
> can set dst to NULL I wonder why it is called at all?

Ugh, skb_dst_force apparently changed since I looked at it last time.
I did not expect that it can clear skb->dst. This behaviour was
introduced with:

commit 222d7dbd258dad4cd5241c43ef818141fad5a87a
net: prevent dst uses after free

from Eric Dumazet (added to Cc).

The easy fix that could be backported to stable would be
to check skb->dst for NULL and drop the packet in that case.

I wonder if we can do better here. We can still use the
dst_entry as long as we don't exit the RCU grace period.
But looking deeper into it, the crypto layer might return
asynchronously. In this case, we exit the RCU grace period
and we have to drop the packet anyway.

If I understand correctly, the bug happens rarely. So maybe
we could just stay with the easy fix (I'll do a patch today).

The other thing I wonder about is why Tobias bisected this to

commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
ipv4: mark DST_NOGC and remove the operation of dst_free()

from 'Jun 17 2017' and not to

commit 222d7dbd258dad4cd5241c43ef818141fad5a87a
net: prevent dst uses after free

from 'Sep 21 2017'.

Maybe Tobias has seen two bugs. Before
("net: prevent dst uses after free"), it was the
use after free, and after this fix it was a NULL
pointer dereference of skb->dst.
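
The shape of the "easy fix" described in this message (check for a NULL dst, count the event, drop the packet) can be modelled in plain userspace C. This is only a sketch under stated assumptions: toy_pkt, toy_stats, toy_lookup() and toy_route_forward() are invented stand-ins, not kernel types or functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb; NULL models a cleared dst_entry. */
struct toy_pkt {
	const void *dst;
};

/* Hypothetical stand-in for the XFRM MIB counters. */
struct toy_stats {
	unsigned long fwd_hdr_error;
};

/* Placeholder for the lookup step, which dereferences dst unconditionally. */
static int toy_lookup(const void *dst)
{
	return *(const int *)dst;
}

/* Returns 1 if the packet may be forwarded, 0 if it has to be dropped. */
static int toy_route_forward(struct toy_pkt *pkt, struct toy_stats *stats,
			     int *family)
{
	/* The guard: without it, toy_lookup() would dereference NULL. */
	if (pkt->dst == NULL) {
		stats->fwd_hdr_error++;
		return 0;
	}

	*family = toy_lookup(pkt->dst);
	return 1;
}
```

The trade-off is the one discussed above: an affected packet is lost, but the loss is counted and the forwarding path keeps running.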



Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-07 Thread Wolfgang Walter
Hello Steffen,

in one of your emails to Thomas you wrote:
> xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we
> find:
> 
> u16 family = dst_orig->ops->family;
> 
> ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like
> dst_orig is NULL.
> 
> In the forwarding case, we get dst_orig from the skb and dst_orig
> can't be NULL here unless the skb itself is already fishy.

Is this really true?

If xfrm_lookup is called from 

__xfrm_route_forward():

int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
{
	struct net *net = dev_net(skb->dev);
	struct flowi fl;
	struct dst_entry *dst;
	int res = 1;

	if (xfrm_decode_session(skb, &fl, family) < 0) {
		XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
		return 0;
	}

	skb_dst_force(skb);

	dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
	if (IS_ERR(dst)) {
		res = 0;
		dst = NULL;
	}
	skb_dst_set(skb, dst);
	return res;
}

couldn't it be possible that skb_dst_force(skb) actually sets dst to NULL if 
it cannot safely lock it? If it is absolutely sure that skb_dst_force() never 
can set dst to NULL I wonder why it is called at all?


Here is skb_dst_force():

static inline void skb_dst_force(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		struct dst_entry *dst = skb_dst(skb);

		WARN_ON(!rcu_read_lock_held());
		if (!dst_hold_safe(dst))
			dst = NULL;

		skb->_skb_refdst = (unsigned long)dst;
	}
}

and dst_hold_safe() is

static inline bool dst_hold_safe(struct dst_entry *dst)
{
	return atomic_inc_not_zero(&dst->__refcnt);
}
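
To illustrate why an "increment unless zero" reference grab can leave the caller with a NULL pointer, here is a minimal userspace sketch using C11 atomics. It is not the kernel implementation: toy_dst, toy_hold_safe() and toy_force() are invented names, and the compare-and-swap loop is just one common way to get the same semantics.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dst_entry; only the refcount matters. */
struct toy_dst {
	atomic_int refcnt;
};

/* Take a reference only if the object is still live (refcnt != 0). */
static bool toy_hold_safe(struct toy_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old != 0) {
		/* On failure, 'old' is updated to the current value. */
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
	}
	return false;	/* already dropped to zero: must not be revived */
}

/* Same shape as skb_dst_force(): NULL if no reference could be taken. */
static struct toy_dst *toy_force(struct toy_dst *dst)
{
	return toy_hold_safe(dst) ? dst : NULL;
}
```

An object whose count has already reached zero is on its way to being freed, so refusing the reference is the safe choice. The consequence is that every caller of the "force" step has to cope with a NULL result.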



On Friday, 7 September 2018, 22:22:39, Wolfgang Walter wrote:
> On Friday, 31 August 2018, 08:50:24, Steffen Klassert wrote:
> > On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> > > Hello,
> > > 
> > > kernels > 4.12 do not work on one of our main routers. They crash as soon
> > > as ipsec-tunnels are configured and ipsec-traffic actually flows.
> > 
> > Can you please send the backtrace of this crash?
> 
> I booted b838d5e1c5b6e57b10ec8af2268824041e3ea911 several times but I
> could not record the complete trace. I think I have to log to the serial
> console but I can't do that before next week.
> 
> 
> What I could record is:
> 
> There is always
> ...
> the call trace.
> 
> This is the part I could see:
> 
> 
> irq_exit+0x71/0x80
> do_IRQ+0x4d/0xd0
> common_interrupt+0x7a/0x7a
> 
> RIP: 010:cpuidle_enter_state+0x11d/0x200
> RSP: 0018:c9000321bee0 EFLAGS: 0282 ORIG_RAX: ffc4
> RAX: 88085efde450 RBX: 0004 RCX: 0003c9e63c13
> RDX: 0003c9e63c13 RSI: ffb03103fe35ac43 RDI: 
> RBP: e87cf600 R08: 000c R09: 0004
> R10: 0400 R11: 0003c99e56fc R12: 0003c9e63c13
> R13: 0003c9da9567 R14: 0004 R15: 822763e0
> do_idle+0xd3/0x160
> cpu_startup_entry+0x14/0x20
> secondary_startup_64+0xa5/0xb0
> Code: 00 0f b7 83 c0 00 00 00 80 7c 02 08 01 0f 86 d3 02 00 00 41
> 8b 8c 24 3c 10 00 00 48 8b 6b 58 85 c9 0f 84 2f 01 00 00 48 83 e5 fe  45
> 60
> 02 0f 84 4e 01 00 00 f6 43 38 01 74 0d 80 00 bd ab 00 00
> RIP: ip_forward+0xd4/0x470 RSP: 88085efc3cb0
> CR2: 0060
> [ end trace 7205b53c25b7b35a ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> Rebooting in 60 seconds..
> 
> 
> I got an email from Tobias Hommel and I think it is the same problem.
> 
> It is very clear that it is the difference from
> 
>   ipv4: call dst_hold_safe() properly
> 
> to
> 
>   ipv4: mark DST_NOGC and remove the operation of dst_free()
> 
> which triggers this bug.
> 
> Regards,

Regards
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-07 Thread Wolfgang Walter
On Friday, 31 August 2018, 08:50:24, Steffen Klassert wrote:
> On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> > Hello,
> > 
> > kernels > 4.12 do not work on one of our main routers. They crash as soon
> > as ipsec-tunnels are configured and ipsec-traffic actually flows.
> 
> Can you please send the backtrace of this crash?
> 

I booted b838d5e1c5b6e57b10ec8af2268824041e3ea911 several times but I
could not record the complete trace. I think I have to log to the serial
console but I can't do that before next week.


What I could record is:

There is always
 ...
the call trace.

This is the part I could see:


irq_exit+0x71/0x80
do_IRQ+0x4d/0xd0
common_interrupt+0x7a/0x7a

RIP: 010:cpuidle_enter_state+0x11d/0x200
RSP: 0018:c9000321bee0 EFLAGS: 0282 ORIG_RAX: ffc4
RAX: 88085efde450 RBX: 0004 RCX: 0003c9e63c13
RDX: 0003c9e63c13 RSI: ffb03103fe35ac43 RDI: 
RBP: e87cf600 R08: 000c R09: 0004
R10: 0400 R11: 0003c99e56fc R12: 0003c9e63c13
R13: 0003c9da9567 R14: 0004 R15: 822763e0
do_idle+0xd3/0x160
cpu_startup_entry+0x14/0x20
secondary_startup_64+0xa5/0xb0
Code: 00 0f b7 83 c0 00 00 00 80 7c 02 08 01 0f 86 d3 02 00 00 41
8b 8c 24 3c 10 00 00 48 8b 6b 58 85 c9 0f 84 2f 01 00 00 48 83 e5 fe  45 
60
02 0f 84 4e 01 00 00 f6 43 38 01 74 0d 80 00 bd ab 00 00
RIP: ip_forward+0xd4/0x470 RSP: 88085efc3cb0
CR2: 0060
[ end trace 7205b53c25b7b35a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 60 seconds..


I got an email from Tobias Hommel and I think it is the same problem.

It is very clear that it is the difference from

ipv4: call dst_hold_safe() properly

to

ipv4: mark DST_NOGC and remove the operation of dst_free()

which triggers this bug.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-07 Thread Wolfgang Walter
Hello,

didn't respond as I've been on vacation.

On Friday, 31 August 2018, 08:50:24, Steffen Klassert wrote:
> On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> > Hello,
> > 
> > kernels > 4.12 do not work on one of our main routers. They crash as soon
> > as ipsec-tunnels are configured and ipsec-traffic actually flows.
> 
> Can you please send the backtrace of this crash?
> 

I'll try today. The oops quickly disappears because other problems arising
from it pop up. The machine crashes and nothing is written to the logs. I'll
try to take a photo or to log to the serial console.

At the moment I only see that there is xfrm_ stuff in the call trace, such as
xfrm_lookup and xfrm_route_, and that it happens while routing a packet.

With later kernels (4.18.5) the machine seems to crash without a call trace on 
console.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-08-31 Thread Steffen Klassert
On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> Hello,
> 
> kernels > 4.12 do not work on one of our main routers. They crash as soon
> as ipsec-tunnels are configured and ipsec-traffic actually flows.

Can you please send the backtrace of this crash?

Thanks!


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-08-30 Thread Wolfgang Walter
Hello,

kernels > 4.12 do not work on one of our main routers. They crash as soon
as ipsec-tunnels are configured and ipsec-traffic actually flows.
 
Just configuring ipsec (that is starting strongswan) does not trigger the
oops.
 
I finally found time to bisect that. It bisected down to

b838d5e1c5b6e57b10ec8af2268824041e3ea911
ipv4: mark DST_NOGC and remove the operation of dst_free()

Now we have other machines which run just fine with the very same kernels
doing ipsec. They differ in that they have far fewer cores, do not use
the ixgbe driver, do not have 10G and terminate only a few tunnels instead
of hundreds.

I already tested distribution kernels > 4.12 from debian, they also crash.

All kernels I built during the bisection ran fine as long as I didn't use ipsec.
The bad ones all oopsed/crashed exactly like the vanilla 4.14 described above.


Here is the bisect-log:

# bad: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect start 'v4.14' 'v4.9'
# good: [d82dd0e34d0347be201fd274dc84cd645dccc064] raid1: prefer disk without bad blocks
git bisect good d82dd0e34d0347be201fd274dc84cd645dccc064
# bad: [9967468c0a109644e4a1f5b39b39bf86fe7507a7] Merge branch 'akpm' (patches from Andrew)
git bisect bad 9967468c0a109644e4a1f5b39b39bf86fe7507a7
# bad: [17d9aa66b08de445645bd0688fc1635bed77a57b] Merge tag 'iwlwifi-next-for-kalle-2017-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next
git bisect bad 17d9aa66b08de445645bd0688fc1635bed77a57b
# good: [de4d195308ad589626571dbe5789cebf9695a204] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good de4d195308ad589626571dbe5789cebf9695a204
# good: [9376906c17fa975bf6a7ea9dd124be697bcda289] Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9376906c17fa975bf6a7ea9dd124be697bcda289
# good: [40e86a3619a1e84ad73c716c943f65fc38eb1e28] iwlwifi: mvm: use scnprintf() instead of snprintf()
git bisect good 40e86a3619a1e84ad73c716c943f65fc38eb1e28
# bad: [c66f2091c9248ddf42504c74cd327ae8619b04a4] net/mlx5e: Prevent PFC call for non ethernet ports
git bisect bad c66f2091c9248ddf42504c74cd327ae8619b04a4
# good: [a090bd4ff8387c409732a8e059fbf264ea0bdd56] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good a090bd4ff8387c409732a8e059fbf264ea0bdd56
# good: [1947030645b6012aeee98da764d6dd47071a6aad] Merge branch 'dsa-prefix-Global-macros'
git bisect good 1947030645b6012aeee98da764d6dd47071a6aad
# good: [69137ea60c9dad58773a1918de6c1b00b088520c] pktgen: Specify num packets per thread
git bisect good 69137ea60c9dad58773a1918de6c1b00b088520c
# good: [d24406c85d123df773bc4df88ad5da2233896919] udp: call dst_hold_safe() in udp_sk_rx_set_dst()
git bisect good d24406c85d123df773bc4df88ad5da2233896919
# bad: [5b7c9a8ff828287af5aebe93e707271bf1a82cc3] net: remove dst gc related code
git bisect bad 5b7c9a8ff828287af5aebe93e707271bf1a82cc3
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put()
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# good: [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly
git bisect good 95c47f9cf5e028d1ae77dc6c767c1edc8a18025b
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
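
The log above needed 16 test builds, which matches a binary search that halves the candidate range at every step. A minimal sketch of that idea in C, assuming a simplified linear history in which every commit after the first bad one is also bad (git bisect itself works on a commit graph, which this does not model). first_bad() and bad_from() are names invented for the example.

```c
#include <stddef.h>

/* Predicate: returns non-zero if the commit at index idx shows the bug. */
typedef int (*is_bad_fn)(size_t idx, void *ctx);

/* Example predicate: everything from index *ctx onwards is bad. */
static int bad_from(size_t idx, void *ctx)
{
	return idx >= *(size_t *)ctx;
}

/*
 * Commits 0..n-1 are ordered oldest to newest; commit 0 is known good and
 * commit n-1 is known bad. Returns the index of the first bad commit and,
 * if 'tests' is not NULL, adds the number of commits tested to *tests.
 */
static size_t first_bad(size_t n, is_bad_fn is_bad, void *ctx, unsigned *tests)
{
	size_t good = 0;
	size_t bad = n - 1;

	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (tests)
			(*tests)++;
		if (is_bad(mid, ctx))
			bad = mid;	/* culprit is mid or older */
		else
			good = mid;	/* culprit is newer than mid */
	}
	return bad;
}
```

Each test halves the remaining range, so the number of builds grows only logarithmically with the number of commits between the good and bad endpoints.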


In my first email I wrote >= 4.12, but I think 4.12 works. I bisected between
4.9 and 4.14 as we actually run 4.9 on the machine with the problem and 4.14
on most other routers.

I also tested 4.18.5 and it still shows this bug.


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts