Re: 4.1.0, kernel panic, pppoe_release

2015-09-25 Thread Denys Fedoryshchenko

On 2015-09-25 17:38, Guillaume Nault wrote:
> On Tue, Sep 22, 2015 at 04:47:48AM +0300, Denys Fedoryshchenko wrote:
> > Hi,
> > Sorry for the late reply, I was not able to push a new kernel on the pppoes
> > without permissions (they are production servers), I just got the OK.
> > 
> > I am testing the patch on another pppoe server with 9k users, for ~3 days it
> > seems fine. I will test today
> > also on the server that was experiencing crashes within 1 day.
> > 
> Thanks for the feedback. I'm about to submit a fix. Should I add a
> Tested-by tag for you?
On one of the servers I got the same crash as before, within hours. The
9k-user server also crashed after a while, so it seems it doesn't help.

I will do some more tests tomorrow.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.1.0, kernel panic, pppoe_release

2015-09-25 Thread Guillaume Nault
On Tue, Sep 22, 2015 at 04:47:48AM +0300, Denys Fedoryshchenko wrote:
> Hi,
> Sorry for the late reply, I was not able to push a new kernel on the pppoes
> without permissions (they are production servers), I just got the OK.
> 
> I am testing the patch on another pppoe server with 9k users, for ~3 days it
> seems fine. I will test today
> also on the server that was experiencing crashes within 1 day.
>
Thanks for the feedback. I'm about to submit a fix. Should I add a
Tested-by tag for you?


Re: 4.1.0, kernel panic, pppoe_release

2015-09-25 Thread Guillaume Nault
On Fri, Sep 25, 2015 at 06:02:42PM +0300, Denys Fedoryshchenko wrote:
> On 2015-09-25 17:38, Guillaume Nault wrote:
> >On Tue, Sep 22, 2015 at 04:47:48AM +0300, Denys Fedoryshchenko wrote:
> >>Hi,
> >>Sorry for the late reply, I was not able to push a new kernel on the pppoes
> >>without permissions (they are production servers), I just got the OK.
> >>
> >>I am testing the patch on another pppoe server with 9k users, for ~3 days
> >>it seems fine. I will test today
> >>also on the server that was experiencing crashes within 1 day.
> >>
> >Thanks for the feedback. I'm about to submit a fix. Should I add a
> >Tested-by tag for you?
> On one of the servers I got the same crash as before, within hours. The
> 9k-user server also crashed after a while, so it seems it doesn't help.
> I will do some more tests tomorrow.
Ok, this must be a different bug then. Do you have a trace of a crash
with the patched kernel?


Re: 4.1.0, kernel panic, pppoe_release

2015-09-21 Thread Denys Fedoryshchenko

Hi,
Sorry for the late reply, I was not able to push a new kernel on the pppoes
without permissions (they are production servers), I just got the OK.

I am testing the patch on another pppoe server with 9k users, for ~3 days it
seems fine. I will test today
also on the server that was experiencing crashes within 1 day.

On 2015-09-10 18:56, Guillaume Nault wrote:
> On Fri, Jul 17, 2015 at 09:16:14PM +0300, Denys Fedoryshchenko wrote:
> > Probably my knowledge of the kernel is not sufficient, but I will try a few
> > approaches.
> > One of them is to add to pppoe_unbind_sock_work:
> > 
> > pppox_unbind_sock(sk);
> > +/* Signal the death of the socket. */
> > +sk->sk_state = PPPOX_DEAD;
> > 
> I don't believe this will fix anything. pppox_unbind_sock() already
> sets sk->sk_state when necessary.
> 
> > I will wait first, to make sure this patch was what caused the kernel panic
> > (it needs a 24h testing cycle), then I will try this fix.
> > 
> I suspect the problem goes with actions performed on the underlying
> interface (MAC address, MTU or link state update). This triggers
> pppoe_flush_dev(), which cleans up the device without announcing it
> in sk->sk_state.
> 
> Can you please try the following patch?
> 
> ---
> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> index 3837ae3..2ed7506 100644
> --- a/drivers/net/ppp/pppoe.c
> +++ b/drivers/net/ppp/pppoe.c
> @@ -313,7 +313,6 @@ static void pppoe_flush_dev(struct net_device *dev)
>  		if (po->pppoe_dev == dev &&
>  		    sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND | PPPOX_ZOMBIE)) {
>  			pppox_unbind_sock(sk);
> -			sk->sk_state = PPPOX_ZOMBIE;
>  			sk->sk_state_change(sk);
>  			po->pppoe_dev = NULL;
>  			dev_put(dev);



Re: 4.1.0, kernel panic, pppoe_release

2015-09-10 Thread Guillaume Nault
On Fri, Jul 17, 2015 at 09:16:14PM +0300, Denys Fedoryshchenko wrote:
> Probably my knowledge of the kernel is not sufficient, but I will try a few
> approaches.
> One of them is to add to pppoe_unbind_sock_work:
> 
> pppox_unbind_sock(sk);
> +/* Signal the death of the socket. */
> +sk->sk_state = PPPOX_DEAD;
>
I don't believe this will fix anything. pppox_unbind_sock() already
sets sk->sk_state when necessary.
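
For reference, pppox_unbind_sock() already does roughly the following; this is
a sketch approximated from the 4.1-era drivers/net/ppp/pppox.c, not a verbatim
copy. Note that it already sets PPPOX_DEAD when the socket was bound, which is
why the proposed extra assignment is redundant:

```c
void pppox_unbind_sock(struct sock *sk)
{
	/* Clear the connection to the ppp device, if attached. */
	if (sk->sk_state & (PPPOX_BOUND | PPPOX_CONNECTED | PPPOX_ZOMBIE)) {
		ppp_unregister_channel(&pppox_sk(sk)->chan);
		sk->sk_state = PPPOX_DEAD;
	}
}
```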

> I will wait first, to make sure this patch was what caused the kernel panic
> (it needs a 24h testing cycle), then I will try this fix.
> 
I suspect the problem goes with actions performed on the underlying
interface (MAC address, MTU or link state update). This triggers
pppoe_flush_dev(), which cleans up the device without announcing it
in sk->sk_state.

Can you please try the following patch?

---
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 3837ae3..2ed7506 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -313,7 +313,6 @@ static void pppoe_flush_dev(struct net_device *dev)
 		if (po->pppoe_dev == dev &&
 		    sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND | PPPOX_ZOMBIE)) {
 			pppox_unbind_sock(sk);
-			sk->sk_state = PPPOX_ZOMBIE;
 			sk->sk_state_change(sk);
 			po->pppoe_dev = NULL;
 			dev_put(dev);


Re: 4.1.0, kernel panic, pppoe_release

2015-07-17 Thread Denys Fedoryshchenko

As I suspected, this kernel panic is caused by recent changes to pppoe.
The problem appears in accel-pppd (server), on loaded servers (2k
users and more).
It is most probably related to the change "pppoe: Use workqueue to die
properly when a PADT is received".

I will try to revert this and the related patches.

On 2015-07-14 13:57, Denys Fedoryshchenko wrote:

Here is the panic message from netconsole. Please let me know if any
additional information is required.

Jul 14 13:49:16 10.0.252.10 [76078.867822] BUG: unable to handle kernel
Jul 14 13:49:16 10.0.252.10 NULL pointer dereference
Jul 14 13:49:16 10.0.252.10 at 03f0
Jul 14 13:49:16 10.0.252.10 [76078.868280] IP:
Jul 14 13:49:16 10.0.252.10 [a011e12a]
pppoe_release+0x56/0x142 [pppoe]
Jul 14 13:49:16 10.0.252.10 [76078.868541] PGD 336e4a067
Jul 14 13:49:16 10.0.252.10 PUD 333f17067
Jul 14 13:49:16 10.0.252.10 PMD 0
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.868918] Oops:  [#1]
Jul 14 13:49:16 10.0.252.10 SMP
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.869226] Modules linked in:
Jul 14 13:49:16 10.0.252.10 netconsole
Jul 14 13:49:16 10.0.252.10 configfs
Jul 14 13:49:16 10.0.252.10 coretemp
Jul 14 13:49:16 10.0.252.10 sch_fq
Jul 14 13:49:16 10.0.252.10 cls_fw
Jul 14 13:49:16 10.0.252.10 act_police
Jul 14 13:49:16 10.0.252.10 cls_u32
Jul 14 13:49:16 10.0.252.10 sch_ingress
Jul 14 13:49:16 10.0.252.10 sch_sfq
Jul 14 13:49:16 10.0.252.10 sch_htb
Jul 14 13:49:16 10.0.252.10 pppoe
Jul 14 13:49:16 10.0.252.10 pppox
Jul 14 13:49:16 10.0.252.10 ppp_generic
Jul 14 13:49:16 10.0.252.10 slhc
Jul 14 13:49:16 10.0.252.10 nf_nat_pptp
Jul 14 13:49:16 10.0.252.10 nf_nat_proto_gre
Jul 14 13:49:16 10.0.252.10 nf_conntrack_pptp
Jul 14 13:49:16 10.0.252.10 nf_conntrack_proto_gre
Jul 14 13:49:16 10.0.252.10 tun
Jul 14 13:49:16 10.0.252.10 xt_REDIRECT
Jul 14 13:49:16 10.0.252.10 nf_nat_redirect
Jul 14 13:49:16 10.0.252.10 xt_set
Jul 14 13:49:16 10.0.252.10 xt_TCPMSS
Jul 14 13:49:16 10.0.252.10 ipt_REJECT
Jul 14 13:49:16 10.0.252.10 nf_reject_ipv4
Jul 14 13:49:16 10.0.252.10 ts_bm
Jul 14 13:49:16 10.0.252.10 xt_string
Jul 14 13:49:16 10.0.252.10 xt_connmark
Jul 14 13:49:16 10.0.252.10 xt_DSCP
Jul 14 13:49:16 10.0.252.10 xt_mark
Jul 14 13:49:16 10.0.252.10 xt_tcpudp
Jul 14 13:49:16 10.0.252.10 iptable_mangle
Jul 14 13:49:16 10.0.252.10 iptable_filter
Jul 14 13:49:16 10.0.252.10 iptable_nat
Jul 14 13:49:16 10.0.252.10 nf_conntrack_ipv4
Jul 14 13:49:16 10.0.252.10 nf_defrag_ipv4
Jul 14 13:49:16 10.0.252.10 nf_nat_ipv4
Jul 14 13:49:16 10.0.252.10 nf_nat
Jul 14 13:49:16 10.0.252.10 nf_conntrack
Jul 14 13:49:16 10.0.252.10 ip_tables
Jul 14 13:49:16 10.0.252.10 x_tables
Jul 14 13:49:16 10.0.252.10 ip_set_hash_ip
Jul 14 13:49:16 10.0.252.10 ip_set
Jul 14 13:49:16 10.0.252.10 nfnetlink
Jul 14 13:49:16 10.0.252.10 8021q
Jul 14 13:49:16 10.0.252.10 garp
Jul 14 13:49:16 10.0.252.10 mrp
Jul 14 13:49:16 10.0.252.10 stp
Jul 14 13:49:16 10.0.252.10 llc
Jul 14 13:49:16 10.0.252.10 [last unloaded: netconsole]
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.873195] CPU: 3 PID: 2940 Comm:
accel-pppd Not tainted 4.1.0-build-0074 #7
Jul 14 13:49:16 10.0.252.10 [76078.873396] Hardware name: HP ProLiant
DL320e Gen8 v2, BIOS P80 04/02/2015
Jul 14 13:49:16 10.0.252.10 [76078.873598] task: 8800b1886ba0 ti:
8800b09f4000 task.ti: 8800b09f4000
Jul 14 13:49:16 10.0.252.10 [76078.873929] RIP: 0010:[a011e12a]
Jul 14 13:49:16 10.0.252.10 [a011e12a]
pppoe_release+0x56/0x142 [pppoe]
Jul 14 13:49:16 10.0.252.10 [76078.874317] RSP: 0018:8800b09f7e28
EFLAGS: 00010202
Jul 14 13:49:16 10.0.252.10 [76078.874512] RAX:  RBX:
88032a214400 RCX: 
Jul 14 13:49:16 10.0.252.10 [76078.874709] RDX: 000d RSI:
fe01 RDI: 8180d6da
Jul 14 13:49:16 10.0.252.10 [76078.874906] RBP: 8800b09f7e68 R08:
 R09: 
Jul 14 13:49:16 10.0.252.10 [76078.875102] R10: 88031ef6a110 R11:
0293 R12: 88030f8d8fc0
Jul 14 13:49:16 10.0.252.10 [76078.875299] R13: 88030f8d8ff0 R14:
88033115ee40 R15: 8803394e4920
Jul 14 13:49:16 10.0.252.10 [76078.875499] FS:  7f79b602c700()
GS:88034746() knlGS:
Jul 14 13:49:16 10.0.252.10 [76078.875837] CS:  0010 DS:  ES: 
CR0: 80050033
Jul 14 13:49:16 10.0.252.10 [76078.876036] CR2: 03f0 CR3:
000335425000 CR4: 001407e0
Jul 14 13:49:16 10.0.252.10 [76078.876239] Stack:
Jul 14 13:49:16 10.0.252.10 [76078.876434]  88033ac45c80
Jul 14 13:49:16 10.0.252.10 
Jul 14 13:49:16 10.0.252.10 0001
Jul 14 13:49:16 10.0.252.10 88030f8d8fc0
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.877001]  a0120260
Jul 14 13:49:16 10.0.252.10 88030f8d8ff0
Jul 14 13:49:16 10.0.252.10 88033115ee40
Jul 14 13:49:16 10.0.252.10 8803394e4920
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 

Re: 4.1.0, kernel panic, pppoe_release

2015-07-17 Thread Denys Fedoryshchenko
Probably my knowledge of the kernel is not sufficient, but I will try a few
approaches.

One of them is to add to pppoe_unbind_sock_work:

pppox_unbind_sock(sk);
+/* Signal the death of the socket. */
+sk->sk_state = PPPOX_DEAD;

I will wait first, to make sure this patch was what caused the kernel panic
(it needs a 24h testing cycle), then I will try this fix.
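
In context, the proposed change would land in pppoe_unbind_sock_work() roughly
as follows; the function body is approximated from the 4.1-era
drivers/net/ppp/pppoe.c and should be treated as a sketch:

```c
static void pppoe_unbind_sock_work(struct work_struct *work)
{
	struct pppox_sock *po = container_of(work, struct pppox_sock,
					     proto.pppoe.padt_work);
	struct sock *sk = sk_pppox(po);

	lock_sock(sk);
	pppox_unbind_sock(sk);
	/* Proposed addition: signal the death of the socket. */
	sk->sk_state = PPPOX_DEAD;
	release_sock(sk);
	sock_put(sk);
}
```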


On 2015-07-17 18:36, Dan Williams wrote:
> On Fri, 2015-07-17 at 12:24 +0300, Denys Fedoryshchenko wrote:
> > As i suspect, this kernel panic caused by recent changes to pppoe.
> > This problem appearing in accel-pppd (server), on loaded servers (2k
> > users and more).
> > Most probably related to changed pppoe: Use workqueue to die properly
> > when a PADT is received
> > I will try to reverse this and related patches.
> > 
> While I didn't write the patch, I'm the one that started the process
> that got it submitted...  Could you review the patch quickly too to see
> if you can spot anything amiss with it, so that it could get fixed up?
> The original patch does fix a real problem so ideally we don't have to
> revert the whole thing upstream.
> 
> Dan


> > On 2015-07-14 13:57, Denys Fedoryshchenko wrote:
> > > Here is the panic message from netconsole. Please let me know if any
> > > additional information is required.
> > > [panic trace snipped; identical to the trace in the 2015-07-17 message
> > > above]