Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-03-05 Thread Guillaume Nault
On Sat, Mar 03, 2018 at 11:33:53AM +0200, Denys Fedoryshchenko wrote:
> On 2018-03-02 19:43, Guillaume Nault wrote:
> > Out of curiosity, did unit-cache really bring performance improvements
> > on your workload?
> On old kernels it definitely did. Due to local specifics (electricity outages),
> I might have a few thousand interfaces deleted and created again in a short
> period of time.
> And previously, interface creation/deletion (especially when there are
> thousands of them) was very expensive.
I see. Our workload is a bit different, that's probably why we've never
felt the need for the unit-cache.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-03-03 Thread Denys Fedoryshchenko

On 2018-03-02 19:43, Guillaume Nault wrote:

On Thu, Mar 01, 2018 at 10:07:05PM +0200, Denys Fedoryshchenko wrote:

On 2018-03-01 22:01, Guillaume Nault wrote:
> diff --git a/drivers/net/ppp/ppp_generic.c
> b/drivers/net/ppp/ppp_generic.c
> index 255a5def56e9..2acf4b0eabd1 100644
> --- a/drivers/net/ppp/ppp_generic.c
> +++ b/drivers/net/ppp/ppp_generic.c
> @@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int
> unit)
>goto outl;
>
>ppp_lock(ppp);
> +  spin_lock_bh(&pch->downl);
> +  if (!pch->chan) {
> +  /* Don't connect unregistered channels */
> +  ppp_unlock(ppp);
> +  spin_unlock_bh(&pch->downl);


This is obviously wrong. It should have been
+   spin_unlock_bh(&pch->downl);
+   ppp_unlock(ppp);

Sorry, I shouldn't have hurried.
This is fixed in the official version.


> +  ret = -ENOTCONN;
> +  goto outl;
> +  }
> +  spin_unlock_bh(&pch->downl);
>if (pch->file.hdrlen > ppp->file.hdrlen)
>ppp->file.hdrlen = pch->file.hdrlen;
>hdrlen = pch->file.hdrlen + 2;   /* for protocol bytes */
Ok, i will try to test that at night.
Thanks a lot! For me also problem solved anyway by removing unit-cache, just
i think it's nice to have bug fixed :)

I think this bug has been there forever, indeed it's good to have it fixed.

Thanks a lot for your help (and patience!).

FYI, if you see accel-ppp logs like
"ioctl(PPPIOCCONNECT): Transport endpoint is not connected", then that
means the patch prevented the scenario that was leading to the original
crash.

Out of curiosity, did unit-cache really bring performance improvements
on your workload?
On old kernels it definitely did. Due to local specifics (electricity
outages), I might have a few thousand interfaces deleted and created
again in a short period of time.
And previously, interface creation/deletion (especially when there are
thousands of them) was very expensive.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-03-02 Thread Guillaume Nault
On Thu, Mar 01, 2018 at 10:07:05PM +0200, Denys Fedoryshchenko wrote:
> On 2018-03-01 22:01, Guillaume Nault wrote:
> > diff --git a/drivers/net/ppp/ppp_generic.c
> > b/drivers/net/ppp/ppp_generic.c
> > index 255a5def56e9..2acf4b0eabd1 100644
> > --- a/drivers/net/ppp/ppp_generic.c
> > +++ b/drivers/net/ppp/ppp_generic.c
> > @@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int
> > unit)
> > goto outl;
> > 
> > ppp_lock(ppp);
> > +   spin_lock_bh(&pch->downl);
> > +   if (!pch->chan) {
> > +   /* Don't connect unregistered channels */
> > +   ppp_unlock(ppp);
> > +   spin_unlock_bh(&pch->downl);

This is obviously wrong. It should have been
+   spin_unlock_bh(&pch->downl);
+   ppp_unlock(ppp);

Sorry, I shouldn't have hurried.
This is fixed in the official version.
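
The correction above restores the usual convention of releasing nested locks in the reverse order of acquisition. As a rough userspace illustration only (every `toy_*` name below is made up for this sketch and none of it is kernel code), a small checker that accepts a release only when it targets the most recently acquired lock flags the error path as first posted and accepts the corrected one:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEPTH 4

/* Hypothetical identifiers standing in for ppp_lock() and pch->downl. */
enum toy_lock_id { TOY_PPP_LOCK = 1, TOY_DOWNL_LOCK = 2 };

/* Records the order in which locks were taken. */
struct toy_lock_stack {
	enum toy_lock_id held[TOY_MAX_DEPTH];
	int depth;
};

static void toy_acquire(struct toy_lock_stack *s, enum toy_lock_id id)
{
	assert(s->depth < TOY_MAX_DEPTH);
	s->held[s->depth++] = id;
}

/* Returns true only if 'id' is the most recently acquired lock still held. */
static bool toy_release_lifo(struct toy_lock_stack *s, enum toy_lock_id id)
{
	if (s->depth == 0 || s->held[s->depth - 1] != id)
		return false;
	s->depth--;
	return true;
}

/* Error path as first posted: the outer lock is released first. */
static bool toy_error_path_as_posted(void)
{
	struct toy_lock_stack s = { .depth = 0 };

	toy_acquire(&s, TOY_PPP_LOCK);
	toy_acquire(&s, TOY_DOWNL_LOCK);
	return toy_release_lifo(&s, TOY_PPP_LOCK);
}

/* Corrected error path: inner lock first, then the outer one. */
static bool toy_error_path_corrected(void)
{
	struct toy_lock_stack s = { .depth = 0 };

	toy_acquire(&s, TOY_PPP_LOCK);
	toy_acquire(&s, TOY_DOWNL_LOCK);
	return toy_release_lifo(&s, TOY_DOWNL_LOCK) &&
	       toy_release_lifo(&s, TOY_PPP_LOCK);
}
```

The checker only models ordering; it says nothing about what the real spinlocks do with bottom halves.
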

> > +   ret = -ENOTCONN;
> > +   goto outl;
> > +   }
> > +   spin_unlock_bh(&pch->downl);
> > if (pch->file.hdrlen > ppp->file.hdrlen)
> > ppp->file.hdrlen = pch->file.hdrlen;
> > hdrlen = pch->file.hdrlen + 2;  /* for protocol bytes */
> Ok, i will try to test that at night.
> Thanks a lot! For me also problem solved anyway by removing unit-cache, just
> i think it's nice to have bug fixed :)
> 
I think this bug has been there forever, indeed it's good to have it fixed.
Thanks a lot for your help (and patience!).

FYI, if you see accel-ppp logs like
"ioctl(PPPIOCCONNECT): Transport endpoint is not connected", then that
means the patch prevented the scenario that was leading to the original
crash.
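
For reference, the quoted log text is what a daemon would typically print when the ioctl fails with ENOTCONN: on glibc-based Linux systems, strerror(ENOTCONN) yields "Transport endpoint is not connected". The following is a minimal sketch of that mapping. How accel-ppp actually builds its log lines is an assumption here, and both `toy_*` helpers are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a failed-ioctl log line in the style of the
 * message quoted above. The real accel-ppp formatting is not shown in this
 * thread, so this layout is an assumption. */
static int toy_format_ioctl_error(char *buf, size_t len, const char *what, int err)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "ioctl(%s): %s", what, strerror(err));
}

/* Hypothetical helper: true if 'err' is the error code the patch returns
 * when asked to connect an unregistered channel. */
static int toy_is_refused_connect(int err)
{
	return err == ENOTCONN;
}
```

A caller would pass the errno value saved right after the failed ioctl(PPPIOCCONNECT) call, then treat a refused connect as a normal session teardown rather than as a fatal error.
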

Out of curiosity, did unit-cache really bring performance improvements
on your workload?



Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-03-01 Thread Denys Fedoryshchenko



On 2018-03-01 22:01, Guillaume Nault wrote:

On Tue, Feb 27, 2018 at 07:56:27PM +0100, Guillaume Nault wrote:

On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-23 12:07, Guillaume Nault wrote:
> > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-23 11:38, Guillaume Nault wrote:
> > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > > > I'm using accel-ppp that has unit-cache option, i guess for
> > > > > "reusing" ppp
> > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > > > users quite
> > > > > expensive).
> > > > > Maybe it is somehow related and can be that scenario causing this bug?
> > > > >
> > > > Indeed, it'd be interesting to know if unit-cache is part of the
> > > > equation (if it's workable for you to disable it).
> > > Already did that and testing, unfortunately i had to disable KASAN
> > > and full
> > > refcount, as performance hit is too heavy for me. I will try to
> > > enable KASAN
> > > alone tomorrow.
> > >
> > Don't hesitate to post the result even if you can't afford enabling
> > KASAN.
> Till now 4 days and no reboots.
>
That unit-cache information was very useful. I can now reproduce the
issue and work on a fix.


You can try the following patch.

Sorry for the delay, I'm a bit out of time these days.

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 255a5def56e9..2acf4b0eabd1 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int unit)
goto outl;

ppp_lock(ppp);
+   spin_lock_bh(&pch->downl);
+   if (!pch->chan) {
+   /* Don't connect unregistered channels */
+   ppp_unlock(ppp);
+   spin_unlock_bh(&pch->downl);
+   ret = -ENOTCONN;
+   goto outl;
+   }
+   spin_unlock_bh(&pch->downl);
if (pch->file.hdrlen > ppp->file.hdrlen)
ppp->file.hdrlen = pch->file.hdrlen;
hdrlen = pch->file.hdrlen + 2;   /* for protocol bytes */

OK, I will try to test that at night.
Thanks a lot! For me the problem is also solved anyway by removing unit-cache;
I just think it's nice to have the bug fixed :)


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-03-01 Thread Guillaume Nault
On Tue, Feb 27, 2018 at 07:56:27PM +0100, Guillaume Nault wrote:
> On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-23 12:07, Guillaume Nault wrote:
> > > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> > > > On 2018-02-23 11:38, Guillaume Nault wrote:
> > > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > > > > I'm using accel-ppp that has unit-cache option, i guess for
> > > > > > "reusing" ppp
> > > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > > > > users quite
> > > > > > expensive).
> > > > > > Maybe it is somehow related and can be that scenario causing this 
> > > > > > bug?
> > > > > >
> > > > > Indeed, it'd be interesting to know if unit-cache is part of the
> > > > > equation (if it's workable for you to disable it).
> > > > Already did that and testing, unfortunately i had to disable KASAN
> > > > and full
> > > > refcount, as performance hit is too heavy for me. I will try to
> > > > enable KASAN
> > > > alone tomorrow.
> > > > 
> > > Don't hesitate to post the result even if you can't afford enabling
> > > KASAN.
> > Till now 4 days and no reboots.
> > 
> That unit-cache information was very useful. I can now reproduce the
> issue and work on a fix.
>
You can try the following patch.

Sorry for the delay, I'm a bit short on time these days.

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 255a5def56e9..2acf4b0eabd1 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -3161,6 +3161,15 @@ ppp_connect_channel(struct channel *pch, int unit)
goto outl;
 
ppp_lock(ppp);
+   spin_lock_bh(&pch->downl);
+   if (!pch->chan) {
+   /* Don't connect unregistered channels */
+   ppp_unlock(ppp);
+   spin_unlock_bh(&pch->downl);
+   ret = -ENOTCONN;
+   goto outl;
+   }
+   spin_unlock_bh(&pch->downl);
if (pch->file.hdrlen > ppp->file.hdrlen)
ppp->file.hdrlen = pch->file.hdrlen;
hdrlen = pch->file.hdrlen + 2;  /* for protocol bytes */


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-27 Thread Guillaume Nault
On Tue, Feb 27, 2018 at 12:58:55PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-23 12:07, Guillaume Nault wrote:
> > On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-23 11:38, Guillaume Nault wrote:
> > > > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > > > I'm using accel-ppp that has unit-cache option, i guess for
> > > > > "reusing" ppp
> > > > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > > > users quite
> > > > > expensive).
> > > > > Maybe it is somehow related and can be that scenario causing this bug?
> > > > >
> > > > Indeed, it'd be interesting to know if unit-cache is part of the
> > > > equation (if it's workable for you to disable it).
> > > Already did that and testing, unfortunately i had to disable KASAN
> > > and full
> > > refcount, as performance hit is too heavy for me. I will try to
> > > enable KASAN
> > > alone tomorrow.
> > > 
> > Don't hesitate to post the result even if you can't afford enabling
> > KASAN.
> Till now 4 days and no reboots.
> 
That unit-cache information was very useful. I can now reproduce the
issue and work on a fix.

Thanks Denys!


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-27 Thread Guillaume Nault
On Thu, Feb 22, 2018 at 07:30:38PM +0100, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
> > For me it looks like pch->clist is not removed from the list ppp->channels
> > when destroyed via ppp_release(). But I don't want to pretend I understand
> > ppp logic.
> > 
> I've thought about that too, but couldn't find a scenario that could
> trigger the bug.
> 
> To get ->private_data pointing to a struct channel pointer, a file needs
> to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
> must have been registered with ppp_register_net_channel(). Both
> operations take a reference on the channel, which means that, before
> adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
> the channel is already held by a /dev/ppp file and by the code that
> registered the channel in the first place.
> 
> Therefore, closing the /dev/ppp file can't be enough to make
> ppp_release() free the channel. We need to unregister the channel for
> the refcount to drop to 0. And ppp_unregister_channel(), removes
> pch->clist from ppp->channels before decrementing the reference
> counter.
> 
And this is where my reasoning failed... If pch->clist hasn't been
added to a ppp->channels list (that is, there was no
ppp_connect_channel() call for this channel), then
ppp_unregister_channel() only decrements the reference counter.

Therefore, we now have an unregistered channel which is only held by a
/dev/ppp file. But ioctl(PPPIOCCONNECT) still works on such a file, so
one can add pch->clist to a ppp->channels list. When the file
descriptor closes, we fall in Cong's scenario and the channel is freed,
leaving dangling pointers in ppp->channels.
Then, it's just a matter of calling ioctl(PPPIOCCONNECT) on this ppp
unit again to make list_add_tail() follow those invalid pointers and
crash.

Thank you Cong for putting me on the right track.
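
The sequence described above can be followed with a small userspace model. This is only a sketch under stated assumptions: the `toy_*` names and the struct are invented for illustration, the real driver uses refcount_t, list_head and locking that are not modelled here, and "freed" is a flag rather than a real free() so that the end state can be inspected safely. The `guard` flag models a check that refuses to connect a channel that is no longer registered.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a channel; not the kernel's struct channel. */
struct toy_channel {
	int refcnt;       /* references held on the channel */
	bool registered;  /* models pch->chan being non-NULL */
	bool on_list;     /* models pch->clist linked into ppp->channels */
	bool freed;       /* set instead of calling free() */
};

/* Drop one reference. Nothing here unlinks the channel from the list. */
static void toy_put(struct toy_channel *pch)
{
	assert(pch->refcnt > 0);
	if (--pch->refcnt == 0)
		pch->freed = true;
}

/* Registration takes the first reference. */
static void toy_register(struct toy_channel *pch)
{
	pch->refcnt = 1;
	pch->registered = true;
	pch->on_list = false;
	pch->freed = false;
}

/* Models ioctl(PPPIOCATTCHAN): the /dev/ppp file takes a reference. */
static void toy_attach(struct toy_channel *pch)
{
	pch->refcnt++;
}

/* Unregister: unlink if connected, then drop the registration reference. */
static void toy_unregister(struct toy_channel *pch)
{
	pch->registered = false;
	pch->on_list = false;
	toy_put(pch);
}

/* Models ioctl(PPPIOCCONNECT); with 'guard', unregistered channels fail. */
static int toy_connect(struct toy_channel *pch, bool guard)
{
	if (guard && !pch->registered)
		return -ENOTCONN;
	pch->on_list = true;
	return 0;
}

/* Models closing the /dev/ppp file: drop the file's reference. */
static void toy_release(struct toy_channel *pch)
{
	toy_put(pch);
}

/* True if the sequence ends with a freed channel still on the list. */
static bool toy_scenario_leaves_dangling_entry(bool guard)
{
	struct toy_channel ch;

	toy_register(&ch);
	toy_attach(&ch);
	toy_unregister(&ch);           /* never connected: only the ref drops */
	(void)toy_connect(&ch, guard); /* connect after unregistration */
	toy_release(&ch);              /* last reference goes away */
	return ch.freed && ch.on_list;
}
```

Without the guard, the model ends with a freed channel that is still marked as linked, which corresponds to the dangling pointers in ppp->channels. With the guard, the connect step fails with -ENOTCONN and the channel is freed while unlinked.
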


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-27 Thread Denys Fedoryshchenko

On 2018-02-23 12:07, Guillaume Nault wrote:

On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:

On 2018-02-23 11:38, Guillaume Nault wrote:
> On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > I'm using accel-ppp that has unit-cache option, i guess for
> > "reusing" ppp
> > interfaces (because creating a lot of interfaces on BRAS with 8k
> > users quite
> > expensive).
> > Maybe it is somehow related and can be that scenario causing this bug?
> >
> Indeed, it'd be interesting to know if unit-cache is part of the
> equation (if it's workable for you to disable it).
Already did that and testing, unfortunately i had to disable KASAN and full
refcount, as performance hit is too heavy for me. I will try to enable KASAN
alone tomorrow.

Don't hesitate to post the result even if you can't afford enabling KASAN.

So far, 4 days and no reboots.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-24 Thread Denys Fedoryshchenko

On 2018-02-23 12:07, Guillaume Nault wrote:

On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:

On 2018-02-23 11:38, Guillaume Nault wrote:
> On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > I'm using accel-ppp that has unit-cache option, i guess for
> > "reusing" ppp
> > interfaces (because creating a lot of interfaces on BRAS with 8k
> > users quite
> > expensive).
> > Maybe it is somehow related and can be that scenario causing this bug?
> >
> Indeed, it'd be interesting to know if unit-cache is part of the
> equation (if it's workable for you to disable it).
Already did that and testing, unfortunately i had to disable KASAN and full
refcount, as performance hit is too heavy for me. I will try to enable KASAN
alone tomorrow.

Don't hesitate to post the result even if you can't afford enabling KASAN.

Very likely unit-cache is a major contributor to these reboots.
After disabling it, it has been almost 48h with no reboots yet.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-23 Thread Denys Fedoryshchenko

On 2018-02-23 12:07, Guillaume Nault wrote:

On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:

On 2018-02-23 11:38, Guillaume Nault wrote:
> On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > I'm using accel-ppp that has unit-cache option, i guess for
> > "reusing" ppp
> > interfaces (because creating a lot of interfaces on BRAS with 8k
> > users quite
> > expensive).
> > Maybe it is somehow related and can be that scenario causing this bug?
> >
> Indeed, it'd be interesting to know if unit-cache is part of the
> equation (if it's workable for you to disable it).
Already did that and testing, unfortunately i had to disable KASAN and full
refcount, as performance hit is too heavy for me. I will try to enable KASAN
alone tomorrow.

Don't hesitate to post the result even if you can't afford enabling KASAN.
For sure. I expect it to crash even if KASAN is not enabled (I just won't
have a clean message about the reason).
It usually happened for me within 6-10 hours after an upgrade at night,
when load started to increase. I prefer to wait at least 48h, even if
there is no crash.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-23 Thread Guillaume Nault
On Fri, Feb 23, 2018 at 11:41:43AM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-23 11:38, Guillaume Nault wrote:
> > On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> > > I'm using accel-ppp that has unit-cache option, i guess for
> > > "reusing" ppp
> > > interfaces (because creating a lot of interfaces on BRAS with 8k
> > > users quite
> > > expensive).
> > > Maybe it is somehow related and can be that scenario causing this bug?
> > > 
> > Indeed, it'd be interesting to know if unit-cache is part of the
> > equation (if it's workable for you to disable it).
> Already did that and testing, unfortunately i had to disable KASAN and full
> refcount, as performance hit is too heavy for me. I will try to enable KASAN
> alone tomorrow.
> 
Don't hesitate to post the result even if you can't afford enabling KASAN.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-23 Thread Denys Fedoryshchenko

On 2018-02-23 11:38, Guillaume Nault wrote:

On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
I'm using accel-ppp that has unit-cache option, i guess for "reusing" ppp
interfaces (because creating a lot of interfaces on BRAS with 8k users quite
expensive).
Maybe it is somehow related and can be that scenario causing this bug?


Indeed, it'd be interesting to know if unit-cache is part of the
equation (if it's workable for you to disable it).
I already did that and am testing. Unfortunately I had to disable KASAN and
full refcount, as the performance hit is too heavy for me. I will try to
enable KASAN alone tomorrow.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-23 Thread Guillaume Nault
On Thu, Feb 22, 2018 at 08:51:19PM +0200, Denys Fedoryshchenko wrote:
> I'm using accel-ppp that has unit-cache option, i guess for "reusing" ppp
> interfaces (because creating a lot of interfaces on BRAS with 8k users quite
> expensive).
> Maybe it is somehow related and can be that scenario causing this bug?
> 
Indeed, it'd be interesting to know if unit-cache is part of the
equation (if it's workable for you to disable it).


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-22 Thread Denys Fedoryshchenko

On 2018-02-22 20:30, Guillaume Nault wrote:

On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault wrote:

> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 17:55, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > Here we go:
>> > >
>> > >   [24558.921549]
>> > > ==
>> > >   [24558.922167] BUG: KASAN: use-after-free in
>> > > ppp_ioctl+0xa6a/0x1522
>> > > [ppp_generic]
>> > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
>> > > accel-pppd/12622
>> > >   [24558.923113]
>> > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > W
>> > > 4.15.3-build-0134 #1
>> > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > BIOS P80
>> > > 04/02/2015
>> > >   [24558.924406] Call Trace:
>> > >   [24558.924753]  dump_stack+0x46/0x59
>> > >   [24558.925103]  print_address_description+0x6b/0x23b
>> > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.925797]  kasan_report+0x21b/0x241
>> > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
>> > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
>> > >   [24558.927523]  ? kernel_read+0xed/0xed
>> > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
>> > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
>> > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
>> > >   [24558.928898]  vfs_ioctl+0x6e/0x81
>> > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
>> > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
>> > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
>> > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
>> > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
>> > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
>> > >   [24558.931252]  SyS_ioctl+0x39/0x55
>> > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
>> > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.932627] RIP: 0033:0x7f302849d8a7
>> > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
>> > > ORIG_RAX:
>> > > 0010
>> > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
>> > > 7f302849d8a7
>> > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
>> > > 3a67
>> > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
>> > > 55c8308d8e40
>> > >   [24558.934607] R10: 0008 R11: 0206 R12:
>> > > 7f3023f49358
>> > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
>> > > 7f3029a53700
>> > >   [24558.935288]
>> > >   [24558.935626] Allocated by task 12622:
>> > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
>> > > [ppp_generic]
>> > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
>> > >   [24558.936640]  SyS_connect+0x14b/0x1b7
>> > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.937655]
>> > >   [24558.937993] Freed by task 12622:
>> > >   [24558.938321]  kfree+0xb0/0x11d
>> > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
>> > >   [24558.938994]  __fput+0x2ba/0x51a
>> > >   [24558.939332]  task_work_run+0x11c/0x13d
>> > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
>> > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
>> > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.947099]
>> >
>> > Your first guess was right. It looks like we have an issue with
>> > reference counting on the channels. Can you send me your ppp_generic.o?
>> http://nuclearcat.com/ppp_generic.o
>> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>>
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.

For me it looks like pch->clist is not removed from the list ppp->channels
when destroyed via ppp_release(). But I don't want to pretend I understand
ppp logic.


I've thought about that too, but couldn't find a scenario that could
trigger the bug.

To get ->private_data pointing to a struct channel pointer, a file needs
to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
must have been registered with ppp_register_net_channel(). Both
operations take a reference on the channel, which means that, before
adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
the channel is already held by a /dev/ppp file and by the code that
registered the channel in the first place.

Therefore, closing the /dev/ppp 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-22 Thread Guillaume Nault
On Wed, Feb 21, 2018 at 12:04:30PM -0800, Cong Wang wrote:
> On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> >> On 2018-02-15 17:55, Guillaume Nault wrote:
> >> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> >> > > Here we go:
> >> > >
> >> > >   [24558.921549]
> >> > > ==
> >> > >   [24558.922167] BUG: KASAN: use-after-free in
> >> > > ppp_ioctl+0xa6a/0x1522
> >> > > [ppp_generic]
> >> > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> >> > > accel-pppd/12622
> >> > >   [24558.923113]
> >> > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> >> > > W
> >> > > 4.15.3-build-0134 #1
> >> > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> >> > > BIOS P80
> >> > > 04/02/2015
> >> > >   [24558.924406] Call Trace:
> >> > >   [24558.924753]  dump_stack+0x46/0x59
> >> > >   [24558.925103]  print_address_description+0x6b/0x23b
> >> > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >> > >   [24558.925797]  kasan_report+0x21b/0x241
> >> > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >> > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> >> > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> >> > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> >> > >   [24558.927523]  ? kernel_read+0xed/0xed
> >> > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> >> > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> >> > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> >> > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> >> > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> >> > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> >> > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> >> > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> >> > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> >> > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> >> > >   [24558.931252]  SyS_ioctl+0x39/0x55
> >> > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> >> > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> >> > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> >> > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> >> > > ORIG_RAX:
> >> > > 0010
> >> > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> >> > > 7f302849d8a7
> >> > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> >> > > 3a67
> >> > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> >> > > 55c8308d8e40
> >> > >   [24558.934607] R10: 0008 R11: 0206 R12:
> >> > > 7f3023f49358
> >> > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> >> > > 7f3029a53700
> >> > >   [24558.935288]
> >> > >   [24558.935626] Allocated by task 12622:
> >> > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> >> > > [ppp_generic]
> >> > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> >> > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> >> > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> >> > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > >   [24558.937655]
> >> > >   [24558.937993] Freed by task 12622:
> >> > >   [24558.938321]  kfree+0xb0/0x11d
> >> > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> >> > >   [24558.938994]  __fput+0x2ba/0x51a
> >> > >   [24558.939332]  task_work_run+0x11c/0x13d
> >> > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> >> > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> >> > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >> > >   [24558.947099]
> >> >
> >> > Your first guess was right. It looks like we have an issue with
> >> > reference counting on the channels. Can you send me your ppp_generic.o?
> >> http://nuclearcat.com/ppp_generic.o
> >> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> >>
> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > by two different file descriptors, I can't see how this could defeat
> > the reference counting mechanism. I'm going to think more about it.
> 
> For me it looks like pch->clist is not removed from the list ppp->channels
> when destroyed via ppp_release(). But I don't want to pretend I understand
> ppp logic.
> 
I've thought about that too, but couldn't find a scenario that could
trigger the bug.

To get ->private_data pointing to a struct channel pointer, a file needs
to ioctl(PPPIOCATTCHAN) first. For this call to succeed, the channel
must have been registered with ppp_register_net_channel(). Both
operations take a reference on the channel, which means that, before
adding pch->clist to a ppp->channels list (with ppp_connect_channel()),
the channel is 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Guillaume Nault
On Sun, Feb 18, 2018 at 12:01:02PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-16 20:48, Guillaume Nault wrote:
> > On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> > > As far as i can see there is only KASAN triggered again(and server
> > > rebooted
> > > shortly after that), but nothing else:
> > > 
> > Ok, so no refcount failure detected. Not what I expected... but that's
> > still an information. It's getting even harder to find a ppp scenario
> > that could lead to such symptoms.
> > If that's acceptable for you, you can try reverting the few commits
> > that entered after 4.14.
> > 
> > 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom
> > of lower device into account on xmit
> > 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex
> > before registering device
> > e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks
> > added
> > f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when
> > cleanup
> > 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> > 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert
> > syncppp.refcnt from atomic_t to refcount_t
> > d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert
> > ppp_file.refcnt from atomic_t to refcount_t
> > 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert
> > asyncppp.refcnt from atomic_t to refcount_t
> > 
> > Sorry, but I have nothing better to propose for now. At least that
> > should help narrowing the problem space.
> > I'm going to stress test ppp_generic and pppoe on my side.
> > 
> Quick update.
> Testing the first 5 patches didn't change anything.
> But reverting more, with the last 4 patches as well (I did them all together),
> is changing things; I probably need to repeat for one more night, reverting
> just all the refcount_t patches.
>
So you got the following trace with all 8 patches reverted, right?
I prefer to concentrate on the other traces for now. If this one tends
to be reproducible, you can try to activate lockdep (for lack of better
suggestion).

>  [25222.173840] [ cut here ]
>  [25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed out
>  [25222.174618] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:323
> dev_watchdog+0x44a/0x555
>  [25222.175212] Modules linked in: pppoe pppox ppp_generic slhc netconsole
> configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp
> nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
> xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set
> xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables
> x_tables 8021q garp mrp stp llc ixgbe dca
>  [25222.177133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GB   W
> 4.15.3-build-0134 #6
>  [25222.184121] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
>  [25222.184457] RIP: 0010:dev_watchdog+0x44a/0x555
>  [25222.184791] RSP: 0018:8803f22c7d98 EFLAGS: 00010292
>  [25222.185127] RAX:  RBX: 8803ded00438 RCX:
> 
>  [25222.185463] RDX: 0001 RSI: 0002 RDI:
> ed007e458fa8
>  [25222.185797] RBP: 8803ded0 R08: 0001 R09:
> 
>  [25222.186133] R10: 8803f22c7e30 R11: 0001 R12:
> 8803ded28450
>  [25222.186471] R13: 0003 R14: dc00 R15:
> 8803ded283c0
>  [25222.186804] FS:  () GS:8803f22c()
> knlGS:
>  [25222.187401] CS:  0010 DS:  ES:  CR0: 80050033
>  [25222.187739] CR2: 561f5bffc128 CR3: 000445a0d003 CR4:
> 001606e0
>  [25222.188077] Call Trace:
>  [25222.188410]  
>  [25222.188740]  ? dev_graft_qdisc+0xfa/0xfa
>  [25222.189072]  call_timer_fn+0x15/0x72
>  [25222.189407]  ? dev_graft_qdisc+0xfa/0xfa
>  [25222.189741]  expire_timers+0x1b9/0x1d5
>  [25222.190072]  run_timer_softirq+0x184/0x361
>  [25222.190400]  ? expire_timers+0x1d5/0x1d5
>  [25222.190723]  ? enqueue_hrtimer+0xce/0xd8
>  [25222.191048]  ? __hrtimer_run_queues+0x1ec/0x24d
>  [25222.191373]  __do_softirq+0x17f/0x34a
>  [25222.191702]  irq_exit+0x8f/0xf9
>  [25222.192034]  smp_apic_timer_interrupt+0xcb/0xd6
>  [25222.192365]  apic_timer_interrupt+0x92/0xa0
>  [25222.192695]  
>  [25222.193023] RIP: 0010:mwait_idle+0x99/0xac
>  [25222.193355] RSP: 0018:8803f030fef8 EFLAGS: 0246 ORIG_RAX:
> ff11
>  [25222.193956] RAX:  RBX: 8803f02e3500 RCX:
> 
>  [25222.194290] RDX: 11007e05c6a0 RSI:  RDI:
> 
>  [25222.194626] RBP: 8803f02e3500 R08: ed007ccc8eef R09:
> 8803e6647728
>  [25222.194958] R10: 8803f030fdd0 R11: 0001 R12:
> 
>  [25222.195292] R13: dc00 R14: 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Guillaume Nault
On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
> It seems even rebuilding seemingly stable version triggering crashes too
> (but different ones)
Different ones? The trace following your message looks very similar to
your first KASAN report. Or are you referring to the lockup you posted
on Sun, 18 Feb 2018?

Also, which stable versions are you referring to?

I'm interested in the ppp_generic.o file that produced the following
trace. Just to be sure that the differences come from the new debugging
options.

> Maybe it is coincidence, and bug reproducer appeared in network same time i
> decided to upgrade kernel,
> as it happened with xt_MSS(and that bug existed for years).
> 
> Deleted quoting, i added more debug options (as much as performance
> degradation allows me).
> This is vanilla again:
> 
> [14834.090421]
> ==
> [14834.091157] BUG: KASAN: use-after-free in __list_add_valid+0x69/0xad
> [14834.091521] Read of size 8 at addr 8803dbeb8660 by task
> accel-pppd/12636
> [14834.091905]
> [14834.092282] CPU: 0 PID: 12636 Comm: accel-pppd Not tainted
> 4.15.4-build-0134 #1
> [14834.092930] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [14834.093320] Call Trace:
> [14834.093680]  dump_stack+0xb3/0x13e
> [14834.094050]  ? _atomic_dec_and_lock+0x10f/0x10f
> [14834.094434]  print_address_description+0x69/0x236
> [14834.094814]  ? __list_add_valid+0x69/0xad
> [14834.095197]  kasan_report+0x219/0x23f
> [14834.095570]  __list_add_valid+0x69/0xad
> [14834.095957]  ppp_ioctl+0x1216/0x2201 [ppp_generic]
> [14834.096348]  ? ppp_write+0x1cc/0x1cc [ppp_generic]
> [14834.096723]  ? get_usage_char.isra.2+0x36/0x36
> [14834.097094]  ? packet_poll+0x362/0x362
> [14834.097455]  ? lock_downgrade+0x4d0/0x4d0
> [14834.097811]  ? rcu_irq_enter_disabled+0x8/0x8
> [14834.098187]  ? get_usage_char.isra.2+0x36/0x36
> [14834.098561]  ? __fget+0x3b8/0x3eb
> [14834.098936]  ? get_usage_char.isra.2+0x36/0x36
> [14834.099309]  ? __fget+0x3a0/0x3eb
> [14834.099682]  ? get_usage_char.isra.2+0x36/0x36
> [14834.100069]  ? __fget+0x3a0/0x3eb
> [14834.100443]  ? lock_downgrade+0x4d0/0x4d0
> [14834.100814]  ? rcu_irq_enter_disabled+0x8/0x8
> [14834.101203]  ? __fget+0x3b8/0x3eb
> [14834.101581]  ? expand_files+0x62f/0x62f
> [14834.101945]  ? kernel_read+0xed/0xed
> [14834.102322]  ? SyS_getpeername+0x28b/0x28b
> [14834.102690]  vfs_ioctl+0x6e/0x81
> [14834.103049]  do_vfs_ioctl+0xe2f/0xe62
> [14834.103413]  ? ioctl_preallocate+0x211/0x211
> [14834.103778]  ? __fget_light+0x28c/0x2ca
> [14834.104150]  ? iterate_fd+0x2a8/0x2a8
> [14834.104526]  ? SyS_rt_sigprocmask+0x12e/0x181
> [14834.104876]  ? sigprocmask+0x23f/0x23f
> [14834.105231]  ? SyS_write+0x148/0x173
> [14834.105580]  ? SyS_read+0x173/0x173
> [14834.105943]  SyS_ioctl+0x39/0x55
> [14834.106316]  ? do_vfs_ioctl+0xe62/0xe62
> [14834.106694]  do_syscall_64+0x262/0x594
> [14834.107076]  ? syscall_return_slowpath+0x351/0x351
> [14834.107447]  ? up_read+0x17/0x2c
> [14834.107806]  ? __do_page_fault+0x68a/0x763
> [14834.108171]  ? entry_SYSCALL_64_after_hwframe+0x36/0x9b
> [14834.108550]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [14834.108937]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.109293] RIP: 0033:0x7fc9be3758a7
> [14834.109652] RSP: 002b:7fc9bf92aaf8 EFLAGS: 0206 ORIG_RAX:
> 0010
> [14834.110313] RAX: ffda RBX: 7fc9bdc5e1e3 RCX:
> 7fc9be3758a7
> [14834.110707] RDX: 7fc9b7ad13e8 RSI: 4004743a RDI:
> 4b9f
> [14834.111082] RBP: 7fc9bf92ab20 R08:  R09:
> 55f07a27fe40
> [14834.111471] R10: 0008 R11: 0206 R12:
> 7fc9b7ad12d8
> [14834.111845] R13: 7ffd06346a6f R14:  R15:
> 7fc9bf92b700
> [14834.112231]
> [14834.112589] Allocated by task 12636:
> [14834.112962]  ppp_register_net_channel+0xc4/0x610 [ppp_generic]
> [14834.113331]  pppoe_connect+0xe6d/0x1097 [pppoe]
> [14834.113691]  SyS_connect+0x19c/0x274
> [14834.114054]  do_syscall_64+0x262/0x594
> [14834.114421]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.114792]
> [14834.115139] Freed by task 12636:
> [14834.115504]  kfree+0xe2/0x154
> [14834.115866]  ppp_release+0x11b/0x12a [ppp_generic]
> [14834.116240]  __fput+0x342/0x5ba
> [14834.116611]  task_work_run+0x15d/0x198
> [14834.116973]  exit_to_usermode_loop+0xc7/0x153
> [14834.117320]  do_syscall_64+0x53d/0x594
> [14834.117694]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.118067]
> [14834.118426] The buggy address belongs to the object at 8803dbeb8480
> [14834.119087] The buggy address is located 480 bytes inside of
> [14834.119755] The buggy address belongs to the page:
> [14834.120138] page:ea000f6fae00 count:1 mapcount:0 mapping:
> (null) index:0x8803dbebd580 compound_mapcount: 0
> [14834.120817] flags: 0x17ffe0008100(slab|head)
> [14834.121171] raw: 17ffe0008100  8803dbebd580
> 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Denys Fedoryshchenko

On 2018-02-21 20:55, Guillaume Nault wrote:
> On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
> > It seems even rebuilding seemingly stable version triggering crashes too
> > (but different ones)
> Different ones? The trace following your message looks very similar to
> your first KASAN report. Or are you referring to the lockup you posted
> on Sun, 18 Feb 2018?
>
> Also, which stable versions are you referring to?
The trace I sent in the previous email is from the latest kernel, vanilla,
just with more debug options and a few options disabled.
One of the disabled options, CONFIG_XFRM, was spitting some errors in
nf_xfrm_me_harder (it is obviously a bug, and I reported it).

I also disabled namespaces, as they are often a source of trouble.

Today I will try to revert just:
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t

I suspect that previously, after reverting these patches, I got a different
kernel panic (I didn't notice it at the time, and now it is too late to tell
it apart from the other crashes); it seems it was not a KASAN report.
I will report results after testing; unfortunately I can't test more
than once per day.
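
For readers following the atomic_t versus refcount_t discussion, the property at stake is whether a counter that has already dropped to zero can be incremented again, which is one way a freed object ends up being used. The following is a minimal user-space sketch of that idea using C11 atomics. The names `struct ref`, `ref_init`, `ref_get_unless_zero` and `ref_put` are hypothetical and are not the kernel's API; the sketch does not claim to reproduce how ppp_generic or the kernel's refcount_t is implemented.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; not the kernel's refcount_t. */
struct ref {
    atomic_int count;
};

/* A new object starts with one reference, held by its creator. */
static void ref_init(struct ref *r)
{
    atomic_init(&r->count, 1);
}

/*
 * Take a reference only while the object is still alive (count > 0).
 * Returns false instead of resurrecting an object whose last
 * reference has already been dropped.
 */
static bool ref_get_unless_zero(struct ref *r)
{
    int old = atomic_load(&r->count);

    while (old > 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Drop a reference. Returns true when the caller dropped the last
 * one and is therefore responsible for freeing the object.
 */
static bool ref_put(struct ref *r)
{
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

With a plain unconditional increment, a late caller would move the count from 0 back to 1 and keep using memory that another path is about to free; the conditional get makes that caller fail instead.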


"Stable" for me was 4.14.2, but it looks like I am getting a different
issue on that kernel now.
I will paste it below.

Another observation: just an hour ago, on another server where I am testing
4.15 and 4.14.20 (it was running 4.14.20 at that moment, without debug
options), I killed accel-pppd (the PPPoE server software) with 8k sessions
online and got some weird behaviour. The accel-pppd process got stuck, as did
ifconfig and "ip link", and even kexec -e didn't work (it got stuck too)
unless I used kexec -e -x (so it won't try to bring interfaces down on kexec).
I will try to reproduce this bug as well, with debugging enabled (lockdep
and so on); I hope it is not related.




> I'm interested in the ppp_generic.o file that produced the following
> trace. Just to be sure that the differences come from the new debugging
> options.

Also kernel config:
https://nuclearcat.com/bughunting/config.txt
https://nuclearcat.com/bughunting/ppp_generic.o

This is on 4.14.2, which was seemingly stable before:

[50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out
[50401.389014] [ cut here ]
[50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 
dev_watchdog+0x15c/0x1b9
[50401.389925] Modules linked in: pppoe pppox ppp_generic slhc
netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre
nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4
xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4
xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 
4.14.2-build-0134 #4
[50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015

[50401.392513] task: 880434d72640 task.stack: c90001914000
[50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9
[50401.393155] RSP: 0018:8804364c3e90 EFLAGS: 00010286
[50401.393470] RAX: 0039 RBX: 88042f6e RCX: 

[50401.393787] RDX: 0001 RSI: 0002 RDI: 
828dbc64
[50401.394103] RBP: 8804364c3eb0 R08: 0001 R09: 

[50401.394420] R10: 0002 R11: 8803fa075c00 R12: 
0001
[50401.394739] R13: 0040 R14: 0003 R15: 
81e05108
[50401.395064] FS:  () GS:8804364c() 
knlGS:

[50401.395645] CS:  0010 DS:  ES:  CR0: 80050033
[50401.395970] CR2: 7fff25fc20a8 CR3: 01e09005 CR4: 
001606e0

[50401.396294] Call Trace:
[50401.396613]  
[50401.396934]  ? qdisc_rcu_free+0x3f/0x3f
[50401.397255]  call_timer_fn.isra.4+0x17/0x7b
[50401.397576]  expire_timers+0x6f/0x7e
[50401.397899]  run_timer_softirq+0x6d/0x8f
[50401.398219]  ? ktime_get+0x3b/0x8c
[50401.398540]  ? lapic_next_event+0x18/0x1c
[50401.398862]  ? clockevents_program_event+0xa3/0xbb
[50401.399186]  __do_softirq+0xbc/0x1ab
[50401.399510]  irq_exit+0x4d/0x8e
[50401.399832]  smp_apic_timer_interrupt+0x73/0x80
[50401.400157]  apic_timer_interrupt+0x8d/0xa0
[50401.400480]  
[50401.400801] RIP: 0010:mwait_idle+0x4e/0x61
[50401.401123] RSP: 0018:c90001917ec0 EFLAGS: 0246 ORIG_RAX: 
ff10
[50401.401714] RAX:  RBX: 880434d72640 RCX: 

[50401.402037] RDX:  RSI:  RDI: 

[50401.402362] RBP: c90001917ec0 R08:  R09: 
0001
[50401.402685] R10: c90001917e58 R11: 037a R12: 


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Cong Wang
On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 17:55, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > Here we go:
>> > >
>> > >   [24558.921549]
>> > > ==
>> > >   [24558.922167] BUG: KASAN: use-after-free in
>> > > ppp_ioctl+0xa6a/0x1522
>> > > [ppp_generic]
>> > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
>> > > accel-pppd/12622
>> > >   [24558.923113]
>> > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > W
>> > > 4.15.3-build-0134 #1
>> > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > BIOS P80
>> > > 04/02/2015
>> > >   [24558.924406] Call Trace:
>> > >   [24558.924753]  dump_stack+0x46/0x59
>> > >   [24558.925103]  print_address_description+0x6b/0x23b
>> > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.925797]  kasan_report+0x21b/0x241
>> > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
>> > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
>> > >   [24558.927523]  ? kernel_read+0xed/0xed
>> > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
>> > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
>> > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
>> > >   [24558.928898]  vfs_ioctl+0x6e/0x81
>> > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
>> > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
>> > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
>> > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
>> > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
>> > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
>> > >   [24558.931252]  SyS_ioctl+0x39/0x55
>> > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
>> > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.932627] RIP: 0033:0x7f302849d8a7
>> > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
>> > > ORIG_RAX:
>> > > 0010
>> > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
>> > > 7f302849d8a7
>> > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
>> > > 3a67
>> > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
>> > > 55c8308d8e40
>> > >   [24558.934607] R10: 0008 R11: 0206 R12:
>> > > 7f3023f49358
>> > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
>> > > 7f3029a53700
>> > >   [24558.935288]
>> > >   [24558.935626] Allocated by task 12622:
>> > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
>> > > [ppp_generic]
>> > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
>> > >   [24558.936640]  SyS_connect+0x14b/0x1b7
>> > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.937655]
>> > >   [24558.937993] Freed by task 12622:
>> > >   [24558.938321]  kfree+0xb0/0x11d
>> > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
>> > >   [24558.938994]  __fput+0x2ba/0x51a
>> > >   [24558.939332]  task_work_run+0x11c/0x13d
>> > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
>> > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
>> > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.947099]
>> >
>> > Your first guess was right. It looks like we have an issue with
>> > reference counting on the channels. Can you send me your ppp_generic.o?
>> http://nuclearcat.com/ppp_generic.o
>> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>>
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.

For me it looks like pch->clist is not removed from the list ppp->channels
when destroyed via ppp_release(). But I don't want to pretend I understand
ppp logic.
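
The hypothesis above (an object freed while still linked on its owner's list) can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: `struct unit`, `struct chan`, `chan_connect` and `chan_destroy` are hypothetical stand-ins for the unit, the channel and their connect/release paths, and the list helpers only imitate the kernel's list_head. It shows the general rule "unlink before free", not the actual ppp_generic code or its eventual fix.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
    n->prev = n->next = n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink a node and leave it pointing at itself. */
static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

static size_t list_len(const struct list_node *head)
{
    size_t len = 0;

    for (const struct list_node *p = head->next; p != head; p = p->next)
        len++;
    return len;
}

/* Hypothetical stand-ins for the unit and its channels. */
struct unit { struct list_node channels; };
struct chan { struct list_node clist; int id; };

static struct chan *chan_create(int id)
{
    struct chan *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    list_init(&c->clist);
    c->id = id;
    return c;
}

static void chan_connect(struct chan *c, struct unit *u)
{
    list_add_tail(&c->clist, &u->channels);
}

/*
 * Unlink first, then free. If free() ran while the node was still
 * linked, the next list insertion on the unit would read and write
 * through pointers into freed memory.
 */
static void chan_destroy(struct chan *c)
{
    list_del_init(&c->clist);
    free(c);
}
```

Skipping the `list_del_init()` call in `chan_destroy()` would leave the unit's list pointing into freed memory, so a later `list_add_tail()` on that list would touch a freed node, which is the kind of access the KASAN report flags in `__list_add_valid`.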


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Denys Fedoryshchenko
It seems that even rebuilding a seemingly stable version triggers crashes too
(but different ones).
Maybe it is a coincidence, and the bug reproducer appeared in the network at
the same time I decided to upgrade the kernel, as happened with xt_MSS (and
that bug had existed for years).

I deleted the quoting and added more debug options (as many as performance
degradation allows).

This is vanilla again:

[14834.090421] 
==

[14834.091157] BUG: KASAN: use-after-free in __list_add_valid+0x69/0xad
[14834.091521] Read of size 8 at addr 8803dbeb8660 by task 
accel-pppd/12636

[14834.091905]
[14834.092282] CPU: 0 PID: 12636 Comm: accel-pppd Not tainted 
4.15.4-build-0134 #1
[14834.092930] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015

[14834.093320] Call Trace:
[14834.093680]  dump_stack+0xb3/0x13e
[14834.094050]  ? _atomic_dec_and_lock+0x10f/0x10f
[14834.094434]  print_address_description+0x69/0x236
[14834.094814]  ? __list_add_valid+0x69/0xad
[14834.095197]  kasan_report+0x219/0x23f
[14834.095570]  __list_add_valid+0x69/0xad
[14834.095957]  ppp_ioctl+0x1216/0x2201 [ppp_generic]
[14834.096348]  ? ppp_write+0x1cc/0x1cc [ppp_generic]
[14834.096723]  ? get_usage_char.isra.2+0x36/0x36
[14834.097094]  ? packet_poll+0x362/0x362
[14834.097455]  ? lock_downgrade+0x4d0/0x4d0
[14834.097811]  ? rcu_irq_enter_disabled+0x8/0x8
[14834.098187]  ? get_usage_char.isra.2+0x36/0x36
[14834.098561]  ? __fget+0x3b8/0x3eb
[14834.098936]  ? get_usage_char.isra.2+0x36/0x36
[14834.099309]  ? __fget+0x3a0/0x3eb
[14834.099682]  ? get_usage_char.isra.2+0x36/0x36
[14834.100069]  ? __fget+0x3a0/0x3eb
[14834.100443]  ? lock_downgrade+0x4d0/0x4d0
[14834.100814]  ? rcu_irq_enter_disabled+0x8/0x8
[14834.101203]  ? __fget+0x3b8/0x3eb
[14834.101581]  ? expand_files+0x62f/0x62f
[14834.101945]  ? kernel_read+0xed/0xed
[14834.102322]  ? SyS_getpeername+0x28b/0x28b
[14834.102690]  vfs_ioctl+0x6e/0x81
[14834.103049]  do_vfs_ioctl+0xe2f/0xe62
[14834.103413]  ? ioctl_preallocate+0x211/0x211
[14834.103778]  ? __fget_light+0x28c/0x2ca
[14834.104150]  ? iterate_fd+0x2a8/0x2a8
[14834.104526]  ? SyS_rt_sigprocmask+0x12e/0x181
[14834.104876]  ? sigprocmask+0x23f/0x23f
[14834.105231]  ? SyS_write+0x148/0x173
[14834.105580]  ? SyS_read+0x173/0x173
[14834.105943]  SyS_ioctl+0x39/0x55
[14834.106316]  ? do_vfs_ioctl+0xe62/0xe62
[14834.106694]  do_syscall_64+0x262/0x594
[14834.107076]  ? syscall_return_slowpath+0x351/0x351
[14834.107447]  ? up_read+0x17/0x2c
[14834.107806]  ? __do_page_fault+0x68a/0x763
[14834.108171]  ? entry_SYSCALL_64_after_hwframe+0x36/0x9b
[14834.108550]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[14834.108937]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.109293] RIP: 0033:0x7fc9be3758a7
[14834.109652] RSP: 002b:7fc9bf92aaf8 EFLAGS: 0206 ORIG_RAX: 
0010
[14834.110313] RAX: ffda RBX: 7fc9bdc5e1e3 RCX: 
7fc9be3758a7
[14834.110707] RDX: 7fc9b7ad13e8 RSI: 4004743a RDI: 
4b9f
[14834.111082] RBP: 7fc9bf92ab20 R08:  R09: 
55f07a27fe40
[14834.111471] R10: 0008 R11: 0206 R12: 
7fc9b7ad12d8
[14834.111845] R13: 7ffd06346a6f R14:  R15: 
7fc9bf92b700

[14834.112231]
[14834.112589] Allocated by task 12636:
[14834.112962]  ppp_register_net_channel+0xc4/0x610 [ppp_generic]
[14834.113331]  pppoe_connect+0xe6d/0x1097 [pppoe]
[14834.113691]  SyS_connect+0x19c/0x274
[14834.114054]  do_syscall_64+0x262/0x594
[14834.114421]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.114792]
[14834.115139] Freed by task 12636:
[14834.115504]  kfree+0xe2/0x154
[14834.115866]  ppp_release+0x11b/0x12a [ppp_generic]
[14834.116240]  __fput+0x342/0x5ba
[14834.116611]  task_work_run+0x15d/0x198
[14834.116973]  exit_to_usermode_loop+0xc7/0x153
[14834.117320]  do_syscall_64+0x53d/0x594
[14834.117694]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
[14834.118067]
[14834.118426] The buggy address belongs to the object at 
8803dbeb8480

[14834.119087] The buggy address is located 480 bytes inside of
[14834.119755] The buggy address belongs to the page:
[14834.120138] page:ea000f6fae00 count:1 mapcount:0 mapping: 
 (null) index:0x8803dbebd580 compound_mapcount: 0

[14834.120817] flags: 0x17ffe0008100(slab|head)
[14834.121171] raw: 17ffe0008100  8803dbebd580 
0001001c001b
[14834.121800] raw: ea000d718020 ea000d32d620 8803f080ee80 


[14834.122415] page dumped because: kasan: bad access detected
[14834.122787]
[14834.123140] Memory state around the buggy address:
[14834.123503]  8803dbeb8500: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[14834.124150]  8803dbeb8580: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[14834.124806] >8803dbeb8600: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb

[14834.125467]^
[14834.125848]  8803dbeb8680: fb fb fb fb fb 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-20 Thread Denys Fedoryshchenko

On 2018-02-16 20:48, Guillaume Nault wrote:

On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:

On 2018-02-15 21:42, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-15 21:31, Guillaume Nault wrote:
> > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > > > Here we go:
> > > > > >
> > > > > >   [24558.921549]
> > > > > > ==
> > > > > >   [24558.922167] BUG: KASAN: use-after-free in
> > > > > > ppp_ioctl+0xa6a/0x1522
> > > > > > [ppp_generic]
> > > > > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by 
task
> > > > > > accel-pppd/12622
> > > > > >   [24558.923113]
> > > > > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > > > W
> > > > > > 4.15.3-build-0134 #1
> > > > > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > > BIOS P80
> > > > > > 04/02/2015
> > > > > >   [24558.924406] Call Trace:
> > > > > >   [24558.924753]  dump_stack+0x46/0x59
> > > > > >   [24558.925103]  print_address_description+0x6b/0x23b
> > > > > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > >   [24558.925797]  kasan_report+0x21b/0x241
> > > > > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > > > > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > > > > >   [24558.927523]  ? kernel_read+0xed/0xed
> > > > > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > > > > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > > > > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > > > > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > > > > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > > > > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > > > > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > > > > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > > > > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > > > > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > > > > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > > > > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > > > > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > > > > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > > > > ORIG_RAX:
> > > > > > 0010
> > > > > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 
RCX:
> > > > > > 7f302849d8a7
> > > > > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a 
RDI:
> > > > > > 3a67
> > > > > >   [24558.934266] RBP: 7f3029a52b20 R08:  
R09:
> > > > > > 55c8308d8e40
> > > > > >   [24558.934607] R10: 0008 R11: 0206 
R12:
> > > > > > 7f3023f49358
> > > > > >   [24558.934947] R13: 7ffe86e5723f R14:  
R15:
> > > > > > 7f3029a53700
> > > > > >   [24558.935288]
> > > > > >   [24558.935626] Allocated by task 12622:
> > > > > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > > > > [ppp_generic]
> > > > > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > > > > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > > > > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > > > > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.937655]
> > > > > >   [24558.937993] Freed by task 12622:
> > > > > >   [24558.938321]  kfree+0xb0/0x11d
> > > > > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > > > > >   [24558.938994]  __fput+0x2ba/0x51a
> > > > > >   [24558.939332]  task_work_run+0x11c/0x13d
> > > > > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > > > > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > > > > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.947099]
> > > > >
> > > > > Your first guess was right. It looks like we have an issue with
> > > > > reference counting on the channels. Can you send me your 
ppp_generic.o?
> > > > http://nuclearcat.com/ppp_generic.o
> > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > > >
> > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > > by two different file descriptors, I can't see how this could defeat
> > > the reference counting mechanism. I'm going to think more about it.
> > >
> > > Can you test with CONFIG_REFCOUNT_FULL? (and keep
> > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> > > atomic_t to refcount_t")).
> > Ok, i will try that tonight. On vanilla kernel or reversing
> 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-18 Thread Denys Fedoryshchenko

On 2018-02-16 20:48, Guillaume Nault wrote:

On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:

On 2018-02-15 21:42, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-15 21:31, Guillaume Nault wrote:
> > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > > > Here we go:
> > > > > >
> > > > > >   [24558.921549]
> > > > > > ==
> > > > > >   [24558.922167] BUG: KASAN: use-after-free in
> > > > > > ppp_ioctl+0xa6a/0x1522
> > > > > > [ppp_generic]
> > > > > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by 
task
> > > > > > accel-pppd/12622
> > > > > >   [24558.923113]
> > > > > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > > > W
> > > > > > 4.15.3-build-0134 #1
> > > > > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > > BIOS P80
> > > > > > 04/02/2015
> > > > > >   [24558.924406] Call Trace:
> > > > > >   [24558.924753]  dump_stack+0x46/0x59
> > > > > >   [24558.925103]  print_address_description+0x6b/0x23b
> > > > > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > >   [24558.925797]  kasan_report+0x21b/0x241
> > > > > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > > > > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > > > > >   [24558.927523]  ? kernel_read+0xed/0xed
> > > > > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > > > > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > > > > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > > > > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > > > > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > > > > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > > > > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > > > > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > > > > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > > > > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > > > > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > > > > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > > > > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > > > > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > > > > ORIG_RAX:
> > > > > > 0010
> > > > > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 
RCX:
> > > > > > 7f302849d8a7
> > > > > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a 
RDI:
> > > > > > 3a67
> > > > > >   [24558.934266] RBP: 7f3029a52b20 R08:  
R09:
> > > > > > 55c8308d8e40
> > > > > >   [24558.934607] R10: 0008 R11: 0206 
R12:
> > > > > > 7f3023f49358
> > > > > >   [24558.934947] R13: 7ffe86e5723f R14:  
R15:
> > > > > > 7f3029a53700
> > > > > >   [24558.935288]
> > > > > >   [24558.935626] Allocated by task 12622:
> > > > > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > > > > [ppp_generic]
> > > > > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > > > > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > > > > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > > > > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.937655]
> > > > > >   [24558.937993] Freed by task 12622:
> > > > > >   [24558.938321]  kfree+0xb0/0x11d
> > > > > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > > > > >   [24558.938994]  __fput+0x2ba/0x51a
> > > > > >   [24558.939332]  task_work_run+0x11c/0x13d
> > > > > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > > > > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > > > > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > >   [24558.947099]
> > > > >
> > > > > Your first guess was right. It looks like we have an issue with
> > > > > reference counting on the channels. Can you send me your 
ppp_generic.o?
> > > > http://nuclearcat.com/ppp_generic.o
> > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > > >
> > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > > by two different file descriptors, I can't see how this could defeat
> > > the reference counting mechanism. I'm going to think more about it.
> > >
> > > Can you test with CONFIG_REFCOUNT_FULL? (and keep
> > > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> > > atomic_t to refcount_t")).
> > Ok, i will try that tonight. On vanilla kernel or reversing
> 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-16 Thread Guillaume Nault
On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 21:42, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-15 21:31, Guillaume Nault wrote:
> > > > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko 
> > > > > > wrote:
> > > > > > > Here we go:
> > > > > > >
> > > > > > >   [24558.921549]
> > > > > > > ==
> > > > > > >   [24558.922167] BUG: KASAN: use-after-free in
> > > > > > > ppp_ioctl+0xa6a/0x1522
> > > > > > > [ppp_generic]
> > > > > > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by 
> > > > > > > task
> > > > > > > accel-pppd/12622
> > > > > > >   [24558.923113]
> > > > > > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: 
> > > > > > > G
> > > > > > > W
> > > > > > > 4.15.3-build-0134 #1
> > > > > > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > > > BIOS P80
> > > > > > > 04/02/2015
> > > > > > >   [24558.924406] Call Trace:
> > > > > > >   [24558.924753]  dump_stack+0x46/0x59
> > > > > > >   [24558.925103]  print_address_description+0x6b/0x23b
> > > > > > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > > >   [24558.925797]  kasan_report+0x21b/0x241
> > > > > > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > > > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > > > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > > > > > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > > > > > >   [24558.927523]  ? kernel_read+0xed/0xed
> > > > > > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > > > > > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > > > > > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > > > > > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > > > > > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > > > > > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > > > > > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > > > > > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > > > > > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > > > > > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > > > > > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > > > > > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > > > > > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > > > > > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > > > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > > > > > ORIG_RAX:
> > > > > > > 0010
> > > > > > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 
> > > > > > > RCX:
> > > > > > > 7f302849d8a7
> > > > > > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a 
> > > > > > > RDI:
> > > > > > > 3a67
> > > > > > >   [24558.934266] RBP: 7f3029a52b20 R08:  
> > > > > > > R09:
> > > > > > > 55c8308d8e40
> > > > > > >   [24558.934607] R10: 0008 R11: 0206 
> > > > > > > R12:
> > > > > > > 7f3023f49358
> > > > > > >   [24558.934947] R13: 7ffe86e5723f R14:  
> > > > > > > R15:
> > > > > > > 7f3029a53700
> > > > > > >   [24558.935288]
> > > > > > >   [24558.935626] Allocated by task 12622:
> > > > > > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > > > > > [ppp_generic]
> > > > > > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > > > > > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > > > > > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > > > > > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > >   [24558.937655]
> > > > > > >   [24558.937993] Freed by task 12622:
> > > > > > >   [24558.938321]  kfree+0xb0/0x11d
> > > > > > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > > > > > >   [24558.938994]  __fput+0x2ba/0x51a
> > > > > > >   [24558.939332]  task_work_run+0x11c/0x13d
> > > > > > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > > > > > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > > > > > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > > > >   [24558.947099]
> > > > > >
> > > > > > Your first guess was right. It looks like we have an issue with
> > > > > > reference counting on the channels. Can you send me your 
> > > > > > ppp_generic.o?
> > > > > http://nuclearcat.com/ppp_generic.o
> > > > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > > > >
> > > > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > > > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > > > by two different file descriptors, I can't see how this could defeat
> > > > the reference counting mechanism. I'm going 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-16 Thread Denys Fedoryshchenko

On 2018-02-15 21:42, Guillaume Nault wrote:

On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:

On 2018-02-15 21:31, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > Here we go:
> > > >
> > > >   [24558.921549]
> > > > ==
> > > >   [24558.922167] BUG: KASAN: use-after-free in
> > > > ppp_ioctl+0xa6a/0x1522
> > > > [ppp_generic]
> > > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> > > > accel-pppd/12622
> > > >   [24558.923113]
> > > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > W
> > > > 4.15.3-build-0134 #1
> > > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > BIOS P80
> > > > 04/02/2015
> > > >   [24558.924406] Call Trace:
> > > >   [24558.924753]  dump_stack+0x46/0x59
> > > >   [24558.925103]  print_address_description+0x6b/0x23b
> > > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > >   [24558.925797]  kasan_report+0x21b/0x241
> > > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > > >   [24558.927523]  ? kernel_read+0xed/0xed
> > > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > > ORIG_RAX:
> > > > 0010
> > > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> > > > 7f302849d8a7
> > > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> > > > 3a67
> > > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> > > > 55c8308d8e40
> > > >   [24558.934607] R10: 0008 R11: 0206 R12:
> > > > 7f3023f49358
> > > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> > > > 7f3029a53700
> > > >   [24558.935288]
> > > >   [24558.935626] Allocated by task 12622:
> > > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > > [ppp_generic]
> > > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > >   [24558.937655]
> > > >   [24558.937993] Freed by task 12622:
> > > >   [24558.938321]  kfree+0xb0/0x11d
> > > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > > >   [24558.938994]  __fput+0x2ba/0x51a
> > > >   [24558.939332]  task_work_run+0x11c/0x13d
> > > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > >   [24558.947099]
> > >
> > > Your first guess was right. It looks like we have an issue with
> > > reference counting on the channels. Can you send me your ppp_generic.o?
> > http://nuclearcat.com/ppp_generic.o
> > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> >
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.
>
> Can you test with CONFIG_REFCOUNT_FULL? (and keep
> d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> atomic_t to refcount_t")).
Ok, i will try that tonight. On vanilla kernel or reversing mentioned in
previous email patch?

On vanilla kernel. The other is really a shot in the dark.


As far as I can see, only KASAN triggered again (and the server rebooted
shortly after that), but nothing else:


[ 1848.527234] ==
[ 1848.527863] BUG: KASAN: use-after-free in ppp_ioctl+0xa68/0x14e7 [ppp_generic]
[ 1848.528468] Write of size 8 at addr 880354d3fa38 by task accel-pppd/12626
[ 1848.528807]
[ 

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Guillaume Nault
On Thu, Feb 15, 2018 at 09:34:42PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 21:31, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-15 17:55, Guillaume Nault wrote:
> > > > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > > > Here we go:
> > > > >
> > > > >   [24558.921549]
> > > > > ==
> > > > >   [24558.922167] BUG: KASAN: use-after-free in
> > > > > ppp_ioctl+0xa6a/0x1522
> > > > > [ppp_generic]
> > > > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> > > > > accel-pppd/12622
> > > > >   [24558.923113]
> > > > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > > > W
> > > > > 4.15.3-build-0134 #1
> > > > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > > > BIOS P80
> > > > > 04/02/2015
> > > > >   [24558.924406] Call Trace:
> > > > >   [24558.924753]  dump_stack+0x46/0x59
> > > > >   [24558.925103]  print_address_description+0x6b/0x23b
> > > > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > >   [24558.925797]  kasan_report+0x21b/0x241
> > > > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > > > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > > > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > > > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > > > >   [24558.927523]  ? kernel_read+0xed/0xed
> > > > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > > > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > > > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > > > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > > > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > > > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > > > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > > > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > > > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > > > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > > > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > > > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > > > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > > > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > > > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > > > ORIG_RAX:
> > > > > 0010
> > > > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> > > > > 7f302849d8a7
> > > > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> > > > > 3a67
> > > > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> > > > > 55c8308d8e40
> > > > >   [24558.934607] R10: 0008 R11: 0206 R12:
> > > > > 7f3023f49358
> > > > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> > > > > 7f3029a53700
> > > > >   [24558.935288]
> > > > >   [24558.935626] Allocated by task 12622:
> > > > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > > > [ppp_generic]
> > > > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > > > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > > > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > > > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > >   [24558.937655]
> > > > >   [24558.937993] Freed by task 12622:
> > > > >   [24558.938321]  kfree+0xb0/0x11d
> > > > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > > > >   [24558.938994]  __fput+0x2ba/0x51a
> > > > >   [24558.939332]  task_work_run+0x11c/0x13d
> > > > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > > > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > > > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > > > >   [24558.947099]
> > > >
> > > > Your first guess was right. It looks like we have an issue with
> > > > reference counting on the channels. Can you send me your ppp_generic.o?
> > > http://nuclearcat.com/ppp_generic.o
> > > Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> > > 
> > From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> > concurrently on the same ppp_file. Even if this ppp_file was pointed at
> > by two different file descriptors, I can't see how this could defeat
> > the reference counting mechanism. I'm going to think more about it.
> > 
> > Can you test with CONFIG_REFCOUNT_FULL? (and keep
> > d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
> > atomic_t to refcount_t")).
> Ok, i will try that tonight. On vanilla kernel or reversing mentioned in
> previous email patch?
On vanilla kernel. The other is really a shot in the dark.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Denys Fedoryshchenko

On 2018-02-15 21:31, Guillaume Nault wrote:

On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:

On 2018-02-15 17:55, Guillaume Nault wrote:
> On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > Here we go:
> >
> >   [24558.921549]
> > ==
> >   [24558.922167] BUG: KASAN: use-after-free in
> > ppp_ioctl+0xa6a/0x1522
> > [ppp_generic]
> >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> > accel-pppd/12622
> >   [24558.923113]
> >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > W
> > 4.15.3-build-0134 #1
> >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > BIOS P80
> > 04/02/2015
> >   [24558.924406] Call Trace:
> >   [24558.924753]  dump_stack+0x46/0x59
> >   [24558.925103]  print_address_description+0x6b/0x23b
> >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >   [24558.925797]  kasan_report+0x21b/0x241
> >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> >   [24558.927523]  ? kernel_read+0xed/0xed
> >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> >   [24558.928898]  vfs_ioctl+0x6e/0x81
> >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> >   [24558.931252]  SyS_ioctl+0x39/0x55
> >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >   [24558.932627] RIP: 0033:0x7f302849d8a7
> >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > ORIG_RAX:
> > 0010
> >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> > 7f302849d8a7
> >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> > 3a67
> >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> > 55c8308d8e40
> >   [24558.934607] R10: 0008 R11: 0206 R12:
> > 7f3023f49358
> >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> > 7f3029a53700
> >   [24558.935288]
> >   [24558.935626] Allocated by task 12622:
> >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > [ppp_generic]
> >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> >   [24558.936640]  SyS_connect+0x14b/0x1b7
> >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >   [24558.937655]
> >   [24558.937993] Freed by task 12622:
> >   [24558.938321]  kfree+0xb0/0x11d
> >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> >   [24558.938994]  __fput+0x2ba/0x51a
> >   [24558.939332]  task_work_run+0x11c/0x13d
> >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >   [24558.947099]
>
> Your first guess was right. It looks like we have an issue with
> reference counting on the channels. Can you send me your ppp_generic.o?
http://nuclearcat.com/ppp_generic.o
Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)


From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
concurrently on the same ppp_file. Even if this ppp_file was pointed at
by two different file descriptors, I can't see how this could defeat
the reference counting mechanism. I'm going to think more about it.

Can you test with CONFIG_REFCOUNT_FULL? (and keep
d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from
atomic_t to refcount_t")).
Ok, i will try that tonight. On vanilla kernel or reversing mentioned in 
previous email patch?


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Guillaume Nault
On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-15 17:55, Guillaume Nault wrote:
> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> > > Here we go:
> > > 
> > >   [24558.921549]
> > > ==
> > >   [24558.922167] BUG: KASAN: use-after-free in
> > > ppp_ioctl+0xa6a/0x1522
> > > [ppp_generic]
> > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> > > accel-pppd/12622
> > >   [24558.923113]
> > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
> > > W
> > > 4.15.3-build-0134 #1
> > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
> > > BIOS P80
> > > 04/02/2015
> > >   [24558.924406] Call Trace:
> > >   [24558.924753]  dump_stack+0x46/0x59
> > >   [24558.925103]  print_address_description+0x6b/0x23b
> > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > >   [24558.925797]  kasan_report+0x21b/0x241
> > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
> > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
> > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
> > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
> > >   [24558.927523]  ? kernel_read+0xed/0xed
> > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
> > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
> > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
> > >   [24558.928898]  vfs_ioctl+0x6e/0x81
> > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
> > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
> > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
> > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
> > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
> > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
> > >   [24558.931252]  SyS_ioctl+0x39/0x55
> > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
> > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
> > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > >   [24558.932627] RIP: 0033:0x7f302849d8a7
> > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
> > > ORIG_RAX:
> > > 0010
> > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> > > 7f302849d8a7
> > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> > > 3a67
> > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> > > 55c8308d8e40
> > >   [24558.934607] R10: 0008 R11: 0206 R12:
> > > 7f3023f49358
> > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> > > 7f3029a53700
> > >   [24558.935288]
> > >   [24558.935626] Allocated by task 12622:
> > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
> > > [ppp_generic]
> > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
> > >   [24558.936640]  SyS_connect+0x14b/0x1b7
> > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
> > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > >   [24558.937655]
> > >   [24558.937993] Freed by task 12622:
> > >   [24558.938321]  kfree+0xb0/0x11d
> > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
> > >   [24558.938994]  __fput+0x2ba/0x51a
> > >   [24558.939332]  task_work_run+0x11c/0x13d
> > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
> > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
> > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> > >   [24558.947099]
> > 
> > Your first guess was right. It looks like we have an issue with
> > reference counting on the channels. Can you send me your ppp_generic.o?
> http://nuclearcat.com/ppp_generic.o
> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
> 
From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
concurrently on the same ppp_file. Even if this ppp_file was pointed at
by two different file descriptors, I can't see how this could defeat
the reference counting mechanism. I'm going to think more about it.

Can you test with CONFIG_REFCOUNT_FULL? (and keep
d780cd44e3ce ("drivers, net, ppp: convert ppp_file.refcnt from atomic_t to 
refcount_t")).
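
For illustration, here is a minimal userspace sketch of the kind of
reference counting discussed above. It is not the kernel's refcount_t
implementation: struct obj, obj_new(), obj_get_not_zero() and
obj_put_and_test() are hypothetical names, and C11 atomics stand in for
the kernel primitives. It only shows why a "get" must fail once the last
reference has been dropped, which is the class of mistake that stricter
refcount checking is meant to expose.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object guarded by a reference count. */
struct obj {
	atomic_uint refcnt;
	int payload;
};

/* Allocate an object that starts with one reference held by the caller. */
static struct obj *obj_new(int payload)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	atomic_init(&o->refcnt, 1);
	o->payload = payload;
	return o;
}

/*
 * Take a reference only if the count is still non-zero. Returns false
 * when the last reference is already gone, in which case the caller
 * must not use the object.
 */
static bool obj_get_not_zero(struct obj *o)
{
	unsigned int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if it was the last one. */
static bool obj_put_and_test(struct obj *o)
{
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

In this sketch the caller for which obj_put_and_test() returns true is
the one that frees the object. A path that still holds a raw pointer
without its own reference can then write to freed memory, which is what
a use-after-free report looks like from the outside.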


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Guillaume Nault
On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:47, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > > Hi,
> > > > >
> > > > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > > > hours,
> > > > > cannot do bisect, as it is production server).
> > > > >
> > > > > dev ppp # gdb ppp_generic.o
> > > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > > <>
> > > > > Reading symbols from ppp_generic.o...done.
> > > > > (gdb) list *ppp_push+0x73
> > > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > > 1658  list = list->next;
> > > > > 1659  pch = list_entry(list, struct channel, clist);
> > > > > 1660
> > > > > 1661  spin_lock(>downl);
> > > > > 1662  if (pch->chan) {
> > > > > 1663  if 
> > > > > (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > > 1664  ppp->xmit_pending = NULL;
> > > > > 1665  } else {
> > > > > 1666  /* channel got unregistered */
> > > > > 1667  kfree_skb(skb);
> > > > >
> > > > >
> > > > I expect a memory corruption. Do you have the possibility to run with
> > > > KASAN by any chance?
> > > I will try to enable it tonight. For now i reverted "drivers, net,
> > > ppp:
> > > convert ppp_file.refcnt from atomic_t to refcount_t" for test.
> > > 
> > This commit looks good to me. Do you have doubts about it because it's
> > new in 4.15? Does it mean that your last known-good kernel is 4.14?
> 
> I am just doing "manual" bisect, checking all possibilities, and picking
> patch to revert randomly.
> Yes, correct, my known-good is 4.14.2.
> 
Then maybe try reverting commit 0171c4183559 ("ppp: unlock all_ppp_mutex before 
registering device").
I can't see how it could lead to the bug you observed, but the other
ppp_generic patches introduced since 4.14 were rather trivial.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Denys Fedoryshchenko

On 2018-02-15 17:55, Guillaume Nault wrote:

On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:

Here we go:

  [24558.921549] ==
  [24558.922167] BUG: KASAN: use-after-free in ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task accel-pppd/12622
  [24558.923113]
  [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G        W       4.15.3-build-0134 #1
  [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
  [24558.924406] Call Trace:
  [24558.924753]  dump_stack+0x46/0x59
  [24558.925103]  print_address_description+0x6b/0x23b
  [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.925797]  kasan_report+0x21b/0x241
  [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
  [24558.926829]  ? sock_sendmsg+0x89/0x99
  [24558.927176]  ? __vfs_write+0xd9/0x4ad
  [24558.927523]  ? kernel_read+0xed/0xed
  [24558.927872]  ? SyS_getpeername+0x18c/0x18c
  [24558.928213]  ? bit_waitqueue+0x2a/0x2a
  [24558.928561]  ? wake_atomic_t_function+0x115/0x115
  [24558.928898]  vfs_ioctl+0x6e/0x81
  [24558.929228]  do_vfs_ioctl+0xa00/0xb10
  [24558.929571]  ? sigprocmask+0x1a6/0x1d0
  [24558.929907]  ? sigsuspend+0x13e/0x13e
  [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
  [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
  [24558.930904]  ? sigprocmask+0x1d0/0x1d0
  [24558.931252]  SyS_ioctl+0x39/0x55
  [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
  [24558.931942]  do_syscall_64+0x1b1/0x31f
  [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  [24558.932627] RIP: 0033:0x7f302849d8a7
  [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206 ORIG_RAX: 0010
  [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX: 7f302849d8a7
  [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI: 3a67
  [24558.934266] RBP: 7f3029a52b20 R08:  R09: 55c8308d8e40
  [24558.934607] R10: 0008 R11: 0206 R12: 7f3023f49358
  [24558.934947] R13: 7ffe86e5723f R14:  R15: 7f3029a53700
  [24558.935288]
  [24558.935626] Allocated by task 12622:
  [24558.935972]  ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
  [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
  [24558.936640]  SyS_connect+0x14b/0x1b7
  [24558.936975]  do_syscall_64+0x1b1/0x31f
  [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  [24558.937655]
  [24558.937993] Freed by task 12622:
  [24558.938321]  kfree+0xb0/0x11d
  [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
  [24558.938994]  __fput+0x2ba/0x51a
  [24558.939332]  task_work_run+0x11c/0x13d
  [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
  [24558.940022]  do_syscall_64+0x2ea/0x31f
  [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  [24558.947099]


Your first guess was right. It looks like we have an issue with
reference counting on the channels. Can you send me your ppp_generic.o?

http://nuclearcat.com/ppp_generic.o
Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Guillaume Nault
On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
> Here we go:
> 
>   [24558.921549]
> ==
>   [24558.922167] BUG: KASAN: use-after-free in ppp_ioctl+0xa6a/0x1522
> [ppp_generic]
>   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
> accel-pppd/12622
>   [24558.923113]
>   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: GW
> 4.15.3-build-0134 #1
>   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
>   [24558.924406] Call Trace:
>   [24558.924753]  dump_stack+0x46/0x59
>   [24558.925103]  print_address_description+0x6b/0x23b
>   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>   [24558.925797]  kasan_report+0x21b/0x241
>   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>   [24558.926829]  ? sock_sendmsg+0x89/0x99
>   [24558.927176]  ? __vfs_write+0xd9/0x4ad
>   [24558.927523]  ? kernel_read+0xed/0xed
>   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
>   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
>   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
>   [24558.928898]  vfs_ioctl+0x6e/0x81
>   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
>   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
>   [24558.929907]  ? sigsuspend+0x13e/0x13e
>   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
>   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
>   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
>   [24558.931252]  SyS_ioctl+0x39/0x55
>   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
>   [24558.931942]  do_syscall_64+0x1b1/0x31f
>   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>   [24558.932627] RIP: 0033:0x7f302849d8a7
>   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206 ORIG_RAX:
> 0010
>   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
> 7f302849d8a7
>   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
> 3a67
>   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
> 55c8308d8e40
>   [24558.934607] R10: 0008 R11: 0206 R12:
> 7f3023f49358
>   [24558.934947] R13: 7ffe86e5723f R14:  R15:
> 7f3029a53700
>   [24558.935288]
>   [24558.935626] Allocated by task 12622:
>   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
>   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
>   [24558.936640]  SyS_connect+0x14b/0x1b7
>   [24558.936975]  do_syscall_64+0x1b1/0x31f
>   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>   [24558.937655]
>   [24558.937993] Freed by task 12622:
>   [24558.938321]  kfree+0xb0/0x11d
>   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
>   [24558.938994]  __fput+0x2ba/0x51a
>   [24558.939332]  task_work_run+0x11c/0x13d
>   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
>   [24558.940022]  do_syscall_64+0x2ea/0x31f
>   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>   [24558.947099]

Your first guess was right. It looks like we have an issue with
reference counting on the channels. Can you send me your ppp_generic.o?


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-15 Thread Denys Fedoryshchenko

On 2018-02-14 19:25, Guillaume Nault wrote:

On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:

On 2018-02-14 18:47, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > Hi,
> > > >
> > > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > > hours,
> > > > cannot do bisect, as it is production server).
> > > >
> > > > dev ppp # gdb ppp_generic.o
> > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > <>
> > > > Reading symbols from ppp_generic.o...done.
> > > > (gdb) list *ppp_push+0x73
> > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > 1658      list = list->next;
> > > > 1659      pch = list_entry(list, struct channel, clist);
> > > > 1660
> > > > 1661      spin_lock(&pch->downl);
> > > > 1662      if (pch->chan) {
> > > > 1663          if (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > 1664              ppp->xmit_pending = NULL;
> > > > 1665      } else {
> > > > 1666          /* channel got unregistered */
> > > > 1667          kfree_skb(skb);
> > > >
> > > >
> > > I expect a memory corruption. Do you have the possibility to run with
> > > KASAN by any chance?
> > I will try to enable it tonight. For now i reverted "drivers, net,
> > ppp:
> > convert ppp_file.refcnt from atomic_t to refcount_t" for test.
> >
> This commit looks good to me. Do you have doubts about it because it's
> new in 4.15? Does it mean that your last known-good kernel is 4.14?

I am just doing "manual" bisect, checking all possibilities, and picking
patch to revert randomly.


Must be a painful process. Are all of your networking modules required?
With luck, you might be able to isolate a faulty module in fewer steps.


Yes, correct, my known-good is 4.14.2.


Good to know.

Let me know if you can get a KASAN trace.

Here we go:

  [24558.921549] ==
  [24558.922167] BUG: KASAN: use-after-free in ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task accel-pppd/12622
  [24558.923113]
  [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G        W       4.15.3-build-0134 #1
  [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015

  [24558.924406] Call Trace:
  [24558.924753]  dump_stack+0x46/0x59
  [24558.925103]  print_address_description+0x6b/0x23b
  [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.925797]  kasan_report+0x21b/0x241
  [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
  [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
  [24558.926829]  ? sock_sendmsg+0x89/0x99
  [24558.927176]  ? __vfs_write+0xd9/0x4ad
  [24558.927523]  ? kernel_read+0xed/0xed
  [24558.927872]  ? SyS_getpeername+0x18c/0x18c
  [24558.928213]  ? bit_waitqueue+0x2a/0x2a
  [24558.928561]  ? wake_atomic_t_function+0x115/0x115
  [24558.928898]  vfs_ioctl+0x6e/0x81
  [24558.929228]  do_vfs_ioctl+0xa00/0xb10
  [24558.929571]  ? sigprocmask+0x1a6/0x1d0
  [24558.929907]  ? sigsuspend+0x13e/0x13e
  [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
  [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
  [24558.930904]  ? sigprocmask+0x1d0/0x1d0
  [24558.931252]  SyS_ioctl+0x39/0x55
  [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
  [24558.931942]  do_syscall_64+0x1b1/0x31f
  [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  [24558.932627] RIP: 0033:0x7f302849d8a7
  [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206 ORIG_RAX: 0010
  [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX: 7f302849d8a7
  [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI: 3a67
  [24558.934266] RBP: 7f3029a52b20 R08:  R09: 55c8308d8e40
  [24558.934607] R10: 0008 R11: 0206 R12: 7f3023f49358
  [24558.934947] R13: 7ffe86e5723f R14:  R15: 7f3029a53700

  [24558.935288]
  [24558.935626] Allocated by task 12622:
  [24558.935972]  ppp_register_net_channel+0x5f/0x5c6 [ppp_generic]
  [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
  [24558.936640]  SyS_connect+0x14b/0x1b7
  [24558.936975]  do_syscall_64+0x1b1/0x31f
  [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  [24558.937655]
  [24558.937993] Freed by task 12622:
  [24558.938321]  kfree+0xb0/0x11d
  [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
  [24558.938994]  __fput+0x2ba/0x51a
  [24558.939332]  task_work_run+0x11c/0x13d
  [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
  [24558.940022]  do_syscall_64+0x2ea/0x31f
  [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
  

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Guillaume Nault
On Wed, Feb 14, 2018 at 06:49:19PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:47, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > > Hi,
> > > > >
> > > > > Upgraded kernel to 4.15.3, still it crashes after while (several
> > > > > hours,
> > > > > cannot do bisect, as it is production server).
> > > > >
> > > > > dev ppp # gdb ppp_generic.o
> > > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > > <>
> > > > > Reading symbols from ppp_generic.o...done.
> > > > > (gdb) list *ppp_push+0x73
> > > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > > 1658  list = list->next;
> > > > > 1659  pch = list_entry(list, struct channel, clist);
> > > > > 1660
> > > > > 1661  spin_lock(>downl);
> > > > > 1662  if (pch->chan) {
> > > > > 1663  if 
> > > > > (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > > 1664  ppp->xmit_pending = NULL;
> > > > > 1665  } else {
> > > > > 1666  /* channel got unregistered */
> > > > > 1667  kfree_skb(skb);
> > > > >
> > > > >
> > > > I expect a memory corruption. Do you have the possibility to run with
> > > > KASAN by any chance?
> > > I will try to enable it tonight. For now I reverted "drivers, net, ppp:
> > > convert ppp_file.refcnt from atomic_t to refcount_t" as a test.
> > > 
> > This commit looks good to me. Do you have doubts about it because it's
> > new in 4.15? Does it mean that your last known-good kernel is 4.14?
> 
> I am just doing a "manual" bisect, checking all possibilities and picking
> patches to revert at random.
> 
Must be a painful process. Are all of your networking modules required?
With luck, you might be able to isolate a faulty module in fewer steps.

> Yes, correct, my known-good is 4.14.2.
> 
Good to know.

Let me know if you can get a KASAN trace.



Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Denys Fedoryshchenko

On 2018-02-14 18:47, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> > On 2018-02-14 18:07, Guillaume Nault wrote:
> > > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > > Hi,
> > > >
> > > > Upgraded kernel to 4.15.3, but it still crashes after a while
> > > > (several hours; cannot bisect, as it is a production server).
> > > >
> > > > dev ppp # gdb ppp_generic.o
> > > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > > <>
> > > > Reading symbols from ppp_generic.o...done.
> > > > (gdb) list *ppp_push+0x73
> > > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > > 1658          list = list->next;
> > > > 1659          pch = list_entry(list, struct channel, clist);
> > > > 1660
> > > > 1661          spin_lock(&pch->downl);
> > > > 1662          if (pch->chan) {
> > > > 1663                  if (pch->chan->ops->start_xmit(pch->chan, skb))
> > > > 1664                          ppp->xmit_pending = NULL;
> > > > 1665          } else {
> > > > 1666                  /* channel got unregistered */
> > > > 1667                  kfree_skb(skb);
> > > >
> > > >
> > > I expect a memory corruption. Do you have the possibility to run with
> > > KASAN by any chance?
> > I will try to enable it tonight. For now I reverted "drivers, net, ppp:
> > convert ppp_file.refcnt from atomic_t to refcount_t" as a test.
> >
> This commit looks good to me. Do you have doubts about it because it's
> new in 4.15? Does it mean that your last known-good kernel is 4.14?


I am just doing a "manual" bisect, checking all possibilities and picking
patches to revert at random.

Yes, correct, my known-good is 4.14.2.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Guillaume Nault
On Wed, Feb 14, 2018 at 06:29:34PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-14 18:07, Guillaume Nault wrote:
> > On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > > Hi,
> > > 
> > > > Upgraded kernel to 4.15.3, but it still crashes after a while
> > > > (several hours; cannot bisect, as it is a production server).
> > > 
> > > dev ppp # gdb ppp_generic.o
> > > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > > <>
> > > Reading symbols from ppp_generic.o...done.
> > > (gdb) list *ppp_push+0x73
> > > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > > 1658  list = list->next;
> > > 1659  pch = list_entry(list, struct channel, clist);
> > > 1660
> > > 1661  spin_lock(>downl);
> > > 1662  if (pch->chan) {
> > > 1663  if 
> > > (pch->chan->ops->start_xmit(pch->chan, skb))
> > > 1664  ppp->xmit_pending = NULL;
> > > 1665  } else {
> > > 1666  /* channel got unregistered */
> > > 1667  kfree_skb(skb);
> > > 
> > > 
> > I expect a memory corruption. Do you have the possibility to run with
> > KASAN by any chance?
> I will try to enable it tonight. For now I reverted "drivers, net, ppp:
> convert ppp_file.refcnt from atomic_t to refcount_t" as a test.
> 
This commit looks good to me. Do you have doubts about it because it's
new in 4.15? Does it mean that your last known-good kernel is 4.14?
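
For readers who have not looked at the commit under discussion: it concerns how ppp_file.refcnt is counted, and the lifetime rule such a counter is meant to enforce can be modelled in userspace. What follows is a minimal C11 sketch, not the kernel's refcount_t implementation; `model_file`, `model_ref_get_not_zero` and `model_ref_put_and_test` are hypothetical names chosen to echo the kernel helpers refcount_inc_not_zero() and refcount_dec_and_test(). It only illustrates one property: once the count has dropped to zero, no new reference can be taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an object with a reference count. */
struct model_file {
    atomic_uint refcnt;
};

/* Take a reference only if the object is still alive (count != 0). */
bool model_ref_get_not_zero(struct model_file *f)
{
    unsigned int old = atomic_load(&f->refcnt);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&f->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool model_ref_put_and_test(struct model_file *f)
{
    unsigned int old = atomic_fetch_sub(&f->refcnt, 1);

    assert(old != 0); /* dropping below zero would be a bug in the caller */
    return old == 1;
}
```

In this model the caller that sees `model_ref_put_and_test()` return true is the only one allowed to free the object, and any later `model_ref_get_not_zero()` fails instead of reviving it.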


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Denys Fedoryshchenko

On 2018-02-14 18:07, Guillaume Nault wrote:
> On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> > Hi,
> >
> > Upgraded kernel to 4.15.3, but it still crashes after a while
> > (several hours; cannot bisect, as it is a production server).
> >
> > dev ppp # gdb ppp_generic.o
> > GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> > <>
> > Reading symbols from ppp_generic.o...done.
> > (gdb) list *ppp_push+0x73
> > 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> > 1658          list = list->next;
> > 1659          pch = list_entry(list, struct channel, clist);
> > 1660
> > 1661          spin_lock(&pch->downl);
> > 1662          if (pch->chan) {
> > 1663                  if (pch->chan->ops->start_xmit(pch->chan, skb))
> > 1664                          ppp->xmit_pending = NULL;
> > 1665          } else {
> > 1666                  /* channel got unregistered */
> > 1667                  kfree_skb(skb);
> >
> >
> I expect a memory corruption. Do you have the possibility to run with
> KASAN by any chance?

I will try to enable it tonight. For now I reverted "drivers, net, ppp:
convert ppp_file.refcnt from atomic_t to refcount_t" as a test.


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Guillaume Nault
On Wed, Feb 14, 2018 at 03:17:23PM +0200, Denys Fedoryshchenko wrote:
> Hi,
> 
> Upgraded kernel to 4.15.3, but it still crashes after a while
> (several hours; cannot bisect, as it is a production server).
> 
> dev ppp # gdb ppp_generic.o
> GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
> <>
> Reading symbols from ppp_generic.o...done.
> (gdb) list *ppp_push+0x73
> 0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
> 1658          list = list->next;
> 1659          pch = list_entry(list, struct channel, clist);
> 1660
> 1661          spin_lock(&pch->downl);
> 1662          if (pch->chan) {
> 1663                  if (pch->chan->ops->start_xmit(pch->chan, skb))
> 1664                          ppp->xmit_pending = NULL;
> 1665          } else {
> 1666                  /* channel got unregistered */
> 1667                  kfree_skb(skb);
> 
> 
I expect a memory corruption. Do you have the possibility to run with
KASAN by any chance?


ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-14 Thread Denys Fedoryshchenko

Hi,

Upgraded kernel to 4.15.3, but it still crashes after a while
(several hours; cannot bisect, as it is a production server).


dev ppp # gdb ppp_generic.o
GNU gdb (Gentoo 7.12.1 vanilla) 7.12.1
<>
Reading symbols from ppp_generic.o...done.
(gdb) list *ppp_push+0x73
0x681 is in ppp_push (drivers/net/ppp/ppp_generic.c:1663).
1658          list = list->next;
1659          pch = list_entry(list, struct channel, clist);
1660
1661          spin_lock(&pch->downl);
1662          if (pch->chan) {
1663                  if (pch->chan->ops->start_xmit(pch->chan, skb))
1664                          ppp->xmit_pending = NULL;
1665          } else {
1666                  /* channel got unregistered */
1667                  kfree_skb(skb);
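
The listing above is the single-channel path of ppp_push(): the channel's downl lock is taken, and the frame is handed to the channel driver only if pch->chan is still set; otherwise the frame is dropped. A minimal userspace sketch of that check-under-lock pattern follows. It is a model under stated assumptions, not kernel code: a C11 atomic_flag spin loop stands in for the kernel spinlock, a string stands in for the skb, and every `model_*` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct model_chan_ops {
    /* Returns nonzero if the frame was accepted, 0 if not. */
    int (*start_xmit)(void *priv, const char *frame);
};

struct model_chan {
    const struct model_chan_ops *ops;
    void *priv;
};

struct model_channel {
    atomic_flag downl;        /* plays the role of pch->downl */
    struct model_chan *chan;  /* NULL once the channel is unregistered */
};

enum push_result { PUSH_SENT, PUSH_BUSY, PUSH_DROPPED };

void model_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the flag is released */
}

void model_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Mirrors the shape of the listing: test pch->chan only under the lock. */
enum push_result model_push(struct model_channel *pch, const char *frame)
{
    enum push_result res;

    assert(pch != NULL);
    model_lock(&pch->downl);
    if (pch->chan)
        res = pch->chan->ops->start_xmit(pch->chan->priv, frame)
                  ? PUSH_SENT : PUSH_BUSY;
    else
        res = PUSH_DROPPED; /* channel got unregistered */
    model_unlock(&pch->downl);
    return res;
}

/* Unregistering clears the pointer under the same lock. */
void model_unregister(struct model_channel *pch)
{
    model_lock(&pch->downl);
    pch->chan = NULL;
    model_unlock(&pch->downl);
}

/* Trivial driver callback for experiments: accepts every frame. */
int model_accept_all(void *priv, const char *frame)
{
    (void)priv;
    (void)frame;
    return 1;
}

const struct model_chan_ops model_ops_accept_all = { model_accept_all };
```

Note what the pattern does not cover: the NULL check only helps while the `model_channel` structure itself is still valid memory. If the structure has already been freed, taking its lock and reading `chan` is itself a use of stale memory, which this sketch makes no attempt to model.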



Feb 14 08:32:00  [17937.863304] general protection fault:  [#1] SMP
Feb 14 08:32:00  [17937.863638] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
Feb 14 08:32:00  [17937.865619] CPU: 6 PID: 12543 Comm: accel-pppd Not tainted 4.15.3-build-0134 #4
Feb 14 08:32:00  [17937.866211] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
Feb 14 08:32:00  [17937.866542] RIP: 0010:ppp_push+0x73/0x4ec [ppp_generic]
Feb 14 08:32:00  [17937.866865] RSP: 0018:c90001fa7d50 EFLAGS: 00010282
Feb 14 08:32:00  [17937.867191] RAX: 0fd54d16ec03 RBX: 8803eeb207b8 RCX: 0101
Feb 14 08:32:00  [17937.867517] RDX:  RSI: 8803f9fb5000 RDI: 8803eed1e443
Feb 14 08:32:00  [17937.867844] RBP: 8803f9fb5000 R08: 0001 R09:
Feb 14 08:32:00  [17937.868171] R10: 7f0a75fba758 R11: 0293 R12: 8021
Feb 14 08:32:00  [17937.868499] R13: 8804144c7880 R14: 8021 R15: 8804144c7800
Feb 14 08:32:00  [17937.868824] FS:  7f0a7ecd8700() GS:88043418() knlGS:
Feb 14 08:32:00  [17937.869408] CS:  0010 DS:  ES:  CR0: 80050033
Feb 14 08:32:00  [17937.869729] CR2: 7fa87a187978 CR3: 00042a6cd005 CR4: 001606e0
Feb 14 08:32:00  [17937.870053] Call Trace:
Feb 14 08:32:00  [17937.870375]  ? __kmalloc_node_track_caller+0xb5/0xd6
Feb 14 08:32:00  [17937.870700]  __ppp_xmit_process+0x35/0x4c6 [ppp_generic]
Feb 14 08:32:00  [17937.871025]  ppp_xmit_process+0x35/0x88 [ppp_generic]
Feb 14 08:32:00  [17937.871350]  ppp_write+0xb1/0xbb [ppp_generic]
Feb 14 08:32:00  [17937.871678]  __vfs_write+0x1c/0x118
Feb 14 08:32:00  [17937.872003]  ? SyS_epoll_ctl+0x399/0x871
Feb 14 08:32:00  [17937.872328]  vfs_write+0xc6/0x169
Feb 14 08:32:00  [17937.872654]  SyS_write+0x48/0x81
Feb 14 08:32:00  [17937.872980]  do_syscall_64+0x5f/0xea
Feb 14 08:32:00  [17937.873310]  entry_SYSCALL_64_after_hwframe+0x21/0x86
Feb 14 08:32:00  [17937.873638] RIP: 0033:0x7f0a7e4bfb2d
Feb 14 08:32:00  [17937.873963] RSP: 002b:7f0a7ecd7b00 EFLAGS: 0293 ORIG_RAX: 0001
Feb 14 08:32:00  [17937.874554] RAX: ffda RBX: 7f0a7d00b1e3 RCX: 7f0a7e4bfb2d
Feb 14 08:32:00  [17937.874881] RDX: 000c RSI: 7f0a74175c80 RDI: 3ef8
Feb 14 08:32:00  [17937.875207] RBP: 7f0a7ecd7b30 R08:  R09: 55776e7a5e40
Feb 14 08:32:00  [17937.875536] R10: 7f0a75fba758 R11: 0293 R12: 7f0a7550dd18
Feb 14 08:32:00  [17937.875863] R13: 7ffd4c941eaf R14:  R15: 7f0a7ecd8700
Feb 14 08:32:00  [17937.876190] Code: 94 00 00 00 49 89 ff 0f ba e0 0a 72 43 48 8b 5f 68 48 8d 7b e8 e8 88 4f 84 e1 48 8b 7b b8 48 85 ff 74 10 48 8b 47 08 48 8b 34 24  10 85 c0 75 0b eb 14 48 8b 3c 24 e8 d8 6c 76 e1 49 c7 87 c8
Feb 14 08:32:00  [17937.877071] RIP: ppp_push+0x73/0x4ec [ppp_generic] RSP: c90001fa7d50
Feb 14 08:32:00  [17937.877435] ---[ end trace 30a3cc6a49109783 ]---
Feb 14 08:32:00  [17937.878370] Kernel panic - not syncing: Fatal exception in interrupt
Feb 14 08:32:00  [17937.878715] Kernel Offset: disabled
Feb 14 08:32:00  [17937.879771] Rebooting in 5 seconds..
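
The RIP line in the trace gives the faulting location as symbol+offset/size (ppp_push+0x73/0x4ec), which is the same expression passed to gdb's `list` command earlier in this message. As a small illustration of how the pieces relate, the C sketch below splits such a string into its parts and computes where the function starts from the address gdb printed. `sym_loc`, `parse_sym_loc` and `func_start` are hypothetical helper names, not part of any kernel or gdb tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "symbol+offset/size" location as printed in a kernel oops. */
struct sym_loc {
    char name[64];
    unsigned long offset; /* bytes from the start of the function */
    unsigned long size;   /* total size of the function in bytes */
};

/* Parse e.g. "ppp_push+0x73/0x4ec"; returns 0 on success, -1 on failure. */
int parse_sym_loc(const char *text, struct sym_loc *out)
{
    assert(text != NULL && out != NULL);
    /* %lx accepts an optional 0x prefix, as strtoul() does with base 16. */
    if (sscanf(text, "%63[^+]+%lx/%lx",
               out->name, &out->offset, &out->size) != 3)
        return -1;
    return 0;
}

/* Given the address a debugger reports for symbol+offset, recover the
 * address at which the function itself begins. */
unsigned long func_start(unsigned long listed_addr, const struct sym_loc *loc)
{
    return listed_addr - loc->offset;
}
```

With the values from this report, parsing "ppp_push+0x73/0x4ec" yields offset 0x73 and size 0x4ec, and `func_start(0x681, &loc)` gives 0x60e, that is, the address of ppp_push within ppp_generic.o implied by gdb's "0x681 is in ppp_push" line.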