Re: Kernel crash in free_pipe_info()

2017-11-10 Thread Linus Torvalds
On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer  wrote:
>
> This looks familiar...
>
> https://github.com/moby/moby/issues/34472
>
> From the bug report:
> "In particular, it looks like either docker-containerd or
> docker-containerd-shim (the log is cut off) has a pipe open that is
> causing a kernel BUG when attempting to kill the process. Fun times."

Interestingly, some of the "reproducers" show a totally different problem, with

  kernel BUG at /build/linux-5EyXrQ/linux-4.4.0/mm/slub.c:1495!

which is this BUG():

  1493  if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
  1494  pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
  1495  BUG();
  1496  }

which implies a completely invalid gfp mask.

The stack trace shows it's unix_stream_sendmsg -> sock_alloc_send_pskb
allocation, which uses "sk->sk_allocation".

Odd. af_unix just does

sk->sk_allocation   = GFP_KERNEL_ACCOUNT;

That is new since v4.4, though - but before that it should have been
set by sock_init_data to just GFP_KERNEL.

So there it looks like a socket is entirely corrupted.

So that other backtrace looks entirely bogus too. More "that looks
like corrupted kernel data structures".

Enabling SLUB debugging would be really good here too.

But that is still quite an old kernel. I wish somebody could reproduce
this with something newer.

I added those wishes to the github thing.

 Linus


Re: Kernel crash in free_pipe_info()

2017-11-10 Thread Linus Torvalds
On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer  wrote:
>
> This looks familiar...
>
> https://github.com/moby/moby/issues/34472
>
> From the bug report:
> "In particular, it looks like either docker-containerd or
> docker-containerd-shim (the log is cut off) has a pipe open that is
> causing a kernel BUG when attempting to kill the process. Fun times."

Interestingly, some of the "reproducers" show a totally different problem, with

  kernel BUG at /build/linux-5EyXrQ/linux-4.4.0/mm/slub.c:1495!

which is this BUG():

  1493  if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
  1494  pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
  1495  BUG();
  1496  }

which implies a completely invalid gfp mask.

The stack trace shows it's unix_stream_sendmsg -> sock_alloc_send_pskb
allocation, which uses "sk->sk_allocation".

Odd. af_unix just does

sk->sk_allocation   = GFP_KERNEL_ACCOUNT;

That is new since v4.4, though - but before that it should have been
set by sock_init_data to just GFP_KERNEL.

So there it looks like a socket is entirely corrupted.

So that other backtrace looks entirely bogus too. More "that looks
like corrupted kernel data structures".

Enabling SLUB debugging would be really good here too.

But that is still quite an old kernel. I wish somebody could reproduce
this with something newer.

I added those wishes to the github thing.

 Linus


Re: Kernel crash in free_pipe_info()

2017-11-10 Thread Cong Wang
Hi, Simon

On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer  wrote:
> This looks familiar...
>
> https://github.com/moby/moby/issues/34472
>
> From the bug report:
> "In particular, it looks like either docker-containerd or
> docker-containerd-shim (the log is cut off) has a pipe open that is
> causing a kernel BUG when attempting to kill the process. Fun times."

Yeah, looks nearly the same.

If you know how to reproduce it reliably, that would help a lot.

Also, as suggested by Linus, turning on slub debug would also help.
Or enable kdump to get a core dump.

Thanks!


Re: Kernel crash in free_pipe_info()

2017-11-10 Thread Cong Wang
Hi, Simon

On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer  wrote:
> This looks familiar...
>
> https://github.com/moby/moby/issues/34472
>
> From the bug report:
> "In particular, it looks like either docker-containerd or
> docker-containerd-shim (the log is cut off) has a pipe open that is
> causing a kernel BUG when attempting to kill the process. Fun times."

Yeah, looks nearly the same.

If you know how to reproduce it reliably, that would help a lot.

Also, as suggested by Linus, turning on slub debug would also help.
Or enable kdump to get a core dump.

Thanks!


Re: Kernel crash in free_pipe_info()

2017-11-09 Thread Simon Brewer
On 1 November 2017 at 14:19, Cong Wang  wrote:
> On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
>  wrote:
>> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>>>
>>> 1. The faulty addresses are all near 0001, with one exception
>>> of null (which is the most recent one)
>>
>> Well, they're at 8(%rax), except for that last case.
>>
>> And in every case (_including_ that last case), %rax has a very
>> interesting pattern.. That's the (bad) buf->ops pointer that  was
>> loaded from the somehow corrupted "buf".
>>
>> The values in all cases are
>>
>> fffa
>> fffd
>> fff1
>> fff7
>> fff4
>> fffa
>> fffd
>> fffd
>> fffa
>> ffe8
>> fff1
>> fff7
>>
>> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>>
>>   1 -24 (EMFILE)
>>   2 -15 (ENOTBLK)
>>   1 -12 (ENOMEM)
>>   2 -9 (EBADF)
>>   3 -6 (ENXIO)
>>   3 -3 (ESRCH)
>>
>> none of which makes any sense to me, but it's an interesting pattern
>> nonetheless.
>
>
> Yeah, good find!
>
>
>>
>>> 2. R12 register, which should map to the local vairable 'i', is always 0x8
>>> at the time of crash.
>>
>> So _if_ this is some kind of use-after-free thing, and the allocation
>> got re-used for something else, that might just be related to whatever
>> ends up being the offset that is filled in with the (int) error
>> number.
>>
>> Except the offset is that %r12*0x28+0x10, so we're talking a byte
>> offset of 330 bytes into the allocation, and apparently the eight
>> previous (0-7) iterations were fine.
>>
>> Which is really odd.
>>
>> I'm not seeing anything that makes sense. I'll have to think about this.
>>
>> I'm assuming you don't have slub debugging enabled, and no way to
>> enable it and try to catch this?
>
> We enable it at compile-time but not at run-time:
>
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_CPU_PARTIAL=y
> # CONFIG_SLUB_DEBUG_ON is not set
> # CONFIG_SLUB_STATS is not set
>
> I can try to manually add slub_debug in boot parameters, but still
> have no idea how and when can trigger this bug again.
>
>
> Thanks!

This looks familiar...

https://github.com/moby/moby/issues/34472

>From the bug report:
"In particular, it looks like either docker-containerd or
docker-containerd-shim (the log is cut off) has a pipe open that is
causing a kernel BUG when attempting to kill the process. Fun times."


Re: Kernel crash in free_pipe_info()

2017-11-09 Thread Simon Brewer
On 1 November 2017 at 14:19, Cong Wang  wrote:
> On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
>  wrote:
>> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>>>
>>> 1. The faulty addresses are all near 0001, with one exception
>>> of null (which is the most recent one)
>>
>> Well, they're at 8(%rax), except for that last case.
>>
>> And in every case (_including_ that last case), %rax has a very
>> interesting pattern.. That's the (bad) buf->ops pointer that  was
>> loaded from the somehow corrupted "buf".
>>
>> The values in all cases are
>>
>> fffa
>> fffd
>> fff1
>> fff7
>> fff4
>> fffa
>> fffd
>> fffd
>> fffa
>> ffe8
>> fff1
>> fff7
>>
>> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>>
>>   1 -24 (EMFILE)
>>   2 -15 (ENOTBLK)
>>   1 -12 (ENOMEM)
>>   2 -9 (EBADF)
>>   3 -6 (ENXIO)
>>   3 -3 (ESRCH)
>>
>> none of which makes any sense to me, but it's an interesting pattern
>> nonetheless.
>
>
> Yeah, good find!
>
>
>>
>>> 2. R12 register, which should map to the local vairable 'i', is always 0x8
>>> at the time of crash.
>>
>> So _if_ this is some kind of use-after-free thing, and the allocation
>> got re-used for something else, that might just be related to whatever
>> ends up being the offset that is filled in with the (int) error
>> number.
>>
>> Except the offset is that %r12*0x28+0x10, so we're talking a byte
>> offset of 330 bytes into the allocation, and apparently the eight
>> previous (0-7) iterations were fine.
>>
>> Which is really odd.
>>
>> I'm not seeing anything that makes sense. I'll have to think about this.
>>
>> I'm assuming you don't have slub debugging enabled, and no way to
>> enable it and try to catch this?
>
> We enable it at compile-time but not at run-time:
>
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_CPU_PARTIAL=y
> # CONFIG_SLUB_DEBUG_ON is not set
> # CONFIG_SLUB_STATS is not set
>
> I can try to manually add slub_debug in boot parameters, but still
> have no idea how and when can trigger this bug again.
>
>
> Thanks!

This looks familiar...

https://github.com/moby/moby/issues/34472

>From the bug report:
"In particular, it looks like either docker-containerd or
docker-containerd-shim (the log is cut off) has a pipe open that is
causing a kernel BUG when attempting to kill the process. Fun times."


Re: Kernel crash in free_pipe_info()

2017-10-31 Thread Cong Wang
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>>
>> 1. The faulty addresses are all near 0001, with one exception
>> of null (which is the most recent one)
>
> Well, they're at 8(%rax), except for that last case.
>
> And in every case (_including_ that last case), %rax has a very
> interesting pattern.. That's the (bad) buf->ops pointer that  was
> loaded from the somehow corrupted "buf".
>
> The values in all cases are
>
> fffa
> fffd
> fff1
> fff7
> fff4
> fffa
> fffd
> fffd
> fffa
> ffe8
> fff1
> fff7
>
> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>
>   1 -24 (EMFILE)
>   2 -15 (ENOTBLK)
>   1 -12 (ENOMEM)
>   2 -9 (EBADF)
>   3 -6 (ENXIO)
>   3 -3 (ESRCH)
>
> none of which makes any sense to me, but it's an interesting pattern
> nonetheless.


Yeah, good find!


>
>> 2. R12 register, which should map to the local vairable 'i', is always 0x8
>> at the time of crash.
>
> So _if_ this is some kind of use-after-free thing, and the allocation
> got re-used for something else, that might just be related to whatever
> ends up being the offset that is filled in with the (int) error
> number.
>
> Except the offset is that %r12*0x28+0x10, so we're talking a byte
> offset of 330 bytes into the allocation, and apparently the eight
> previous (0-7) iterations were fine.
>
> Which is really odd.
>
> I'm not seeing anything that makes sense. I'll have to think about this.
>
> I'm assuming you don't have slub debugging enabled, and no way to
> enable it and try to catch this?

We enable it at compile-time but not at run-time:

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

I can try to manually add slub_debug in boot parameters, but still
have no idea how and when can trigger this bug again.


Thanks!


Re: Kernel crash in free_pipe_info()

2017-10-31 Thread Cong Wang
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>>
>> 1. The faulty addresses are all near 0001, with one exception
>> of null (which is the most recent one)
>
> Well, they're at 8(%rax), except for that last case.
>
> And in every case (_including_ that last case), %rax has a very
> interesting pattern.. That's the (bad) buf->ops pointer that  was
> loaded from the somehow corrupted "buf".
>
> The values in all cases are
>
> fffa
> fffd
> fff1
> fff7
> fff4
> fffa
> fffd
> fffd
> fffa
> ffe8
> fff1
> fff7
>
> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>
>   1 -24 (EMFILE)
>   2 -15 (ENOTBLK)
>   1 -12 (ENOMEM)
>   2 -9 (EBADF)
>   3 -6 (ENXIO)
>   3 -3 (ESRCH)
>
> none of which makes any sense to me, but it's an interesting pattern
> nonetheless.


Yeah, good find!


>
>> 2. R12 register, which should map to the local vairable 'i', is always 0x8
>> at the time of crash.
>
> So _if_ this is some kind of use-after-free thing, and the allocation
> got re-used for something else, that might just be related to whatever
> ends up being the offset that is filled in with the (int) error
> number.
>
> Except the offset is that %r12*0x28+0x10, so we're talking a byte
> offset of 330 bytes into the allocation, and apparently the eight
> previous (0-7) iterations were fine.
>
> Which is really odd.
>
> I'm not seeing anything that makes sense. I'll have to think about this.
>
> I'm assuming you don't have slub debugging enabled, and no way to
> enable it and try to catch this?

We enable it at compile-time but not at run-time:

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

I can try to manually add slub_debug in boot parameters, but still
have no idea how and when can trigger this bug again.


Thanks!


Re: Kernel crash in free_pipe_info()

2017-10-31 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 9:44 PM, Al Viro  wrote:
> On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote:
>>
>> Well, they're at 8(%rax), except for that last case.
>
> 0x10(%rax)?

Duh, yes.

>> Except the offset is that %r12*0x28+0x10, so we're talking a byte
>> offset of 330 bytes into the allocation, and apparently the eight
>> previous (0-7) iterations were fine.
>>
>> Which is really odd.
>
> I wonder what pipe->buffers is equal to here...

Sadly, we never even bother loading it so it doesn't show up in the
register state, we just iterate over the whole table.

The (one) ppc oops that looks like it might be the same issue has a
totally different pattern. Instead of having that "error number"
looking thing as the invalid pointer base, it has the magic number
from a spinlock. And rather than being about "pipe->bufs[]" array,
it's the pipe pointer itself that seems corrupted, and thus the oops
happens in the account_pipe_buffers() code instead of in the loop over
the buffers.

Of course, both are consistent with that "pipe_inode_info" simply
having been overwritten by something else (possibly, but not
necessarily, due to a use-after-free).

> FWIW, I would try to slap
> if (buf->ops && (unsigned long)buf->ops <= 0x)
> dump the living hell out of that thing
> and see what it catches...

Actually, I'm looking at *another* error path - the one in named pipes.

Named pipes are why we do that nasty "inode->i_pipe" thing - if it was
for just the regular pipes, we'd be able to just do
file->f_private_data and be done with it. But named pipes have to have
the pipe data associated with a particular inode.

And that code actually does look wrong.

Look at fifo_open(): it increments the pipe->files as it sets
filp->private_data to point to the pipe_inode_info. Good so far.

But look at the error case. It does that put_pipe_info() to release it
again, but filp->private_data still contains the pipe_inode_info
pointer.

So what happens on a failed open of a named pipe? The *normal* code
all is very careful to _not_ use "fput()", but instead use
"put_filp(f)", which will just free the file pointer.

But what if somebody does "vfs_open()" on one of those things, and
then does "fput()" in the failure case?

In "do_last()" we have that FILE_OPENED protection:

if (unlikely(error) && (*opened & FILE_OPENED))
fput(file);

and path_openat() is again very careful to then use put_filp(file); if
FILE_OPENED was never set. And do_o_path() does the same.

I'm not seeing anybody who does the wrong thing, but there's a number
of ways to get this entirely wrong, and I worry some path does.

I would be a *lot* happier if we didn't have that very subtle
fput()-vs-put_filp() issue going on.

Again, I cannot see anything wrong, but this feels very very fragile to me.

   Linus


Re: Kernel crash in free_pipe_info()

2017-10-31 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 9:44 PM, Al Viro  wrote:
> On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote:
>>
>> Well, they're at 8(%rax), except for that last case.
>
> 0x10(%rax)?

Duh, yes.

>> Except the offset is that %r12*0x28+0x10, so we're talking a byte
>> offset of 330 bytes into the allocation, and apparently the eight
>> previous (0-7) iterations were fine.
>>
>> Which is really odd.
>
> I wonder what pipe->buffers is equal to here...

Sadly, we never even bother loading it so it doesn't show up in the
register state, we just iterate over the whole table.

The (one) ppc oops that looks like it might be the same issue has a
totally different pattern. Instead of having that "error number"
looking thing as the invalid pointer base, it has the magic number
from a spinlock. And rather than being about "pipe->bufs[]" array,
it's the pipe pointer itself that seems corrupted, and thus the oops
happens in the account_pipe_buffers() code instead of in the loop over
the buffers.

Of course, both are consistent with that "pipe_inode_info" simply
having been overwritten by something else (possibly, but not
necessarily, due to a use-after-free).

> FWIW, I would try to slap
> if (buf->ops && (unsigned long)buf->ops <= 0x)
> dump the living hell out of that thing
> and see what it catches...

Actually, I'm looking at *another* error path - the one in named pipes.

Named pipes are why we do that nasty "inode->i_pipe" thing - if it was
for just the regular pipes, we'd be able to just do
file->f_private_data and be done with it. But named pipes have to have
the pipe data associated with a particular inode.

And that code actually does look wrong.

Look at fifo_open(): it increments the pipe->files as it sets
filp->private_data to point to the pipe_inode_info. Good so far.

But look at the error case. It does that put_pipe_info() to release it
again, but filp->private_data still contains the pipe_inode_info
pointer.

So what happens on a failed open of a named pipe? The *normal* code
all is very careful to _not_ use "fput()", but instead use
"put_filp(f)", which will just free the file pointer.

But what if somebody does "vfs_open()" on one of those things, and
then does "fput()" in the failure case?

In "do_last()" we have that FILE_OPENED protection:

if (unlikely(error) && (*opened & FILE_OPENED))
fput(file);

and path_openat() is again very careful to then use put_filp(file); if
FILE_OPENED was never set. And do_o_path() does the same.

I'm not seeing anybody who does the wrong thing, but there's a number
of ways to get this entirely wrong, and I worry some path does.

I would be a *lot* happier if we didn't have that very subtle
fput()-vs-put_filp() issue going on.

Again, I cannot see anything wrong, but this feels very very fragile to me.

   Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Al Viro
On Mon, Oct 30, 2017 at 08:06:23PM -0700, Linus Torvalds wrote:

> We do that "free_pipe_info(inode->i_pipe);", but we never actually
> clear inode->i_pipe, so now we have an inode that looks like a pipe
> inode, and has a stale pointer to a pipe_inode_info.
> 
> It all looks technically correct. It's fine to use put_filp(), because
> the file pointer has never really been used. And the inode should
> never get re-used anyway without going through the whole reinit in
> inode_init_always().
> 
> So I don't see anything *wrong*, but I see a lot that is just unusual,

FWIW, it's really brittle - consider
if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_inc(path->dentry->d_inode);
in alloc_file().  It's not the source of trouble in this case, but only
because it's the second call that gets FMODE_READ; reorder them in
create_pipe_files() and you've got a bug.

I considered using fput() there, but that would've required manually
decrementing pipe->files first, which made it rather unappealing...

I don't see anything relevant there, but that's not saying much - flu
and debugging do not mix well, and lack of sleep also doesn't help ;-/


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Al Viro
On Mon, Oct 30, 2017 at 08:06:23PM -0700, Linus Torvalds wrote:

> We do that "free_pipe_info(inode->i_pipe);", but we never actually
> clear inode->i_pipe, so now we have an inode that looks like a pipe
> inode, and has a stale pointer to a pipe_inode_info.
> 
> It all looks technically correct. It's fine to use put_filp(), because
> the file pointer has never really been used. And the inode should
> never get re-used anyway without going through the whole reinit in
> inode_init_always().
> 
> So I don't see anything *wrong*, but I see a lot that is just unusual,

FWIW, it's really brittle - consider
if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_inc(path->dentry->d_inode);
in alloc_file().  It's not the source of trouble in this case, but only
because it's the second call that gets FMODE_READ; reorder them in
create_pipe_files() and you've got a bug.

I considered using fput() there, but that would've required manually
decrementing pipe->files first, which made it rather unappealing...

I don't see anything relevant there, but that's not saying much - flu
and debugging do not mix well, and lack of sleep also doesn't help ;-/


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Al Viro
On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote:
> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
> >
> > 1. The faulty addresses are all near 0001, with one exception
> > of null (which is the most recent one)
> 
> Well, they're at 8(%rax), except for that last case.

0x10(%rax)?

> And in every case (_including_ that last case), %rax has a very
> interesting pattern.. That's the (bad) buf->ops pointer that  was
> loaded from the somehow corrupted "buf".

> So _if_ this is some kind of use-after-free thing, and the allocation
> got re-used for something else, that might just be related to whatever
> ends up being the offset that is filled in with the (int) error
> number.
> 
> Except the offset is that %r12*0x28+0x10, so we're talking a byte
> offset of 330 bytes into the allocation, and apparently the eight
> previous (0-7) iterations were fine.
> 
> Which is really odd.

I wonder what pipe->buffers is equal to here...

> I'm not seeing anything that makes sense. I'll have to think about this.
> 
> I'm assuming you don't have slub debugging enabled, and no way to
> enable it and try to catch this?

FWIW, I would try to slap
if (buf->ops && (unsigned long)buf->ops <= 0x)
dump the living hell out of that thing
and see what it catches...


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Al Viro
On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote:
> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
> >
> > 1. The faulty addresses are all near 0001, with one exception
> > of null (which is the most recent one)
> 
> Well, they're at 8(%rax), except for that last case.

0x10(%rax)?

> And in every case (_including_ that last case), %rax has a very
> interesting pattern.. That's the (bad) buf->ops pointer that  was
> loaded from the somehow corrupted "buf".

> So _if_ this is some kind of use-after-free thing, and the allocation
> got re-used for something else, that might just be related to whatever
> ends up being the offset that is filled in with the (int) error
> number.
> 
> Except the offset is that %r12*0x28+0x10, so we're talking a byte
> offset of 330 bytes into the allocation, and apparently the eight
> previous (0-7) iterations were fine.
> 
> Which is really odd.

I wonder what pipe->buffers is equal to here...

> I'm not seeing anything that makes sense. I'll have to think about this.
> 
> I'm assuming you don't have slub debugging enabled, and no way to
> enable it and try to catch this?

FWIW, I would try to slap
if (buf->ops && (unsigned long)buf->ops <= 0x)
dump the living hell out of that thing
and see what it catches...


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
 wrote:
>
> I'm not seeing anything that makes sense. I'll have to think about this.

Al, would you mind taking a look at the error handling in create_pipe_files().

In particular, look here:

 - we start out allocating the inode with "get_pipe_inode().

   That sets up a inode->i_pipe, with pipe->files initialized to 2.
Fine. We're going to have two file descriptors.

 - we then create the dummy path:

path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, _name);

   fine fine. Again, this looks all good for the success cases.

But the *error* cases are a bit dodgy, aren't they?

We have three different error cases:

 - we couldn't even allocate a dentry. We do

free_pipe_info(inode->i_pipe);
iput(inode);

 - we couldn't allocate any file at all:

free_pipe_info(inode->i_pipe);
path_put();

 - we allocated the first file, but not the second:

put_filp(f);
free_pipe_info(inode->i_pipe);
path_put();

and it worries me a bit that in all those error cases, we end up doing
that "free_pipe_info()", but we basically do this half-arsed job of
freeing things.

For example, we use "put_filp()" to free the file pointer, not "fput()".

We do that "free_pipe_info(inode->i_pipe);", but we never actually
clear inode->i_pipe, so now we have an inode that looks like a pipe
inode, and has a stale pointer to a pipe_inode_info.

It all looks technically correct. It's fine to use put_filp(), because
the file pointer has never really been used. And the inode should
never get re-used anyway without going through the whole reinit in
inode_init_always().

So I don't see anything *wrong*, but I see a lot that is just unusual,
and seems to depend on half-initialized state being fine. Can you look
at this?

Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
 wrote:
>
> I'm not seeing anything that makes sense. I'll have to think about this.

Al, would you mind taking a look at the error handling in create_pipe_files().

In particular, look here:

 - we start out allocating the inode with "get_pipe_inode().

   That sets up a inode->i_pipe, with pipe->files initialized to 2.
Fine. We're going to have two file descriptors.

 - we then create the dummy path:

path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, _name);

   fine fine. Again, this looks all good for the success cases.

But the *error* cases are a bit dodgy, aren't they?

We have three different error cases:

 - we couldn't even allocate a dentry. We do

free_pipe_info(inode->i_pipe);
iput(inode);

 - we couldn't allocate any file at all:

free_pipe_info(inode->i_pipe);
path_put();

 - we allocated the first file, but not the second:

put_filp(f);
free_pipe_info(inode->i_pipe);
path_put();

and it worries me a bit that in all those error cases, we end up doing
that "free_pipe_info()", but we basically do this half-arsed job of
freeing things.

For example, we use "put_filp()" to free the file pointer, not "fput()".

We do that "free_pipe_info(inode->i_pipe);", but we never actually
clear inode->i_pipe, so now we have an inode that looks like a pipe
inode, and has a stale pointer to a pipe_inode_info.

It all looks technically correct. It's fine to use put_filp(), because
the file pointer has never really been used. And the inode should
never get re-used anyway without going through the whole reinit in
inode_init_always().

So I don't see anything *wrong*, but I see a lot that is just unusual,
and seems to depend on half-initialized state being fine. Can you look
at this?

Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>
> 1. The faulty addresses are all near 0001, with one exception
> of null (which is the most recent one)

Well, they're at 8(%rax), except for that last case.

And in every case (_including_ that last case), %rax has a very
interesting pattern.. That's the (bad) buf->ops pointer that  was
loaded from the somehow corrupted "buf".

The values in all cases are

fffa
fffd
fff1
fff7
fff4
fffa
fffd
fffd
fffa
ffe8
fff1
fff7

which kind of looks like a 32-bit error value. So we have (n, val, (errno)):

  1 -24 (EMFILE)
  2 -15 (ENOTBLK)
  1 -12 (ENOMEM)
  2 -9 (EBADF)
  3 -6 (ENXIO)
  3 -3 (ESRCH)

none of which makes any sense to me, but it's an interesting pattern
nonetheless.

> 2. R12 register, which should map to the local vairable 'i', is always 0x8
> at the time of crash.

So _if_ this is some kind of use-after-free thing, and the allocation
got re-used for something else, that might just be related to whatever
ends up being the offset that is filled in with the (int) error
number.

Except the offset is that %r12*0x28+0x10, so we're talking a byte
offset of 330 bytes into the allocation, and apparently the eight
previous (0-7) iterations were fine.

Which is really odd.

I'm not seeing anything that makes sense. I'll have to think about this.

I'm assuming you don't have slub debugging enabled, and no way to
enable it and try to catch this?

   Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang  wrote:
>
> 1. The faulty addresses are all near 0001, with one exception
> of null (which is the most recent one)

Well, they're at 8(%rax), except for that last case.

And in every case (_including_ that last case), %rax has a very
interesting pattern.. That's the (bad) buf->ops pointer that  was
loaded from the somehow corrupted "buf".

The values in all cases are

fffa
fffd
fff1
fff7
fff4
fffa
fffd
fffd
fffa
ffe8
fff1
fff7

which kind of looks like a 32-bit error value. So we have (n, val, (errno)):

  1 -24 (EMFILE)
  2 -15 (ENOTBLK)
  1 -12 (ENOMEM)
  2 -9 (EBADF)
  3 -6 (ENXIO)
  3 -3 (ESRCH)

none of which makes any sense to me, but it's an interesting pattern
nonetheless.

> 2. R12 register, which should map to the local vairable 'i', is always 0x8
> at the time of crash.

So _if_ this is some kind of use-after-free thing, and the allocation
got re-used for something else, that might just be related to whatever
ends up being the offset that is filled in with the (int) error
number.

Except the offset is that %r12*0x28+0x10, so we're talking a byte
offset of 330 bytes into the allocation, and apparently the eight
previous (0-7) iterations were fine.

Which is really odd.

I'm not seeing anything that makes sense. I'll have to think about this.

I'm assuming you don't have slub debugging enabled, and no way to
enable it and try to catch this?

   Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>
>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>> 4.1 kernel, they are all very similar to this one (with slightly
>> different faulty addresses):
>
> Were it not for the pointer to the much more recent powerpc version at
>
> https://www.spinics.net/lists/kernel/msg2616441.html
>
> that you referenced, I'd be inclined to ignore this as much too old.
> Al fixed a number of issues with splice() some time ago.
>
> And honestly, splice() is pretty much the only thing that I can think
> of that can trigger issues wrt the pipe_inode_info structure.
>
> But as you say, we've had subtle races just due to the named pipes:


It is unlikely that we use splice() or named pipes, I highly suspect it
is caused by the pipe(), although stack traces can't tell.


>
>> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
>> subtle use-after-free of pipe_inode_info"), it is probably related.
>
> I do hate the games we play with the stupid pipe_inode_info. The way
> we have locking in the inode, instead of in the pipe_inode_info
> itself.
>
> There are hysterical raisins for that, and due to names piped we have
> to keep the reference to the pipe_inode_info in the inode, but it's
> still very very annoying.
>
> Al, do you see anything?
>
> Cong, do you have any way to trigger these? Is there any pattern to
> when they happen or what is going on when they do?
>

I have no idea how this was triggered, all of the stack traces are reported
by ABRT service on each host.

Note, since kdump is not properly config'ed, there could be even more
machines crashed silently in a corner without even an ABRT report.

I will try to gather more information.

Thanks!


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>
>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>> 4.1 kernel, they are all very similar to this one (with slightly
>> different faulty addresses):
>
> Were it not for the pointer to the much more recent powerpc version at
>
> https://www.spinics.net/lists/kernel/msg2616441.html
>
> that you referenced, I'd be inclined to ignore this as much too old.
> Al fixed a number of issues with splice() some time ago.
>
> And honestly, splice() is pretty much the only thing that I can think
> of that can trigger issues wrt the pipe_inode_info structure.
>
> But as you say, we've had subtle races just due to the named pipes:


It is unlikely that we use splice() or named pipes, I highly suspect it
is caused by the pipe(), although stack traces can't tell.


>
>> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
>> subtle use-after-free of pipe_inode_info"), it is probably related.
>
> I do hate the games we play with the stupid pipe_inode_info. The way
> we have locking in the inode, instead of in the pipe_inode_info
> itself.
>
> There are hysterical raisins for that, and due to names piped we have
> to keep the reference to the pipe_inode_info in the inode, but it's
> still very very annoying.
>
> Al, do you see anything?
>
> Cong, do you have any way to trigger these? Is there any pattern to
> when they happen or what is going on when they do?
>

I have no idea how this was triggered, all of the stack traces are reported
by ABRT service on each host.

Note, since kdump is not properly config'ed, there could be even more
machines crashed silently in a corner without even an ABRT report.

I will try to gather more information.

Thanks!


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
On Mon, Oct 30, 2017 at 3:26 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
>  wrote:
>> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>>
>>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>>> 4.1 kernel, they are all very similar to this one (with slightly
>>> different faulty addresses):
>>
>> Were it not for the pointer to the much more recent powerpc version at
>>
>> https://www.spinics.net/lists/kernel/msg2616441.html
>>
>> that you referenced, I'd be inclined to ignore this as much too old.
>> Al fixed a number of issues with splice() some time ago.
>
> Hmm. So I tried to google for this oops, and the only oops I find is
> hat ppc one you reference.
>
> Was that how you found it too?

Yes.


>
> The fact that google doesn't report any other oopses would seem to
> indicate that this is not a widespread issue. Where are the "more than
> a dozen" ones you refer to being reported? I'd like to see them all if
> at all possible, to see if there's some pattern to them..
>

We have thousands of machines running 4.1, so even with a dozen
crashes it is still a very small percentage. :)

I am attaching these stack traces that I copy-n-paste from the ABRT
email reports.

There are two patterns I can observe:

1. The faulty addresses are all near 0001, with one exception
of null (which is the most recent one)

2. R12 register, which should map to the local vairable 'i', is always 0x8
at the time of crash.


Hope it helps.

Thanks!

:[840243.314457] BUG: unable to handle kernel NULL pointer dereference at   
(null)
:[840243.322420] IP: [<  (null)>]   (null)
:[840243.327582] PGD 18782067 PUD 24d42067 PMD 0
:[840243.331989] Oops: 0010 [#1] SMP
:[840243.335346] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth 
sch_ingress fuse cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm 
microcode lp parport iTCO_wdt tcp_diag iTCO_vendor_support inet_diag sb_edac 
edac_core lpc_ich i2c_i801 i2c_core mfd_core ioatdma dca hed wmi shpchp ipmi_si 
ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en 
vxlan ip6_udp_tunnel udp_tunnel ptp pps_core crc32c_intel serio_raw mlx4_core 
isci libsas scsi_transport_sas ipv6
:[840243.379972] CPU: 5 PID: 9355 Comm: java Not tainted 4.1.41.el7.x86_64 #1
:[840243.387803] Hardware name: SYNNEX HYVE-ZEUS/X9DRD-iF, BIOS 3.0.4 12/06/2013
:[840243.394855] task: 880866ba ti: 88036094 task.ti: 
88036094
:[840243.402426] RIP: 0010:[<>]  [<  (null)>]   
(null)
:[840243.410005] RSP: 0018:880360943e30  EFLAGS: 00010206
:[840243.415400] RAX: fffa RBX: 8804697c4f00 RCX: 
001a
:[840243.422617] RDX: 001b RSI: 8807cdc1b940 RDI: 
8804697c4f00
:[840243.429835] RBP: 880360943e48 R08:  R09: 

:[840243.437059] R10: b708 R11: 88049f34b010 R12: 
0008
:[840243.444275] R13: 880866500aa0 R14: 8808689681a0 R15: 
880866500aa0
:[840243.451492] FS:  7f01c40ba700() GS:88046fca() 
knlGS:
:[840243.459672] CS:  0010 DS:  ES:  CR0: 80050033
:[840243.465509] CR2:  CR3: 78b43000 CR4: 
001406e0
:[840243.472734] Stack:
:[840243.474841]  8117660f 880866500aa0 8804697c4f00 
880360943e68
:[840243.482386]  8117668e 8804697c4f00 88049f34b000 
880360943e98
:[840243.489933]  8117672a  88049f34b000 
0010
:[840243.497479] Call Trace:
:[840243.500019]  [] ? free_pipe_info+0x57/0x85
:[840243.505768]  [] put_pipe_info+0x51/0x56
:[840243.511254]  [] pipe_release+0x97/0xa1
:[840243.516659]  [] __fput+0xf9/0x1b8
:[840243.521628]  [] fput+0xe/0x10
:[840243.526601]  [] task_work_run+0x8a/0xa1
:[840243.532089]  [] do_notify_resume+0x56/0x65
:[840243.537842]  [] int_signal+0x12/0x17
:[840243.543068] Code:  Bad RIP value.
:[840243.546519] RIP  [<  (null)>]   (null)
:[840243.551757]  RSP 
:[840243.555332] CR2: 
:[840243.559179] ---[ end trace b725063bc2e1920b ]---



:[319591.443093] IP: [] free_pipe_info+0x54/0x85
:[319591.448958] PGD 0
:[319591.450964] Oops:  [#1] SMP
:[319591.454327] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth 
sch_ingress fuse cpufreq_ondemand xfs libcrc32c intel_rapl iosf_mbi 
x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas 
microcode sb_edac lpc_ich ioatdma mfd_core shpchp i2c_i801 edac_core acpi_pad 
wmi hed lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler 
acpi_cpufreq sch_fq_codel crc32c_intel igb ptp pps_core i2c_algo_bit i2c_core 
dca ipv6
:[319591.494830] CPU: 4 PID: 1221 Comm: sshd Not tainted 4.1.41.el7.x86_64 #1
:[319591.502659] Hardware name: 

Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
On Mon, Oct 30, 2017 at 3:26 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
>  wrote:
>> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>>
>>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>>> 4.1 kernel, they are all very similar to this one (with slightly
>>> different faulty addresses):
>>
>> Were it not for the pointer to the much more recent powerpc version at
>>
>> https://www.spinics.net/lists/kernel/msg2616441.html
>>
>> that you referenced, I'd be inclined to ignore this as much too old.
>> Al fixed a number of issues with splice() some time ago.
>
> Hmm. So I tried to google for this oops, and the only oops I find is
> hat ppc one you reference.
>
> Was that how you found it too?

Yes.


>
> The fact that google doesn't report any other oopses would seem to
> indicate that this is not a widespread issue. Where are the "more than
> a dozen" ones you refer to being reported? I'd like to see them all if
> at all possible, to see if there's some pattern to them..
>

We have thousands of machines running 4.1, so even with a dozen
crashes it is still a very small percentage. :)

I am attaching these stack traces that I copy-n-paste from the ABRT
email reports.

There are two patterns I can observe:

1. The faulty addresses are all near 0001, with one exception
of null (which is the most recent one)

2. R12 register, which should map to the local vairable 'i', is always 0x8
at the time of crash.


Hope it helps.

Thanks!

:[840243.314457] BUG: unable to handle kernel NULL pointer dereference at   
(null)
:[840243.322420] IP: [<  (null)>]   (null)
:[840243.327582] PGD 18782067 PUD 24d42067 PMD 0
:[840243.331989] Oops: 0010 [#1] SMP
:[840243.335346] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth 
sch_ingress fuse cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm 
microcode lp parport iTCO_wdt tcp_diag iTCO_vendor_support inet_diag sb_edac 
edac_core lpc_ich i2c_i801 i2c_core mfd_core ioatdma dca hed wmi shpchp ipmi_si 
ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en 
vxlan ip6_udp_tunnel udp_tunnel ptp pps_core crc32c_intel serio_raw mlx4_core 
isci libsas scsi_transport_sas ipv6
:[840243.379972] CPU: 5 PID: 9355 Comm: java Not tainted 4.1.41.el7.x86_64 #1
:[840243.387803] Hardware name: SYNNEX HYVE-ZEUS/X9DRD-iF, BIOS 3.0.4 12/06/2013
:[840243.394855] task: 880866ba ti: 88036094 task.ti: 
88036094
:[840243.402426] RIP: 0010:[<>]  [<  (null)>]   
(null)
:[840243.410005] RSP: 0018:880360943e30  EFLAGS: 00010206
:[840243.415400] RAX: fffa RBX: 8804697c4f00 RCX: 
001a
:[840243.422617] RDX: 001b RSI: 8807cdc1b940 RDI: 
8804697c4f00
:[840243.429835] RBP: 880360943e48 R08:  R09: 

:[840243.437059] R10: b708 R11: 88049f34b010 R12: 
0008
:[840243.444275] R13: 880866500aa0 R14: 8808689681a0 R15: 
880866500aa0
:[840243.451492] FS:  7f01c40ba700() GS:88046fca() 
knlGS:
:[840243.459672] CS:  0010 DS:  ES:  CR0: 80050033
:[840243.465509] CR2:  CR3: 78b43000 CR4: 
001406e0
:[840243.472734] Stack:
:[840243.474841]  8117660f 880866500aa0 8804697c4f00 
880360943e68
:[840243.482386]  8117668e 8804697c4f00 88049f34b000 
880360943e98
:[840243.489933]  8117672a  88049f34b000 
0010
:[840243.497479] Call Trace:
:[840243.500019]  [] ? free_pipe_info+0x57/0x85
:[840243.505768]  [] put_pipe_info+0x51/0x56
:[840243.511254]  [] pipe_release+0x97/0xa1
:[840243.516659]  [] __fput+0xf9/0x1b8
:[840243.521628]  [] fput+0xe/0x10
:[840243.526601]  [] task_work_run+0x8a/0xa1
:[840243.532089]  [] do_notify_resume+0x56/0x65
:[840243.537842]  [] int_signal+0x12/0x17
:[840243.543068] Code:  Bad RIP value.
:[840243.546519] RIP  [<  (null)>]   (null)
:[840243.551757]  RSP 
:[840243.555332] CR2: 
:[840243.559179] ---[ end trace b725063bc2e1920b ]---



:[319591.443093] IP: [] free_pipe_info+0x54/0x85
:[319591.448958] PGD 0
:[319591.450964] Oops:  [#1] SMP
:[319591.454327] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth 
sch_ingress fuse cpufreq_ondemand xfs libcrc32c intel_rapl iosf_mbi 
x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas 
microcode sb_edac lpc_ich ioatdma mfd_core shpchp i2c_i801 edac_core acpi_pad 
wmi hed lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler 
acpi_cpufreq sch_fq_codel crc32c_intel igb ptp pps_core i2c_algo_bit i2c_core 
dca ipv6
:[319591.494830] CPU: 4 PID: 1221 Comm: sshd Not tainted 4.1.41.el7.x86_64 #1
:[319591.502659] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 
11/07/2013
:[319591.510310] task: 

Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>
>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>> 4.1 kernel, they are all very similar to this one (with slightly
>> different faulty addresses):
>
> Were it not for the pointer to the much more recent powerpc version at
>
> https://www.spinics.net/lists/kernel/msg2616441.html
>
> that you referenced, I'd be inclined to ignore this as much too old.
> Al fixed a number of issues with splice() some time ago.

Hmm. So I tried to google for this oops, and the only oops I find is
hat ppc one you reference.

Was that how you found it too?

The fact that google doesn't report any other oopses would seem to
indicate that this is not a widespread issue. Where are the "more than
a dozen" ones you refer to being reported? I'd like to see them all if
at all possible, to see if there's some pattern to them..

Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds
 wrote:
> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>>
>> We got more than a dozen of kernel crashes at free_pipe_info() on our
>> 4.1 kernel, they are all very similar to this one (with slightly
>> different faulty addresses):
>
> Were it not for the pointer to the much more recent powerpc version at
>
> https://www.spinics.net/lists/kernel/msg2616441.html
>
> that you referenced, I'd be inclined to ignore this as much too old.
> Al fixed a number of issues with splice() some time ago.

Hmm. So I tried to google for this oops, and the only oops I find is
hat ppc one you reference.

Was that how you found it too?

The fact that google doesn't report any other oopses would seem to
indicate that this is not a widespread issue. Where are the "more than
a dozen" ones you refer to being reported? I'd like to see them all if
at all possible, to see if there's some pattern to them..

Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>
> We got more than a dozen of kernel crashes at free_pipe_info() on our
> 4.1 kernel, they are all very similar to this one (with slightly
> different faulty addresses):

Were it not for the pointer to the much more recent powerpc version at

https://www.spinics.net/lists/kernel/msg2616441.html

that you referenced, I'd be inclined to ignore this as much too old.
Al fixed a number of issues with splice() some time ago.

And honestly, splice() is pretty much the only thing that I can think
of that can trigger issues wrt the pipe_inode_info structure.

But as you say, we've had subtle races just due to the named pipes:

> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
> subtle use-after-free of pipe_inode_info"), it is probably related.

I do hate the games we play with the stupid pipe_inode_info. The way
we have locking in the inode, instead of in the pipe_inode_info
itself.

There are hysterical raisins for that, and due to names piped we have
to keep the reference to the pipe_inode_info in the inode, but it's
still very very annoying.

Al, do you see anything?

Cong, do you have any way to trigger these? Is there any pattern to
when they happen or what is going on when they do?

   Linus


Re: Kernel crash in free_pipe_info()

2017-10-30 Thread Linus Torvalds
On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang  wrote:
>
> We got more than a dozen of kernel crashes at free_pipe_info() on our
> 4.1 kernel, they are all very similar to this one (with slightly
> different faulty addresses):

Were it not for the pointer to the much more recent powerpc version at

https://www.spinics.net/lists/kernel/msg2616441.html

that you referenced, I'd be inclined to ignore this as much too old.
Al fixed a number of issues with splice() some time ago.

And honestly, splice() is pretty much the only thing that I can think
of that can trigger issues wrt the pipe_inode_info structure.

But as you say, we've had subtle races just due to the named pipes:

> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
> subtle use-after-free of pipe_inode_info"), it is probably related.

I do hate the games we play with the stupid pipe_inode_info. The way
we have locking in the inode, instead of in the pipe_inode_info
itself.

There are hysterical raisins for that, and due to names piped we have
to keep the reference to the pipe_inode_info in the inode, but it's
still very very annoying.

Al, do you see anything?

Cong, do you have any way to trigger these? Is there any pattern to
when they happen or what is going on when they do?

   Linus


Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
Hi, Linus


We got more than a dozen of kernel crashes at free_pipe_info() on our
4.1 kernel, they are all very similar to this one (with slightly
different faulty addresses):


[433948.755473] BUG: unable to handle kernel paging request at 0001000d
[433948.762570] IP: [] free_pipe_info+0x54/0x85
[433948.768493] PGD 0
[433948.770537] Oops:  [#1] SMP
[433948.773768] Modules linked in: sch_htb cls_basic act_mirred
cls_u32 veth sch_ingress fuse cpufreq_ondemand intel_rapl iosf_mbi
x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt
iTCO_vendor_support dcdbas microcode sb_edac wmi edac_core i2c_i801
hed lpc_ich mfd_core ioatdma shpchp lp parport tcp_diag inet_diag
ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs
libcrc32c crc32c_intel igb ptp pps_core i2c_algo_bit dca i2c_core ipv6
[433948.813421] CPU: 10 PID: 50667 Comm: rsync Not tainted 4.1.41.el7.x86_64 #1
[433948.821498] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[433948.829247] task: 880f47fde240 ti: 880d282d8000 task.ti:
880d282d8000
[433948.836770] RIP: 0010:[]  []
free_pipe_info+0x54/0x85
[433948.845088] RSP: 0018:880d282dbd98  EFLAGS: 00010202
[433948.850473] RAX: fffd RBX: 8817e7692240 RCX:
02c5
[433948.857603] RDX: 02c6 RSI: 8817c30be540 RDI:
8817e7692240
[433948.864881] RBP: 880d282dbda8 R08:  R09:

[433948.871988] R10: 880c474a0c60 R11: 8812c3c44310 R12:
0008
[433948.879224] R13: 88182d8a9100 R14: 880c47b981a0 R15:
88182d8a9100
[433948.886514] FS:  () GS:88184ee8()
knlGS:
[433948.894678] CS:  0010 DS:  ES:  CR0: 80050033
[433948.900609] CR2: 0001000d CR3: 01a0b000 CR4:
000406e0
[433948.907840] Stack:
[433948.909876]  88182d8a9100 8817e7692240 880d282dbdc8
8117668e
[433948.917331]  8817e7692240 8812c3c44300 880d282dbdf8
8117672a
[433948.924799]   8812c3c44300 0010
88182d8a9100
[433948.932234] Call Trace:
[433948.934780]  [] put_pipe_info+0x51/0x56
[433948.940320]  [] pipe_release+0x97/0xa1
[433948.945734]  [] __fput+0xf9/0x1b8
[433948.950747]  [] fput+0xe/0x10
[433948.955665]  [] task_work_run+0x8a/0xa1
[433948.961121]  [] do_exit+0x400/0x8ba
[433948.966224]  [] ? fdput_pos.isra.10+0x29/0x30
[433948.972306]  [] do_group_exit+0x44/0x98
[433948.977847]  [] SyS_exit_group+0x14/0x14
[433948.983423]  [] system_call_fastpath+0x12/0x6a
[433948.989477] Code: 8b bb 88 00 00 00 e8 73 a5 ef ff 44 3b 63 48 73
22 49 63 f4 48 6b f6 28 48 03 b3 80 00 00 00 48 8b 46 10 48 85 c0 74
06 48 89 df  50 10 41 ff c4 eb d8 48 8b 7b 68 48 85 ff 74 07 31 f6
e8 be
[433949.008413] RIP  [] free_pipe_info+0x54/0x85
[433949.014335]  RSP 
[433949.017850] CR2: 0001000d
[433949.021820] ---[ end trace 95f92776b6f34085 ]---


For this one, the IP maps to fs/pipe.c:651

 642 void free_pipe_info(struct pipe_inode_info *pipe)
 643 {
 644 int i;
 645
 646 account_pipe_buffers(pipe, pipe->buffers, 0);
 647 free_uid(pipe->user);
 648 for (i = 0; i < pipe->buffers; i++) {
 649 struct pipe_buffer *buf = pipe->bufs + i;
 650 if (buf->ops)
 651 buf->ops->release(pipe, buf);
 652 }
 653 if (pipe->tmp_page)
 654 __free_page(pipe->tmp_page);
 655 kfree(pipe->bufs);
 656 kfree(pipe);
 657 }

And the faulty instruction is:

All code

   0:   8b bb 88 00 00 00   mov0x88(%rbx),%edi
   6:   e8 73 a5 ef ff  callq  0xffefa57e
   b:   44 3b 63 48 cmp0x48(%rbx),%r12d
   f:   73 22   jae0x33
  11:   49 63 f4movslq %r12d,%rsi
  14:   48 6b f6 28 imul   $0x28,%rsi,%rsi
  18:   48 03 b3 80 00 00 00add0x80(%rbx),%rsi
  1f:   48 8b 46 10 mov0x10(%rsi),%rax
  23:   48 85 c0test   %rax,%rax
  26:   74 06   je 0x2e
  28:   48 89 dfmov%rbx,%rdi
  2b:*  ff 50 10callq  *0x10(%rax)  <--
trapping instruction
  2e:   41 ff c4inc%r12d
  31:   eb d8   jmp0xb
  33:   48 8b 7b 68 mov0x68(%rbx),%rdi
  37:   48 85 fftest   %rdi,%rdi
  3a:   74 07   je 0x43
  3c:   31 f6   xor%esi,%esi

It looks like pipe->bufs was already freed for some reason. The code
is pretty solid, I can't find out any bug, as there should be only one
thread could invoke free_pipe_info().

Kdump is not properly setup on this host, so unfortunately I can't
inspect further on struct pipe_inode_info in memory.

Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
subtle use-after-free of pipe_inode_info"), it is probably related.


Kernel crash in free_pipe_info()

2017-10-30 Thread Cong Wang
Hi, Linus


We got more than a dozen of kernel crashes at free_pipe_info() on our
4.1 kernel, they are all very similar to this one (with slightly
different faulty addresses):


[433948.755473] BUG: unable to handle kernel paging request at 0001000d
[433948.762570] IP: [] free_pipe_info+0x54/0x85
[433948.768493] PGD 0
[433948.770537] Oops:  [#1] SMP
[433948.773768] Modules linked in: sch_htb cls_basic act_mirred
cls_u32 veth sch_ingress fuse cpufreq_ondemand intel_rapl iosf_mbi
x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt
iTCO_vendor_support dcdbas microcode sb_edac wmi edac_core i2c_i801
hed lpc_ich mfd_core ioatdma shpchp lp parport tcp_diag inet_diag
ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs
libcrc32c crc32c_intel igb ptp pps_core i2c_algo_bit dca i2c_core ipv6
[433948.813421] CPU: 10 PID: 50667 Comm: rsync Not tainted 4.1.41.el7.x86_64 #1
[433948.821498] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[433948.829247] task: 880f47fde240 ti: 880d282d8000 task.ti:
880d282d8000
[433948.836770] RIP: 0010:[]  []
free_pipe_info+0x54/0x85
[433948.845088] RSP: 0018:880d282dbd98  EFLAGS: 00010202
[433948.850473] RAX: fffd RBX: 8817e7692240 RCX:
02c5
[433948.857603] RDX: 02c6 RSI: 8817c30be540 RDI:
8817e7692240
[433948.864881] RBP: 880d282dbda8 R08:  R09:

[433948.871988] R10: 880c474a0c60 R11: 8812c3c44310 R12:
0008
[433948.879224] R13: 88182d8a9100 R14: 880c47b981a0 R15:
88182d8a9100
[433948.886514] FS:  () GS:88184ee8()
knlGS:
[433948.894678] CS:  0010 DS:  ES:  CR0: 80050033
[433948.900609] CR2: 0001000d CR3: 01a0b000 CR4:
000406e0
[433948.907840] Stack:
[433948.909876]  88182d8a9100 8817e7692240 880d282dbdc8
8117668e
[433948.917331]  8817e7692240 8812c3c44300 880d282dbdf8
8117672a
[433948.924799]   8812c3c44300 0010
88182d8a9100
[433948.932234] Call Trace:
[433948.934780]  [] put_pipe_info+0x51/0x56
[433948.940320]  [] pipe_release+0x97/0xa1
[433948.945734]  [] __fput+0xf9/0x1b8
[433948.950747]  [] fput+0xe/0x10
[433948.955665]  [] task_work_run+0x8a/0xa1
[433948.961121]  [] do_exit+0x400/0x8ba
[433948.966224]  [] ? fdput_pos.isra.10+0x29/0x30
[433948.972306]  [] do_group_exit+0x44/0x98
[433948.977847]  [] SyS_exit_group+0x14/0x14
[433948.983423]  [] system_call_fastpath+0x12/0x6a
[433948.989477] Code: 8b bb 88 00 00 00 e8 73 a5 ef ff 44 3b 63 48 73
22 49 63 f4 48 6b f6 28 48 03 b3 80 00 00 00 48 8b 46 10 48 85 c0 74
06 48 89 df  50 10 41 ff c4 eb d8 48 8b 7b 68 48 85 ff 74 07 31 f6
e8 be
[433949.008413] RIP  [] free_pipe_info+0x54/0x85
[433949.014335]  RSP 
[433949.017850] CR2: 0001000d
[433949.021820] ---[ end trace 95f92776b6f34085 ]---


For this one, the IP maps to fs/pipe.c:651

 642 void free_pipe_info(struct pipe_inode_info *pipe)
 643 {
 644 int i;
 645
 646 account_pipe_buffers(pipe, pipe->buffers, 0);
 647 free_uid(pipe->user);
 648 for (i = 0; i < pipe->buffers; i++) {
 649 struct pipe_buffer *buf = pipe->bufs + i;
 650 if (buf->ops)
 651 buf->ops->release(pipe, buf);
 652 }
 653 if (pipe->tmp_page)
 654 __free_page(pipe->tmp_page);
 655 kfree(pipe->bufs);
 656 kfree(pipe);
 657 }

And the faulty instruction is:

All code

   0:   8b bb 88 00 00 00   mov0x88(%rbx),%edi
   6:   e8 73 a5 ef ff  callq  0xffefa57e
   b:   44 3b 63 48 cmp0x48(%rbx),%r12d
   f:   73 22   jae0x33
  11:   49 63 f4movslq %r12d,%rsi
  14:   48 6b f6 28 imul   $0x28,%rsi,%rsi
  18:   48 03 b3 80 00 00 00add0x80(%rbx),%rsi
  1f:   48 8b 46 10 mov0x10(%rsi),%rax
  23:   48 85 c0test   %rax,%rax
  26:   74 06   je 0x2e
  28:   48 89 dfmov%rbx,%rdi
  2b:*  ff 50 10callq  *0x10(%rax)  <--
trapping instruction
  2e:   41 ff c4inc%r12d
  31:   eb d8   jmp0xb
  33:   48 8b 7b 68 mov0x68(%rbx),%rdi
  37:   48 85 fftest   %rdi,%rdi
  3a:   74 07   je 0x43
  3c:   31 f6   xor%esi,%esi

It looks like pipe->bufs was already freed for some reason. The code
is pretty solid, I can't find out any bug, as there should be only one
thread could invoke free_pipe_info().

Kdump is not properly setup on this host, so unfortunately I can't
inspect further on struct pipe_inode_info in memory.

Although it is too early to blame your commit b0d8d2292160 ("vfs: fix
subtle use-after-free of pipe_inode_info"), it is probably related.