Re: Kernel crash in free_pipe_info()
On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewerwrote: > > This looks familiar... > > https://github.com/moby/moby/issues/34472 > > From the bug report: > "In particular, it looks like either docker-containerd or > docker-containerd-shim (the log is cut off) has a pipe open that is > causing a kernel BUG when attempting to kill the process. Fun times." Interestingly, some of the "reproducers" show a totally different problem, with kernel BUG at /build/linux-5EyXrQ/linux-4.4.0/mm/slub.c:1495! which is this BUG(): 1493 if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 1494 pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK); 1495 BUG(); 1496 } which implies a completely invalid gfp mask. The stack trace shows it's unix_stream_sendmsg -> sock_alloc_send_pskb allocation, which uses "sk->sk_allocation". Odd. af_unix just does sk->sk_allocation = GFP_KERNEL_ACCOUNT; That is new since v4.4, though - but before that it should have been set by sock_init_data to just GFP_KERNEL. So there it looks like a socket is entirely corrupted. So that other backtrace looks entirely bogus too. More "that looks like corrupted kernel data structures". Enabling SLUB debugging would be really good here too. But that is still quite an old kernel. I wish somebody could reproduce this with something newer. I added those wishes to the github thing. Linus
Re: Kernel crash in free_pipe_info()
On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer wrote: > > This looks familiar... > > https://github.com/moby/moby/issues/34472 > > From the bug report: > "In particular, it looks like either docker-containerd or > docker-containerd-shim (the log is cut off) has a pipe open that is > causing a kernel BUG when attempting to kill the process. Fun times." Interestingly, some of the "reproducers" show a totally different problem, with kernel BUG at /build/linux-5EyXrQ/linux-4.4.0/mm/slub.c:1495! which is this BUG(): 1493 if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 1494 pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK); 1495 BUG(); 1496 } which implies a completely invalid gfp mask. The stack trace shows it's unix_stream_sendmsg -> sock_alloc_send_pskb allocation, which uses "sk->sk_allocation". Odd. af_unix just does sk->sk_allocation = GFP_KERNEL_ACCOUNT; That is new since v4.4, though - but before that it should have been set by sock_init_data to just GFP_KERNEL. So there it looks like a socket is entirely corrupted. So that other backtrace looks entirely bogus too. More "that looks like corrupted kernel data structures". Enabling SLUB debugging would be really good here too. But that is still quite an old kernel. I wish somebody could reproduce this with something newer. I added those wishes to the github thing. Linus
Re: Kernel crash in free_pipe_info()
Hi, Simon On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewerwrote: > This looks familiar... > > https://github.com/moby/moby/issues/34472 > > From the bug report: > "In particular, it looks like either docker-containerd or > docker-containerd-shim (the log is cut off) has a pipe open that is > causing a kernel BUG when attempting to kill the process. Fun times." Yeah, looks nearly the same. If you know how to reproduce it reliably, that would help a lot. Also, as suggested by Linus, turning on slub debug would also help. Or enable kdump to get a core dump. Thanks!
Re: Kernel crash in free_pipe_info()
Hi, Simon On Thu, Nov 9, 2017 at 10:07 PM, Simon Brewer wrote: > This looks familiar... > > https://github.com/moby/moby/issues/34472 > > From the bug report: > "In particular, it looks like either docker-containerd or > docker-containerd-shim (the log is cut off) has a pipe open that is > causing a kernel BUG when attempting to kill the process. Fun times." Yeah, looks nearly the same. If you know how to reproduce it reliably, that would help a lot. Also, as suggested by Linus, turning on slub debug would also help. Or enable kdump to get a core dump. Thanks!
Re: Kernel crash in free_pipe_info()
On 1 November 2017 at 14:19, Cong Wangwrote: > On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds > wrote: >> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: >>> >>> 1. The faulty addresses are all near 0001, with one exception >>> of null (which is the most recent one) >> >> Well, they're at 8(%rax), except for that last case. >> >> And in every case (_including_ that last case), %rax has a very >> interesting pattern.. That's the (bad) buf->ops pointer that was >> loaded from the somehow corrupted "buf". >> >> The values in all cases are >> >> fffa >> fffd >> fff1 >> fff7 >> fff4 >> fffa >> fffd >> fffd >> fffa >> ffe8 >> fff1 >> fff7 >> >> which kind of looks like a 32-bit error value. So we have (n, val, (errno)): >> >> 1 -24 (EMFILE) >> 2 -15 (ENOTBLK) >> 1 -12 (ENOMEM) >> 2 -9 (EBADF) >> 3 -6 (ENXIO) >> 3 -3 (ESRCH) >> >> none of which makes any sense to me, but it's an interesting pattern >> nonetheless. > > > Yeah, good find! > > >> >>> 2. R12 register, which should map to the local vairable 'i', is always 0x8 >>> at the time of crash. >> >> So _if_ this is some kind of use-after-free thing, and the allocation >> got re-used for something else, that might just be related to whatever >> ends up being the offset that is filled in with the (int) error >> number. >> >> Except the offset is that %r12*0x28+0x10, so we're talking a byte >> offset of 330 bytes into the allocation, and apparently the eight >> previous (0-7) iterations were fine. >> >> Which is really odd. >> >> I'm not seeing anything that makes sense. I'll have to think about this. >> >> I'm assuming you don't have slub debugging enabled, and no way to >> enable it and try to catch this? > > We enable it at compile-time but not at run-time: > > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB=y > CONFIG_SLUB_CPU_PARTIAL=y > # CONFIG_SLUB_DEBUG_ON is not set > # CONFIG_SLUB_STATS is not set > > I can try to manually add slub_debug in boot parameters, but still > have no idea how and when can trigger this bug again. > > > Thanks! This looks familiar... https://github.com/moby/moby/issues/34472 >From the bug report: "In particular, it looks like either docker-containerd or docker-containerd-shim (the log is cut off) has a pipe open that is causing a kernel BUG when attempting to kill the process. Fun times."
Re: Kernel crash in free_pipe_info()
On 1 November 2017 at 14:19, Cong Wang wrote: > On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds > wrote: >> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: >>> >>> 1. The faulty addresses are all near 0001, with one exception >>> of null (which is the most recent one) >> >> Well, they're at 8(%rax), except for that last case. >> >> And in every case (_including_ that last case), %rax has a very >> interesting pattern.. That's the (bad) buf->ops pointer that was >> loaded from the somehow corrupted "buf". >> >> The values in all cases are >> >> fffa >> fffd >> fff1 >> fff7 >> fff4 >> fffa >> fffd >> fffd >> fffa >> ffe8 >> fff1 >> fff7 >> >> which kind of looks like a 32-bit error value. So we have (n, val, (errno)): >> >> 1 -24 (EMFILE) >> 2 -15 (ENOTBLK) >> 1 -12 (ENOMEM) >> 2 -9 (EBADF) >> 3 -6 (ENXIO) >> 3 -3 (ESRCH) >> >> none of which makes any sense to me, but it's an interesting pattern >> nonetheless. > > > Yeah, good find! > > >> >>> 2. R12 register, which should map to the local vairable 'i', is always 0x8 >>> at the time of crash. >> >> So _if_ this is some kind of use-after-free thing, and the allocation >> got re-used for something else, that might just be related to whatever >> ends up being the offset that is filled in with the (int) error >> number. >> >> Except the offset is that %r12*0x28+0x10, so we're talking a byte >> offset of 330 bytes into the allocation, and apparently the eight >> previous (0-7) iterations were fine. >> >> Which is really odd. >> >> I'm not seeing anything that makes sense. I'll have to think about this. >> >> I'm assuming you don't have slub debugging enabled, and no way to >> enable it and try to catch this? > > We enable it at compile-time but not at run-time: > > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB=y > CONFIG_SLUB_CPU_PARTIAL=y > # CONFIG_SLUB_DEBUG_ON is not set > # CONFIG_SLUB_STATS is not set > > I can try to manually add slub_debug in boot parameters, but still > have no idea how and when can trigger this bug again. > > > Thanks! This looks familiar... https://github.com/moby/moby/issues/34472 >From the bug report: "In particular, it looks like either docker-containerd or docker-containerd-shim (the log is cut off) has a pipe open that is causing a kernel BUG when attempting to kill the process. Fun times."
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvaldswrote: > On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: >> >> 1. The faulty addresses are all near 0001, with one exception >> of null (which is the most recent one) > > Well, they're at 8(%rax), except for that last case. > > And in every case (_including_ that last case), %rax has a very > interesting pattern.. That's the (bad) buf->ops pointer that was > loaded from the somehow corrupted "buf". > > The values in all cases are > > fffa > fffd > fff1 > fff7 > fff4 > fffa > fffd > fffd > fffa > ffe8 > fff1 > fff7 > > which kind of looks like a 32-bit error value. So we have (n, val, (errno)): > > 1 -24 (EMFILE) > 2 -15 (ENOTBLK) > 1 -12 (ENOMEM) > 2 -9 (EBADF) > 3 -6 (ENXIO) > 3 -3 (ESRCH) > > none of which makes any sense to me, but it's an interesting pattern > nonetheless. Yeah, good find! > >> 2. R12 register, which should map to the local vairable 'i', is always 0x8 >> at the time of crash. > > So _if_ this is some kind of use-after-free thing, and the allocation > got re-used for something else, that might just be related to whatever > ends up being the offset that is filled in with the (int) error > number. > > Except the offset is that %r12*0x28+0x10, so we're talking a byte > offset of 330 bytes into the allocation, and apparently the eight > previous (0-7) iterations were fine. > > Which is really odd. > > I'm not seeing anything that makes sense. I'll have to think about this. > > I'm assuming you don't have slub debugging enabled, and no way to > enable it and try to catch this? We enable it at compile-time but not at run-time: CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y CONFIG_SLUB_CPU_PARTIAL=y # CONFIG_SLUB_DEBUG_ON is not set # CONFIG_SLUB_STATS is not set I can try to manually add slub_debug in boot parameters, but still have no idea how and when can trigger this bug again. Thanks!
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: >> >> 1. The faulty addresses are all near 0001, with one exception >> of null (which is the most recent one) > > Well, they're at 8(%rax), except for that last case. > > And in every case (_including_ that last case), %rax has a very > interesting pattern.. That's the (bad) buf->ops pointer that was > loaded from the somehow corrupted "buf". > > The values in all cases are > > fffa > fffd > fff1 > fff7 > fff4 > fffa > fffd > fffd > fffa > ffe8 > fff1 > fff7 > > which kind of looks like a 32-bit error value. So we have (n, val, (errno)): > > 1 -24 (EMFILE) > 2 -15 (ENOTBLK) > 1 -12 (ENOMEM) > 2 -9 (EBADF) > 3 -6 (ENXIO) > 3 -3 (ESRCH) > > none of which makes any sense to me, but it's an interesting pattern > nonetheless. Yeah, good find! > >> 2. R12 register, which should map to the local vairable 'i', is always 0x8 >> at the time of crash. > > So _if_ this is some kind of use-after-free thing, and the allocation > got re-used for something else, that might just be related to whatever > ends up being the offset that is filled in with the (int) error > number. > > Except the offset is that %r12*0x28+0x10, so we're talking a byte > offset of 330 bytes into the allocation, and apparently the eight > previous (0-7) iterations were fine. > > Which is really odd. > > I'm not seeing anything that makes sense. I'll have to think about this. > > I'm assuming you don't have slub debugging enabled, and no way to > enable it and try to catch this? We enable it at compile-time but not at run-time: CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y CONFIG_SLUB_CPU_PARTIAL=y # CONFIG_SLUB_DEBUG_ON is not set # CONFIG_SLUB_STATS is not set I can try to manually add slub_debug in boot parameters, but still have no idea how and when can trigger this bug again. Thanks!
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 9:44 PM, Al Virowrote: > On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote: >> >> Well, they're at 8(%rax), except for that last case. > > 0x10(%rax)? Duh, yes. >> Except the offset is that %r12*0x28+0x10, so we're talking a byte >> offset of 330 bytes into the allocation, and apparently the eight >> previous (0-7) iterations were fine. >> >> Which is really odd. > > I wonder what pipe->buffers is equal to here... Sadly, we never even bother loading it so it doesn't show up in the register state, we just iterate over the whole table. The (one) ppc oops that looks like it might be the same issue has a totally different pattern. Instead of having that "error number" looking thing as the invalid pointer base, it has the magic number from a spinlock. And rather than being about "pipe->bufs[]" array, it's the pipe pointer itself that seems corrupted, and thus the oops happens in the account_pipe_buffers() code instead of in the loop over the buffers. Of course, both are consistent with that "pipe_inode_info" simply having been overwritten by something else (possibly, but not necessarily, due to a use-after-free). > FWIW, I would try to slap > if (buf->ops && (unsigned long)buf->ops <= 0x) > dump the living hell out of that thing > and see what it catches... Actually, I'm looking at *another* error path - the one in named pipes. Named pipes are why we do that nasty "inode->i_pipe" thing - if it was for just the regular pipes, we'd be able to just do file->f_private_data and be done with it. But named pipes have to have the pipe data associated with a particular inode. And that code actually does look wrong. Look at fifo_open(): it increments the pipe->files as it sets filp->private_data to point to the pipe_inode_info. Good so far. But look at the error case. It does that put_pipe_info() to release it again, but filp->private_data still contains the pipe_inode_info pointer. So what happens on a failed open of a named pipe? The *normal* code all is very careful to _not_ use "fput()", but instead use "put_filp(f)", which will just free the file pointer. But what if somebody does "vfs_open()" on one of those things, and then does "fput()" in the failure case? In "do_last()" we have that FILE_OPENED protection: if (unlikely(error) && (*opened & FILE_OPENED)) fput(file); and path_openat() is again very careful to then use put_filp(file); if FILE_OPENED was never set. And do_o_path() does the same. I'm not seeing anybody who does the wrong thing, but there's a number of ways to get this entirely wrong, and I worry some path does. I would be a *lot* happier if we didn't have that very subtle fput()-vs-put_filp() issue going on. Again, I cannot see anything wrong, but this feels very very fragile to me. Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 9:44 PM, Al Viro wrote: > On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote: >> >> Well, they're at 8(%rax), except for that last case. > > 0x10(%rax)? Duh, yes. >> Except the offset is that %r12*0x28+0x10, so we're talking a byte >> offset of 330 bytes into the allocation, and apparently the eight >> previous (0-7) iterations were fine. >> >> Which is really odd. > > I wonder what pipe->buffers is equal to here... Sadly, we never even bother loading it so it doesn't show up in the register state, we just iterate over the whole table. The (one) ppc oops that looks like it might be the same issue has a totally different pattern. Instead of having that "error number" looking thing as the invalid pointer base, it has the magic number from a spinlock. And rather than being about "pipe->bufs[]" array, it's the pipe pointer itself that seems corrupted, and thus the oops happens in the account_pipe_buffers() code instead of in the loop over the buffers. Of course, both are consistent with that "pipe_inode_info" simply having been overwritten by something else (possibly, but not necessarily, due to a use-after-free). > FWIW, I would try to slap > if (buf->ops && (unsigned long)buf->ops <= 0x) > dump the living hell out of that thing > and see what it catches... Actually, I'm looking at *another* error path - the one in named pipes. Named pipes are why we do that nasty "inode->i_pipe" thing - if it was for just the regular pipes, we'd be able to just do file->f_private_data and be done with it. But named pipes have to have the pipe data associated with a particular inode. And that code actually does look wrong. Look at fifo_open(): it increments the pipe->files as it sets filp->private_data to point to the pipe_inode_info. Good so far. But look at the error case. It does that put_pipe_info() to release it again, but filp->private_data still contains the pipe_inode_info pointer. So what happens on a failed open of a named pipe? The *normal* code all is very careful to _not_ use "fput()", but instead use "put_filp(f)", which will just free the file pointer. But what if somebody does "vfs_open()" on one of those things, and then does "fput()" in the failure case? In "do_last()" we have that FILE_OPENED protection: if (unlikely(error) && (*opened & FILE_OPENED)) fput(file); and path_openat() is again very careful to then use put_filp(file); if FILE_OPENED was never set. And do_o_path() does the same. I'm not seeing anybody who does the wrong thing, but there's a number of ways to get this entirely wrong, and I worry some path does. I would be a *lot* happier if we didn't have that very subtle fput()-vs-put_filp() issue going on. Again, I cannot see anything wrong, but this feels very very fragile to me. Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 08:06:23PM -0700, Linus Torvalds wrote: > We do that "free_pipe_info(inode->i_pipe);", but we never actually > clear inode->i_pipe, so now we have an inode that looks like a pipe > inode, and has a stale pointer to a pipe_inode_info. > > It all looks technically correct. It's fine to use put_filp(), because > the file pointer has never really been used. And the inode should > never get re-used anyway without going through the whole reinit in > inode_init_always(). > > So I don't see anything *wrong*, but I see a lot that is just unusual, FWIW, it's really brittle - consider if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) i_readcount_inc(path->dentry->d_inode); in alloc_file(). It's not the source of trouble in this case, but only because it's the second call that gets FMODE_READ; reorder them in create_pipe_files() and you've got a bug. I considered using fput() there, but that would've required manually decrementing pipe->files first, which made it rather unappealing... I don't see anything relevant there, but that's not saying much - flu and debugging do not mix well, and lack of sleep also doesn't help ;-/
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 08:06:23PM -0700, Linus Torvalds wrote: > We do that "free_pipe_info(inode->i_pipe);", but we never actually > clear inode->i_pipe, so now we have an inode that looks like a pipe > inode, and has a stale pointer to a pipe_inode_info. > > It all looks technically correct. It's fine to use put_filp(), because > the file pointer has never really been used. And the inode should > never get re-used anyway without going through the whole reinit in > inode_init_always(). > > So I don't see anything *wrong*, but I see a lot that is just unusual, FWIW, it's really brittle - consider if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) i_readcount_inc(path->dentry->d_inode); in alloc_file(). It's not the source of trouble in this case, but only because it's the second call that gets FMODE_READ; reorder them in create_pipe_files() and you've got a bug. I considered using fput() there, but that would've required manually decrementing pipe->files first, which made it rather unappealing... I don't see anything relevant there, but that's not saying much - flu and debugging do not mix well, and lack of sleep also doesn't help ;-/
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 6:19 PM, Cong Wangwrote: > > > > 1. The faulty addresses are all near 0001, with one exception > > of null (which is the most recent one) > > Well, they're at 8(%rax), except for that last case. 0x10(%rax)? > And in every case (_including_ that last case), %rax has a very > interesting pattern.. That's the (bad) buf->ops pointer that was > loaded from the somehow corrupted "buf". > So _if_ this is some kind of use-after-free thing, and the allocation > got re-used for something else, that might just be related to whatever > ends up being the offset that is filled in with the (int) error > number. > > Except the offset is that %r12*0x28+0x10, so we're talking a byte > offset of 330 bytes into the allocation, and apparently the eight > previous (0-7) iterations were fine. > > Which is really odd. I wonder what pipe->buffers is equal to here... > I'm not seeing anything that makes sense. I'll have to think about this. > > I'm assuming you don't have slub debugging enabled, and no way to > enable it and try to catch this? FWIW, I would try to slap if (buf->ops && (unsigned long)buf->ops <= 0x) dump the living hell out of that thing and see what it catches...
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 07:08:46PM -0700, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: > > > > 1. The faulty addresses are all near 0001, with one exception > > of null (which is the most recent one) > > Well, they're at 8(%rax), except for that last case. 0x10(%rax)? > And in every case (_including_ that last case), %rax has a very > interesting pattern.. That's the (bad) buf->ops pointer that was > loaded from the somehow corrupted "buf". > So _if_ this is some kind of use-after-free thing, and the allocation > got re-used for something else, that might just be related to whatever > ends up being the offset that is filled in with the (int) error > number. > > Except the offset is that %r12*0x28+0x10, so we're talking a byte > offset of 330 bytes into the allocation, and apparently the eight > previous (0-7) iterations were fine. > > Which is really odd. I wonder what pipe->buffers is equal to here... > I'm not seeing anything that makes sense. I'll have to think about this. > > I'm assuming you don't have slub debugging enabled, and no way to > enable it and try to catch this? FWIW, I would try to slap if (buf->ops && (unsigned long)buf->ops <= 0x) dump the living hell out of that thing and see what it catches...
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvaldswrote: > > I'm not seeing anything that makes sense. I'll have to think about this. Al, would you mind taking a look at the error handling in create_pipe_files(). In particular, look here: - we start out allocating the inode with "get_pipe_inode(). That sets up a inode->i_pipe, with pipe->files initialized to 2. Fine. We're going to have two file descriptors. - we then create the dummy path: path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, _name); fine fine. Again, this looks all good for the success cases. But the *error* cases are a bit dodgy, aren't they? We have three different error cases: - we couldn't even allocate a dentry. We do free_pipe_info(inode->i_pipe); iput(inode); - we couldn't allocate any file at all: free_pipe_info(inode->i_pipe); path_put(); - we allocated the first file, but not the second: put_filp(f); free_pipe_info(inode->i_pipe); path_put(); and it worries me a bit that in all those error cases, we end up doing that "free_pipe_info()", but we basically do this half-arsed job of freeing things. For example, we use "put_filp()" to free the file pointer, not "fput()". We do that "free_pipe_info(inode->i_pipe);", but we never actually clear inode->i_pipe, so now we have an inode that looks like a pipe inode, and has a stale pointer to a pipe_inode_info. It all looks technically correct. It's fine to use put_filp(), because the file pointer has never really been used. And the inode should never get re-used anyway without going through the whole reinit in inode_init_always(). So I don't see anything *wrong*, but I see a lot that is just unusual, and seems to depend on half-initialized state being fine. Can you look at this? Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds wrote: > > I'm not seeing anything that makes sense. I'll have to think about this. Al, would you mind taking a look at the error handling in create_pipe_files(). In particular, look here: - we start out allocating the inode with "get_pipe_inode(). That sets up a inode->i_pipe, with pipe->files initialized to 2. Fine. We're going to have two file descriptors. - we then create the dummy path: path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, _name); fine fine. Again, this looks all good for the success cases. But the *error* cases are a bit dodgy, aren't they? We have three different error cases: - we couldn't even allocate a dentry. We do free_pipe_info(inode->i_pipe); iput(inode); - we couldn't allocate any file at all: free_pipe_info(inode->i_pipe); path_put(); - we allocated the first file, but not the second: put_filp(f); free_pipe_info(inode->i_pipe); path_put(); and it worries me a bit that in all those error cases, we end up doing that "free_pipe_info()", but we basically do this half-arsed job of freeing things. For example, we use "put_filp()" to free the file pointer, not "fput()". We do that "free_pipe_info(inode->i_pipe);", but we never actually clear inode->i_pipe, so now we have an inode that looks like a pipe inode, and has a stale pointer to a pipe_inode_info. It all looks technically correct. It's fine to use put_filp(), because the file pointer has never really been used. And the inode should never get re-used anyway without going through the whole reinit in inode_init_always(). So I don't see anything *wrong*, but I see a lot that is just unusual, and seems to depend on half-initialized state being fine. Can you look at this? Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 6:19 PM, Cong Wangwrote: > > 1. The faulty addresses are all near 0001, with one exception > of null (which is the most recent one) Well, they're at 8(%rax), except for that last case. And in every case (_including_ that last case), %rax has a very interesting pattern.. That's the (bad) buf->ops pointer that was loaded from the somehow corrupted "buf". The values in all cases are fffa fffd fff1 fff7 fff4 fffa fffd fffd fffa ffe8 fff1 fff7 which kind of looks like a 32-bit error value. So we have (n, val, (errno)): 1 -24 (EMFILE) 2 -15 (ENOTBLK) 1 -12 (ENOMEM) 2 -9 (EBADF) 3 -6 (ENXIO) 3 -3 (ESRCH) none of which makes any sense to me, but it's an interesting pattern nonetheless. > 2. R12 register, which should map to the local vairable 'i', is always 0x8 > at the time of crash. So _if_ this is some kind of use-after-free thing, and the allocation got re-used for something else, that might just be related to whatever ends up being the offset that is filled in with the (int) error number. Except the offset is that %r12*0x28+0x10, so we're talking a byte offset of 330 bytes into the allocation, and apparently the eight previous (0-7) iterations were fine. Which is really odd. I'm not seeing anything that makes sense. I'll have to think about this. I'm assuming you don't have slub debugging enabled, and no way to enable it and try to catch this? Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang wrote: > > 1. The faulty addresses are all near 0001, with one exception > of null (which is the most recent one) Well, they're at 8(%rax), except for that last case. And in every case (_including_ that last case), %rax has a very interesting pattern.. That's the (bad) buf->ops pointer that was loaded from the somehow corrupted "buf". The values in all cases are fffa fffd fff1 fff7 fff4 fffa fffd fffd fffa ffe8 fff1 fff7 which kind of looks like a 32-bit error value. So we have (n, val, (errno)): 1 -24 (EMFILE) 2 -15 (ENOTBLK) 1 -12 (ENOMEM) 2 -9 (EBADF) 3 -6 (ENXIO) 3 -3 (ESRCH) none of which makes any sense to me, but it's an interesting pattern nonetheless. > 2. R12 register, which should map to the local vairable 'i', is always 0x8 > at the time of crash. So _if_ this is some kind of use-after-free thing, and the allocation got re-used for something else, that might just be related to whatever ends up being the offset that is filled in with the (int) error number. Except the offset is that %r12*0x28+0x10, so we're talking a byte offset of 330 bytes into the allocation, and apparently the eight previous (0-7) iterations were fine. Which is really odd. I'm not seeing anything that makes sense. I'll have to think about this. I'm assuming you don't have slub debugging enabled, and no way to enable it and try to catch this? Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvaldswrote: > On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >> >> We got more than a dozen of kernel crashes at free_pipe_info() on our >> 4.1 kernel, they are all very similar to this one (with slightly >> different faulty addresses): > > Were it not for the pointer to the much more recent powerpc version at > > https://www.spinics.net/lists/kernel/msg2616441.html > > that you referenced, I'd be inclined to ignore this as much too old. > Al fixed a number of issues with splice() some time ago. > > And honestly, splice() is pretty much the only thing that I can think > of that can trigger issues wrt the pipe_inode_info structure. > > But as you say, we've had subtle races just due to the named pipes: It is unlikely that we use splice() or named pipes, I highly suspect it is caused by the pipe(), although stack traces can't tell. > >> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix >> subtle use-after-free of pipe_inode_info"), it is probably related. > > I do hate the games we play with the stupid pipe_inode_info. The way > we have locking in the inode, instead of in the pipe_inode_info > itself. > > There are hysterical raisins for that, and due to names piped we have > to keep the reference to the pipe_inode_info in the inode, but it's > still very very annoying. > > Al, do you see anything? > > Cong, do you have any way to trigger these? Is there any pattern to > when they happen or what is going on when they do? > I have no idea how this was triggered, all of the stack traces are reported by ABRT service on each host. Note, since kdump is not properly config'ed, there could be even more machines crashed silently in a corner without even an ABRT report. I will try to gather more information. Thanks!
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >> >> We got more than a dozen of kernel crashes at free_pipe_info() on our >> 4.1 kernel, they are all very similar to this one (with slightly >> different faulty addresses): > > Were it not for the pointer to the much more recent powerpc version at > > https://www.spinics.net/lists/kernel/msg2616441.html > > that you referenced, I'd be inclined to ignore this as much too old. > Al fixed a number of issues with splice() some time ago. > > And honestly, splice() is pretty much the only thing that I can think > of that can trigger issues wrt the pipe_inode_info structure. > > But as you say, we've had subtle races just due to the named pipes: It is unlikely that we use splice() or named pipes, I highly suspect it is caused by the pipe(), although stack traces can't tell. > >> Although it is too early to blame your commit b0d8d2292160 ("vfs: fix >> subtle use-after-free of pipe_inode_info"), it is probably related. > > I do hate the games we play with the stupid pipe_inode_info. The way > we have locking in the inode, instead of in the pipe_inode_info > itself. > > There are hysterical raisins for that, and due to names piped we have > to keep the reference to the pipe_inode_info in the inode, but it's > still very very annoying. > > Al, do you see anything? > > Cong, do you have any way to trigger these? Is there any pattern to > when they happen or what is going on when they do? > I have no idea how this was triggered, all of the stack traces are reported by ABRT service on each host. Note, since kdump is not properly config'ed, there could be even more machines crashed silently in a corner without even an ABRT report. I will try to gather more information. Thanks!
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:26 PM, Linus Torvaldswrote: > On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds > wrote: >> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >>> >>> We got more than a dozen of kernel crashes at free_pipe_info() on our >>> 4.1 kernel, they are all very similar to this one (with slightly >>> different faulty addresses): >> >> Were it not for the pointer to the much more recent powerpc version at >> >> https://www.spinics.net/lists/kernel/msg2616441.html >> >> that you referenced, I'd be inclined to ignore this as much too old. >> Al fixed a number of issues with splice() some time ago. > > Hmm. So I tried to google for this oops, and the only oops I find is > hat ppc one you reference. > > Was that how you found it too? Yes. > > The fact that google doesn't report any other oopses would seem to > indicate that this is not a widespread issue. Where are the "more than > a dozen" ones you refer to being reported? I'd like to see them all if > at all possible, to see if there's some pattern to them.. > We have thousands of machines running 4.1, so even with a dozen crashes it is still a very small percentage. :) I am attaching these stack traces that I copy-n-paste from the ABRT email reports. There are two patterns I can observe: 1. The faulty addresses are all near 0001, with one exception of null (which is the most recent one) 2. R12 register, which should map to the local vairable 'i', is always 0x8 at the time of crash. Hope it helps. Thanks! :[840243.314457] BUG: unable to handle kernel NULL pointer dereference at (null) :[840243.322420] IP: [< (null)>] (null) :[840243.327582] PGD 18782067 PUD 24d42067 PMD 0 :[840243.331989] Oops: 0010 [#1] SMP :[840243.335346] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode lp parport iTCO_wdt tcp_diag iTCO_vendor_support inet_diag sb_edac edac_core lpc_ich i2c_i801 i2c_core mfd_core ioatdma dca hed wmi shpchp ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core crc32c_intel serio_raw mlx4_core isci libsas scsi_transport_sas ipv6 :[840243.379972] CPU: 5 PID: 9355 Comm: java Not tainted 4.1.41.el7.x86_64 #1 :[840243.387803] Hardware name: SYNNEX HYVE-ZEUS/X9DRD-iF, BIOS 3.0.4 12/06/2013 :[840243.394855] task: 880866ba ti: 88036094 task.ti: 88036094 :[840243.402426] RIP: 0010:[<>] [< (null)>] (null) :[840243.410005] RSP: 0018:880360943e30 EFLAGS: 00010206 :[840243.415400] RAX: fffa RBX: 8804697c4f00 RCX: 001a :[840243.422617] RDX: 001b RSI: 8807cdc1b940 RDI: 8804697c4f00 :[840243.429835] RBP: 880360943e48 R08: R09: :[840243.437059] R10: b708 R11: 88049f34b010 R12: 0008 :[840243.444275] R13: 880866500aa0 R14: 8808689681a0 R15: 880866500aa0 :[840243.451492] FS: 7f01c40ba700() GS:88046fca() knlGS: :[840243.459672] CS: 0010 DS: ES: CR0: 80050033 :[840243.465509] CR2: CR3: 78b43000 CR4: 001406e0 :[840243.472734] Stack: :[840243.474841] 8117660f 880866500aa0 8804697c4f00 880360943e68 :[840243.482386] 8117668e 8804697c4f00 88049f34b000 880360943e98 :[840243.489933] 8117672a 88049f34b000 0010 :[840243.497479] Call Trace: :[840243.500019] [] ? free_pipe_info+0x57/0x85 :[840243.505768] [] put_pipe_info+0x51/0x56 :[840243.511254] [] pipe_release+0x97/0xa1 :[840243.516659] [] __fput+0xf9/0x1b8 :[840243.521628] [] fput+0xe/0x10 :[840243.526601] [] task_work_run+0x8a/0xa1 :[840243.532089] [] do_notify_resume+0x56/0x65 :[840243.537842] [] int_signal+0x12/0x17 :[840243.543068] Code: Bad RIP value. :[840243.546519] RIP [< (null)>] (null) :[840243.551757] RSP :[840243.555332] CR2: :[840243.559179] ---[ end trace b725063bc2e1920b ]--- :[319591.443093] IP: [] free_pipe_info+0x54/0x85 :[319591.448958] PGD 0 :[319591.450964] Oops: [#1] SMP :[319591.454327] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand xfs libcrc32c intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas microcode sb_edac lpc_ich ioatdma mfd_core shpchp i2c_i801 edac_core acpi_pad wmi hed lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel crc32c_intel igb ptp pps_core i2c_algo_bit i2c_core dca ipv6 :[319591.494830] CPU: 4 PID: 1221 Comm: sshd Not tainted 4.1.41.el7.x86_64 #1 :[319591.502659] Hardware name:
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:26 PM, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds > wrote: >> On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >>> >>> We got more than a dozen of kernel crashes at free_pipe_info() on our >>> 4.1 kernel, they are all very similar to this one (with slightly >>> different faulty addresses): >> >> Were it not for the pointer to the much more recent powerpc version at >> >> https://www.spinics.net/lists/kernel/msg2616441.html >> >> that you referenced, I'd be inclined to ignore this as much too old. >> Al fixed a number of issues with splice() some time ago. > > Hmm. So I tried to google for this oops, and the only oops I find is > hat ppc one you reference. > > Was that how you found it too? Yes. > > The fact that google doesn't report any other oopses would seem to > indicate that this is not a widespread issue. Where are the "more than > a dozen" ones you refer to being reported? I'd like to see them all if > at all possible, to see if there's some pattern to them.. > We have thousands of machines running 4.1, so even with a dozen crashes it is still a very small percentage. :) I am attaching these stack traces that I copy-n-paste from the ABRT email reports. There are two patterns I can observe: 1. The faulty addresses are all near 0001, with one exception of null (which is the most recent one) 2. R12 register, which should map to the local vairable 'i', is always 0x8 at the time of crash. Hope it helps. Thanks! :[840243.314457] BUG: unable to handle kernel NULL pointer dereference at (null) :[840243.322420] IP: [< (null)>] (null) :[840243.327582] PGD 18782067 PUD 24d42067 PMD 0 :[840243.331989] Oops: 0010 [#1] SMP :[840243.335346] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode lp parport iTCO_wdt tcp_diag iTCO_vendor_support inet_diag sb_edac edac_core lpc_ich i2c_i801 i2c_core mfd_core ioatdma dca hed wmi shpchp ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core crc32c_intel serio_raw mlx4_core isci libsas scsi_transport_sas ipv6 :[840243.379972] CPU: 5 PID: 9355 Comm: java Not tainted 4.1.41.el7.x86_64 #1 :[840243.387803] Hardware name: SYNNEX HYVE-ZEUS/X9DRD-iF, BIOS 3.0.4 12/06/2013 :[840243.394855] task: 880866ba ti: 88036094 task.ti: 88036094 :[840243.402426] RIP: 0010:[<>] [< (null)>] (null) :[840243.410005] RSP: 0018:880360943e30 EFLAGS: 00010206 :[840243.415400] RAX: fffa RBX: 8804697c4f00 RCX: 001a :[840243.422617] RDX: 001b RSI: 8807cdc1b940 RDI: 8804697c4f00 :[840243.429835] RBP: 880360943e48 R08: R09: :[840243.437059] R10: b708 R11: 88049f34b010 R12: 0008 :[840243.444275] R13: 880866500aa0 R14: 8808689681a0 R15: 880866500aa0 :[840243.451492] FS: 7f01c40ba700() GS:88046fca() knlGS: :[840243.459672] CS: 0010 DS: ES: CR0: 80050033 :[840243.465509] CR2: CR3: 78b43000 CR4: 001406e0 :[840243.472734] Stack: :[840243.474841] 8117660f 880866500aa0 8804697c4f00 880360943e68 :[840243.482386] 8117668e 8804697c4f00 88049f34b000 880360943e98 :[840243.489933] 8117672a 88049f34b000 0010 :[840243.497479] Call Trace: :[840243.500019] [] ? free_pipe_info+0x57/0x85 :[840243.505768] [] put_pipe_info+0x51/0x56 :[840243.511254] [] pipe_release+0x97/0xa1 :[840243.516659] [] __fput+0xf9/0x1b8 :[840243.521628] [] fput+0xe/0x10 :[840243.526601] [] task_work_run+0x8a/0xa1 :[840243.532089] [] do_notify_resume+0x56/0x65 :[840243.537842] [] int_signal+0x12/0x17 :[840243.543068] Code: Bad RIP value. :[840243.546519] RIP [< (null)>] (null) :[840243.551757] RSP :[840243.555332] CR2: :[840243.559179] ---[ end trace b725063bc2e1920b ]--- :[319591.443093] IP: [] free_pipe_info+0x54/0x85 :[319591.448958] PGD 0 :[319591.450964] Oops: [#1] SMP :[319591.454327] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand xfs libcrc32c intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas microcode sb_edac lpc_ich ioatdma mfd_core shpchp i2c_i801 edac_core acpi_pad wmi hed lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel crc32c_intel igb ptp pps_core i2c_algo_bit i2c_core dca ipv6 :[319591.494830] CPU: 4 PID: 1221 Comm: sshd Not tainted 4.1.41.el7.x86_64 #1 :[319591.502659] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013 :[319591.510310] task:
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvaldswrote: > On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >> >> We got more than a dozen of kernel crashes at free_pipe_info() on our >> 4.1 kernel, they are all very similar to this one (with slightly >> different faulty addresses): > > Were it not for the pointer to the much more recent powerpc version at > > https://www.spinics.net/lists/kernel/msg2616441.html > > that you referenced, I'd be inclined to ignore this as much too old. > Al fixed a number of issues with splice() some time ago. Hmm. So I tried to google for this oops, and the only oops I find is hat ppc one you reference. Was that how you found it too? The fact that google doesn't report any other oopses would seem to indicate that this is not a widespread issue. Where are the "more than a dozen" ones you refer to being reported? I'd like to see them all if at all possible, to see if there's some pattern to them.. Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 3:14 PM, Linus Torvalds wrote: > On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: >> >> We got more than a dozen of kernel crashes at free_pipe_info() on our >> 4.1 kernel, they are all very similar to this one (with slightly >> different faulty addresses): > > Were it not for the pointer to the much more recent powerpc version at > > https://www.spinics.net/lists/kernel/msg2616441.html > > that you referenced, I'd be inclined to ignore this as much too old. > Al fixed a number of issues with splice() some time ago. Hmm. So I tried to google for this oops, and the only oops I find is hat ppc one you reference. Was that how you found it too? The fact that google doesn't report any other oopses would seem to indicate that this is not a widespread issue. Where are the "more than a dozen" ones you refer to being reported? I'd like to see them all if at all possible, to see if there's some pattern to them.. Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 1:58 PM, Cong Wangwrote: > > We got more than a dozen of kernel crashes at free_pipe_info() on our > 4.1 kernel, they are all very similar to this one (with slightly > different faulty addresses): Were it not for the pointer to the much more recent powerpc version at https://www.spinics.net/lists/kernel/msg2616441.html that you referenced, I'd be inclined to ignore this as much too old. Al fixed a number of issues with splice() some time ago. And honestly, splice() is pretty much the only thing that I can think of that can trigger issues wrt the pipe_inode_info structure. But as you say, we've had subtle races just due to the named pipes: > Although it is too early to blame your commit b0d8d2292160 ("vfs: fix > subtle use-after-free of pipe_inode_info"), it is probably related. I do hate the games we play with the stupid pipe_inode_info. The way we have locking in the inode, instead of in the pipe_inode_info itself. There are hysterical raisins for that, and due to names piped we have to keep the reference to the pipe_inode_info in the inode, but it's still very very annoying. Al, do you see anything? Cong, do you have any way to trigger these? Is there any pattern to when they happen or what is going on when they do? Linus
Re: Kernel crash in free_pipe_info()
On Mon, Oct 30, 2017 at 1:58 PM, Cong Wang wrote: > > We got more than a dozen of kernel crashes at free_pipe_info() on our > 4.1 kernel, they are all very similar to this one (with slightly > different faulty addresses): Were it not for the pointer to the much more recent powerpc version at https://www.spinics.net/lists/kernel/msg2616441.html that you referenced, I'd be inclined to ignore this as much too old. Al fixed a number of issues with splice() some time ago. And honestly, splice() is pretty much the only thing that I can think of that can trigger issues wrt the pipe_inode_info structure. But as you say, we've had subtle races just due to the named pipes: > Although it is too early to blame your commit b0d8d2292160 ("vfs: fix > subtle use-after-free of pipe_inode_info"), it is probably related. I do hate the games we play with the stupid pipe_inode_info. The way we have locking in the inode, instead of in the pipe_inode_info itself. There are hysterical raisins for that, and due to names piped we have to keep the reference to the pipe_inode_info in the inode, but it's still very very annoying. Al, do you see anything? Cong, do you have any way to trigger these? Is there any pattern to when they happen or what is going on when they do? Linus
Kernel crash in free_pipe_info()
Hi, Linus We got more than a dozen of kernel crashes at free_pipe_info() on our 4.1 kernel, they are all very similar to this one (with slightly different faulty addresses): [433948.755473] BUG: unable to handle kernel paging request at 0001000d [433948.762570] IP: [] free_pipe_info+0x54/0x85 [433948.768493] PGD 0 [433948.770537] Oops: [#1] SMP [433948.773768] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas microcode sb_edac wmi edac_core i2c_i801 hed lpc_ich mfd_core ioatdma shpchp lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel igb ptp pps_core i2c_algo_bit dca i2c_core ipv6 [433948.813421] CPU: 10 PID: 50667 Comm: rsync Not tainted 4.1.41.el7.x86_64 #1 [433948.821498] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013 [433948.829247] task: 880f47fde240 ti: 880d282d8000 task.ti: 880d282d8000 [433948.836770] RIP: 0010:[] [] free_pipe_info+0x54/0x85 [433948.845088] RSP: 0018:880d282dbd98 EFLAGS: 00010202 [433948.850473] RAX: fffd RBX: 8817e7692240 RCX: 02c5 [433948.857603] RDX: 02c6 RSI: 8817c30be540 RDI: 8817e7692240 [433948.864881] RBP: 880d282dbda8 R08: R09: [433948.871988] R10: 880c474a0c60 R11: 8812c3c44310 R12: 0008 [433948.879224] R13: 88182d8a9100 R14: 880c47b981a0 R15: 88182d8a9100 [433948.886514] FS: () GS:88184ee8() knlGS: [433948.894678] CS: 0010 DS: ES: CR0: 80050033 [433948.900609] CR2: 0001000d CR3: 01a0b000 CR4: 000406e0 [433948.907840] Stack: [433948.909876] 88182d8a9100 8817e7692240 880d282dbdc8 8117668e [433948.917331] 8817e7692240 8812c3c44300 880d282dbdf8 8117672a [433948.924799] 8812c3c44300 0010 88182d8a9100 [433948.932234] Call Trace: [433948.934780] [] put_pipe_info+0x51/0x56 [433948.940320] [] pipe_release+0x97/0xa1 [433948.945734] [] __fput+0xf9/0x1b8 [433948.950747] [] fput+0xe/0x10 [433948.955665] [] task_work_run+0x8a/0xa1 [433948.961121] [] do_exit+0x400/0x8ba [433948.966224] [] ? fdput_pos.isra.10+0x29/0x30 [433948.972306] [] do_group_exit+0x44/0x98 [433948.977847] [] SyS_exit_group+0x14/0x14 [433948.983423] [] system_call_fastpath+0x12/0x6a [433948.989477] Code: 8b bb 88 00 00 00 e8 73 a5 ef ff 44 3b 63 48 73 22 49 63 f4 48 6b f6 28 48 03 b3 80 00 00 00 48 8b 46 10 48 85 c0 74 06 48 89 df 50 10 41 ff c4 eb d8 48 8b 7b 68 48 85 ff 74 07 31 f6 e8 be [433949.008413] RIP [] free_pipe_info+0x54/0x85 [433949.014335] RSP [433949.017850] CR2: 0001000d [433949.021820] ---[ end trace 95f92776b6f34085 ]--- For this one, the IP maps to fs/pipe.c:651 642 void free_pipe_info(struct pipe_inode_info *pipe) 643 { 644 int i; 645 646 account_pipe_buffers(pipe, pipe->buffers, 0); 647 free_uid(pipe->user); 648 for (i = 0; i < pipe->buffers; i++) { 649 struct pipe_buffer *buf = pipe->bufs + i; 650 if (buf->ops) 651 buf->ops->release(pipe, buf); 652 } 653 if (pipe->tmp_page) 654 __free_page(pipe->tmp_page); 655 kfree(pipe->bufs); 656 kfree(pipe); 657 } And the faulty instruction is: All code 0: 8b bb 88 00 00 00 mov0x88(%rbx),%edi 6: e8 73 a5 ef ff callq 0xffefa57e b: 44 3b 63 48 cmp0x48(%rbx),%r12d f: 73 22 jae0x33 11: 49 63 f4movslq %r12d,%rsi 14: 48 6b f6 28 imul $0x28,%rsi,%rsi 18: 48 03 b3 80 00 00 00add0x80(%rbx),%rsi 1f: 48 8b 46 10 mov0x10(%rsi),%rax 23: 48 85 c0test %rax,%rax 26: 74 06 je 0x2e 28: 48 89 dfmov%rbx,%rdi 2b:* ff 50 10callq *0x10(%rax) <-- trapping instruction 2e: 41 ff c4inc%r12d 31: eb d8 jmp0xb 33: 48 8b 7b 68 mov0x68(%rbx),%rdi 37: 48 85 fftest %rdi,%rdi 3a: 74 07 je 0x43 3c: 31 f6 xor%esi,%esi It looks like pipe->bufs was already freed for some reason. The code is pretty solid, I can't find out any bug, as there should be only one thread could invoke free_pipe_info(). Kdump is not properly setup on this host, so unfortunately I can't inspect further on struct pipe_inode_info in memory. Although it is too early to blame your commit b0d8d2292160 ("vfs: fix subtle use-after-free of pipe_inode_info"), it is probably related.
Kernel crash in free_pipe_info()
Hi, Linus We got more than a dozen of kernel crashes at free_pipe_info() on our 4.1 kernel, they are all very similar to this one (with slightly different faulty addresses): [433948.755473] BUG: unable to handle kernel paging request at 0001000d [433948.762570] IP: [] free_pipe_info+0x54/0x85 [433948.768493] PGD 0 [433948.770537] Oops: [#1] SMP [433948.773768] Modules linked in: sch_htb cls_basic act_mirred cls_u32 veth sch_ingress fuse cpufreq_ondemand intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support dcdbas microcode sb_edac wmi edac_core i2c_i801 hed lpc_ich mfd_core ioatdma shpchp lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel igb ptp pps_core i2c_algo_bit dca i2c_core ipv6 [433948.813421] CPU: 10 PID: 50667 Comm: rsync Not tainted 4.1.41.el7.x86_64 #1 [433948.821498] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013 [433948.829247] task: 880f47fde240 ti: 880d282d8000 task.ti: 880d282d8000 [433948.836770] RIP: 0010:[] [] free_pipe_info+0x54/0x85 [433948.845088] RSP: 0018:880d282dbd98 EFLAGS: 00010202 [433948.850473] RAX: fffd RBX: 8817e7692240 RCX: 02c5 [433948.857603] RDX: 02c6 RSI: 8817c30be540 RDI: 8817e7692240 [433948.864881] RBP: 880d282dbda8 R08: R09: [433948.871988] R10: 880c474a0c60 R11: 8812c3c44310 R12: 0008 [433948.879224] R13: 88182d8a9100 R14: 880c47b981a0 R15: 88182d8a9100 [433948.886514] FS: () GS:88184ee8() knlGS: [433948.894678] CS: 0010 DS: ES: CR0: 80050033 [433948.900609] CR2: 0001000d CR3: 01a0b000 CR4: 000406e0 [433948.907840] Stack: [433948.909876] 88182d8a9100 8817e7692240 880d282dbdc8 8117668e [433948.917331] 8817e7692240 8812c3c44300 880d282dbdf8 8117672a [433948.924799] 8812c3c44300 0010 88182d8a9100 [433948.932234] Call Trace: [433948.934780] [] put_pipe_info+0x51/0x56 [433948.940320] [] pipe_release+0x97/0xa1 [433948.945734] [] __fput+0xf9/0x1b8 [433948.950747] [] fput+0xe/0x10 [433948.955665] [] task_work_run+0x8a/0xa1 [433948.961121] [] do_exit+0x400/0x8ba [433948.966224] [] ? fdput_pos.isra.10+0x29/0x30 [433948.972306] [] do_group_exit+0x44/0x98 [433948.977847] [] SyS_exit_group+0x14/0x14 [433948.983423] [] system_call_fastpath+0x12/0x6a [433948.989477] Code: 8b bb 88 00 00 00 e8 73 a5 ef ff 44 3b 63 48 73 22 49 63 f4 48 6b f6 28 48 03 b3 80 00 00 00 48 8b 46 10 48 85 c0 74 06 48 89 df 50 10 41 ff c4 eb d8 48 8b 7b 68 48 85 ff 74 07 31 f6 e8 be [433949.008413] RIP [] free_pipe_info+0x54/0x85 [433949.014335] RSP [433949.017850] CR2: 0001000d [433949.021820] ---[ end trace 95f92776b6f34085 ]--- For this one, the IP maps to fs/pipe.c:651 642 void free_pipe_info(struct pipe_inode_info *pipe) 643 { 644 int i; 645 646 account_pipe_buffers(pipe, pipe->buffers, 0); 647 free_uid(pipe->user); 648 for (i = 0; i < pipe->buffers; i++) { 649 struct pipe_buffer *buf = pipe->bufs + i; 650 if (buf->ops) 651 buf->ops->release(pipe, buf); 652 } 653 if (pipe->tmp_page) 654 __free_page(pipe->tmp_page); 655 kfree(pipe->bufs); 656 kfree(pipe); 657 } And the faulty instruction is: All code 0: 8b bb 88 00 00 00 mov0x88(%rbx),%edi 6: e8 73 a5 ef ff callq 0xffefa57e b: 44 3b 63 48 cmp0x48(%rbx),%r12d f: 73 22 jae0x33 11: 49 63 f4movslq %r12d,%rsi 14: 48 6b f6 28 imul $0x28,%rsi,%rsi 18: 48 03 b3 80 00 00 00add0x80(%rbx),%rsi 1f: 48 8b 46 10 mov0x10(%rsi),%rax 23: 48 85 c0test %rax,%rax 26: 74 06 je 0x2e 28: 48 89 dfmov%rbx,%rdi 2b:* ff 50 10callq *0x10(%rax) <-- trapping instruction 2e: 41 ff c4inc%r12d 31: eb d8 jmp0xb 33: 48 8b 7b 68 mov0x68(%rbx),%rdi 37: 48 85 fftest %rdi,%rdi 3a: 74 07 je 0x43 3c: 31 f6 xor%esi,%esi It looks like pipe->bufs was already freed for some reason. The code is pretty solid, I can't find out any bug, as there should be only one thread could invoke free_pipe_info(). Kdump is not properly setup on this host, so unfortunately I can't inspect further on struct pipe_inode_info in memory. Although it is too early to blame your commit b0d8d2292160 ("vfs: fix subtle use-after-free of pipe_inode_info"), it is probably related.