Re: ipc,sem: sysv semaphore scalability
On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote: > On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: > > Blockconsole currently lives here: > > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ > > Tja, if only that were upstream... Linus has a pull request. If he prefers to change code until the relevant bits fit into 80x25, that is his choice. ;) Jörn -- The only good bug is a dead bug. -- Starship Troopers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote: On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: Blockconsole currently lives here: https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Tja, if only that were upstream... Linus has a pull request. If he prefers to change code until the relevant bits fit into 80x25, that is his choice. ;) Jörn -- The only good bug is a dead bug. -- Starship Troopers -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: > Blockconsole currently lives here: > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Tja, if only that were upstream... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote: > > I could reproduce it but could you please let me know what would be > the right tools I should use to catch the original oops? > This is what I got but I doubt it will be helpful: > http://i.imgur.com/Mewi1hC.jpg You could use either netconsole or blockconsole. Netconsole requires a second machine to capture the information, blockconsole requires a usb key or something similar to write to. In both cases you will get the entire console output in a file, often including very helpful messages leading up to the crash. Blockconsole currently lives here: https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Jörn -- Unless something dramatically changes, by 2015 we'll be largely wondering what all the fuss surrounding Linux was really about. -- Rob Enderle -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote: I could reproduce it but could you please let me know what would be the right tools I should use to catch the original oops? This is what I got but I doubt it will be helpful: http://i.imgur.com/Mewi1hC.jpg You could use either netconsole or blockconsole. Netconsole requires a second machine to capture the information, blockconsole requires a usb key or something similar to write to. In both cases you will get the entire console output in a file, often including very helpful messages leading up to the crash. Blockconsole currently lives here: https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Jörn -- Unless something dramatically changes, by 2015 we'll be largely wondering what all the fuss surrounding Linux was really about. -- Rob Enderle -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: Blockconsole currently lives here: https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Tja, if only that were upstream... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:01 PM, Dave Jones wrote: On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without incident. > > Normally I see the oops within a minute or so. > > > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 [...snip...] I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave Andrew, I just realized you're still carrying commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17 Author: Andrew Morton Date: Fri Apr 26 10:55:12 2013 +1000 revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton Please drop. As quoted above, the testing on mainline that attributed the observed oops to the reverted patch was due to a config error. As Linus pointed out below, this patch fixes real bugs. On 04/02/2013 03:53 PM, Sasha Levin wrote: > On 04/02/2013 01:52 PM, Linus Torvalds wrote: >> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: >>> >>> By just playing with the 'msgsz' parameter with MSG_COPY set. >> >> Hmm. Looking closer, I suspect you're testing without commit >> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That >> should limit the size passed in to prepare_copy -> load_copy to >> msg_ctlmax. > > That commit has a revert in the -next trees, do we need a revert > of the revert? > >commit ff6577a3e714ccae02d4400e989762c19c37b0b3 >Author: Andrew Morton >Date: Wed Mar 27 10:24:02 2013 +1100 > >revert "ipc: don't allocate a copy larger than max" > >Revert 88b9e456b164. Dave has confirmed that this was causing oopses >during trinity testing. > >Cc: Peter Hurley >Cc: Stanislav Kinsbursky >Reported-by: Dave Jones >Cc: >Signed-off-by: Andrew Morton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:01 PM, Dave Jones wrote: On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote: On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: Whichever way we go, we should get a wiggle on - this has been hanging around for too long. Dave, do you have time to determine whether reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger than max) fixes things up? Ok, with that reverted it's been grinding away for a few hours without incident. Normally I see the oops within a minute or so. OK, thanks, I queued a revert: From: Andrew Morton a...@linux-foundation.org Subject: revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 [...snip...] I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave Andrew, I just realized you're still carrying commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17 Author: Andrew Morton a...@linux-foundation.org Date: Fri Apr 26 10:55:12 2013 +1000 revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley pe...@hurleysoftware.com Cc: Stanislav Kinsbursky skinsbur...@parallels.com Reported-by: Dave Jones da...@redhat.com Cc: sta...@vger.kernel.org Signed-off-by: Andrew Morton a...@linux-foundation.org Please drop. As quoted above, the testing on mainline that attributed the observed oops to the reverted patch was due to a config error. As Linus pointed out below, this patch fixes real bugs. On 04/02/2013 03:53 PM, Sasha Levin wrote: On 04/02/2013 01:52 PM, Linus Torvalds wrote: On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote: By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 (ipc: don't allocate a copy larger than max). That should limit the size passed in to prepare_copy - load_copy to msg_ctlmax. That commit has a revert in the -next trees, do we need a revert of the revert? commit ff6577a3e714ccae02d4400e989762c19c37b0b3 Author: Andrew Morton a...@linux-foundation.org Date: Wed Mar 27 10:24:02 2013 +1100 revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley pe...@hurleysoftware.com Cc: Stanislav Kinsbursky skinsbur...@parallels.com Reported-by: Dave Jones da...@redhat.com Cc: sta...@vger.kernel.org Signed-off-by: Andrew Morton a...@linux-foundation.org -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton wrote: > Do we need the locking at all? What does it actually do? > > sem_lock_and_putref(sma); > if (sma->sem_perm.deleted) { > sem_unlock(sma, -1); > err = -EIDRM; > goto out_free; > } > sem_unlock(sma, -1); > > We're taking the lock, testing an int and then dropping the lock. > What's the point in that? Rikpoke. The new semctl_main() is now taking a lock, testing sma->sem_perm.deleted then dropping that lock. It looks wrong. What is that lock testing against? What prevents .deleted from changing value 1ns after we dropped that lock? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton a...@linux-foundation.org wrote: Do we need the locking at all? What does it actually do? sem_lock_and_putref(sma); if (sma-sem_perm.deleted) { sem_unlock(sma, -1); err = -EIDRM; goto out_free; } sem_unlock(sma, -1); We're taking the lock, testing an int and then dropping the lock. What's the point in that? Rikpoke. The new semctl_main() is now taking a lock, testing sma-sem_perm.deleted then dropping that lock. It looks wrong. What is that lock testing against? What prevents .deleted from changing value 1ns after we dropped that lock? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote: > On 04/02/2013 01:52 PM, Linus Torvalds wrote: > > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > >> > >> By just playing with the 'msgsz' parameter with MSG_COPY set. > > > > Hmm. Looking closer, I suspect you're testing without commit > > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That > > should limit the size passed in to prepare_copy -> load_copy to > > msg_ctlmax. > > That commit has a revert in the -next trees, do we need a revert > of the revert? Yeah, I told Andrew to drop that, but I think he's travelling. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 04/02/2013 01:52 PM, Linus Torvalds wrote: > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: >> >> By just playing with the 'msgsz' parameter with MSG_COPY set. > > Hmm. Looking closer, I suspect you're testing without commit > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That > should limit the size passed in to prepare_copy -> load_copy to > msg_ctlmax. That commit has a revert in the -next trees, do we need a revert of the revert? commit ff6577a3e714ccae02d4400e989762c19c37b0b3 Author: Andrew Morton Date: Wed Mar 27 10:24:02 2013 +1100 revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > > By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That should limit the size passed in to prepare_copy -> load_copy to msg_ctlmax. Now, I think it's possibly still a good idea to limit bufsz to INT_MAX regardless, but as far as I can see that prepare_copy -> load_copy path is the only place that can get confused. Everybody else uses size_t (or "long" in the case of r_maxsize) as far as I can tell. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > > If you guys are already looking at this, the conversions between size_t, > long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could > trigger anything from: Good catch. Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX". I suspect we should change some of the "int" arguments to "size_t" too so that we don't have these kinds of odd "different routines see different values due to subtle casting errors", but in the end we don't really want to ever help people have these kinds of potential overflow issues. We already limit normal read/write/sendmsg etc to INT_MAX (although we tend to *truncate* it to INT_MAX rather than return an error, but I think the simpler patch here is preferable unless somebody complains). Comments? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:36 PM, Peter Hurley wrote: > On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: >> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: >>> >>> Here's an oops I just hit.. >>> >>> BUG: unable to handle kernel NULL pointer dereference at 000f >>> IP: [] testmsg.isra.5+0x1a/0x60 >> >> Btw, looking at the code leading up to this, what the f*ck is wrong >> with the IPC stuff? >> >> It's using the generic list stuff for most of the lists, but then it >> open-codes the accesses. >> >> So instead of using >> >>for_each_entry(walk_msg, >q_messages, m_list) { >> .. >>} >> >> the ipc/msg.c code does all that by hand, with >> >>tmp = msq->q_messages.next; >>while (tmp != >q_messages) { >> struct msg_msg *walk_msg; >> >> walk_msg = list_entry(tmp, struct msg_msg, m_list); >> ... >> tmp = tmp->next; >>} >> >> Ugh. The code is near unreadable. And then it has magic memory >> barriers etc, implying that it doesn't lock the data structures, but >> no comments about them. See expunge_all() and pipelined_send(). >> >> The code seems entirely random, and it's badly set up (annoyance of >> the day: crazy helper functions in ipc/msgutil.c to make sure that (a) >> you have to spend more effort looking for them, and (b) they won't get >> inlined). >> >> Clearly nobody has cared for the crazy IPC message code in a long time. > > Exactly that's what my patch series does; clean this mess up. > > This is what I wrote to Andrew a couple of days ago. > > On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: > I just figured out how the queue is being corrupted and why my series >> fixes it. >> >> >> With MSG_COPY set, the queue scan can exit with the local variable > 'msg' >> pointing to a real msg if the msg_counter never reaches the > copy_number. >> >> The remaining execution looks like this: >> >> if (!IS_ERR(msg)) { >> >> if (msgflg & MSG_COPY) >> goto out_unlock; >> >> >> out_unlock: >> msg_unlock(msq); >> break; >> } >> } >> if (IS_ERR(msg)) >> >> >> bufsz = msg_handler(); >> free_msg(msg); << msg never unlinked >> >> >> Since the msg should not have been found (because it failed the match >> criteria), the if (!IS_ERR(msg)) clause should never have executed. >> >> That's why my refactor fixes resolve this; because msg is not >> inadvertently treated as a found msg. >> >> But let's be honest; the real bug here is the poor structure of this >> function that even made this possible. The deepest nesting executes a >> goto to a label in the middle of an if clause. Yuck! No wonder this >> thing's fragile. >> >> So my recommendation still stands. The series that fixes this has been >> getting tested in linux-next for a month. Fixing this some other way > is >> just asking for more trouble. >> >> But why not just revert MSG_COPY altogether for 3.9? If you guys are already looking at this, the conversions between size_t, long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could trigger anything from: [ 33.046572] BUG: unable to handle kernel paging request at 88003c2c7000 [ 33.047721] IP: [] bad_from_user+0x4/0x6 [ 33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060 [ 33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 33.050029] Modules linked in: [ 33.050029] CPU 0 [ 33.050029] Pid: 6885, comm: a.out Tainted: GW 3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321 [ 33.050029] RIP: 0010:[] [] bad_from_user+0x4/0x6 [ 33.050029] RSP: 0018:88003462be40 EFLAGS: 00010246 [ 33.050029] RAX: RBX: fffb RCX: ff06ae2b [ 33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000 [ 33.050029] RBP: 88003462be88 R08: 0280 R09: [ 33.050029] R10: R11: R12: fffb [ 33.050029] R13: 7fffed36d2a0 R14: R15: [ 33.050029] FS: 7f6990044700() GS:88003dc0() knlGS: [ 33.050029] CS: 0010 DS: ES: CR0: 80050033 [ 33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0 [ 33.050029] DR0: DR1: DR2: [ 33.050029] DR3: DR6: 0ff0 DR7: 0400 [ 33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 880034cb3000) [ 33.050029] Stack: [ 33.050029] 8192a6a9 88003462be98 88003b331e00 88003ddd01e0 [ 33.050029] 0001 [ 33.050029] 88003462bf68 8192bb34 [ 33.050029] Call Trace: [ 33.050029] [] ?
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:36 PM, Peter Hurley wrote: On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote: Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 Btw, looking at the code leading up to this, what the f*ck is wrong with the IPC stuff? It's using the generic list stuff for most of the lists, but then it open-codes the accesses. So instead of using for_each_entry(walk_msg, msq-q_messages, m_list) { .. } the ipc/msg.c code does all that by hand, with tmp = msq-q_messages.next; while (tmp != msq-q_messages) { struct msg_msg *walk_msg; walk_msg = list_entry(tmp, struct msg_msg, m_list); ... tmp = tmp-next; } Ugh. The code is near unreadable. And then it has magic memory barriers etc, implying that it doesn't lock the data structures, but no comments about them. See expunge_all() and pipelined_send(). The code seems entirely random, and it's badly set up (annoyance of the day: crazy helper functions in ipc/msgutil.c to make sure that (a) you have to spend more effort looking for them, and (b) they won't get inlined). Clearly nobody has cared for the crazy IPC message code in a long time. Exactly that's what my patch series does; clean this mess up. This is what I wrote to Andrew a couple of days ago. On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: I just figured out how the queue is being corrupted and why my series fixes it. With MSG_COPY set, the queue scan can exit with the local variable 'msg' pointing to a real msg if the msg_counter never reaches the copy_number. The remaining execution looks like this: if (!IS_ERR(msg)) { if (msgflg MSG_COPY) goto out_unlock; out_unlock: msg_unlock(msq); break; } } if (IS_ERR(msg)) bufsz = msg_handler(); free_msg(msg); msg never unlinked Since the msg should not have been found (because it failed the match criteria), the if (!IS_ERR(msg)) clause should never have executed. That's why my refactor fixes resolve this; because msg is not inadvertently treated as a found msg. But let's be honest; the real bug here is the poor structure of this function that even made this possible. The deepest nesting executes a goto to a label in the middle of an if clause. Yuck! No wonder this thing's fragile. So my recommendation still stands. The series that fixes this has been getting tested in linux-next for a month. Fixing this some other way is just asking for more trouble. But why not just revert MSG_COPY altogether for 3.9? If you guys are already looking at this, the conversions between size_t, long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could trigger anything from: [ 33.046572] BUG: unable to handle kernel paging request at 88003c2c7000 [ 33.047721] IP: [83dbcb34] bad_from_user+0x4/0x6 [ 33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060 [ 33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 33.050029] Modules linked in: [ 33.050029] CPU 0 [ 33.050029] Pid: 6885, comm: a.out Tainted: GW 3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321 [ 33.050029] RIP: 0010:[83dbcb34] [83dbcb34] bad_from_user+0x4/0x6 [ 33.050029] RSP: 0018:88003462be40 EFLAGS: 00010246 [ 33.050029] RAX: RBX: fffb RCX: ff06ae2b [ 33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000 [ 33.050029] RBP: 88003462be88 R08: 0280 R09: [ 33.050029] R10: R11: R12: fffb [ 33.050029] R13: 7fffed36d2a0 R14: R15: [ 33.050029] FS: 7f6990044700() GS:88003dc0() knlGS: [ 33.050029] CS: 0010 DS: ES: CR0: 80050033 [ 33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0 [ 33.050029] DR0: DR1: DR2: [ 33.050029] DR3: DR6: 0ff0 DR7: 0400 [ 33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 880034cb3000) [ 33.050029] Stack: [ 33.050029] 8192a6a9 88003462be98 88003b331e00 88003ddd01e0 [ 33.050029] 0001 [ 33.050029] 88003462bf68 8192bb34 [ 33.050029] Call Trace: [ 33.050029] [8192a6a9] ? load_msg+0x59/0x100 [ 33.050029] [8192bb34] do_msgrcv+0x74/0x5b0 [ 33.050029]
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote: If you guys are already looking at this, the conversions between size_t, long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could trigger anything from: Good catch. Let's just change the (long)bufsz 0 into bufsz INT_MAX. I suspect we should change some of the int arguments to size_t too so that we don't have these kinds of odd different routines see different values due to subtle casting errors, but in the end we don't really want to ever help people have these kinds of potential overflow issues. We already limit normal read/write/sendmsg etc to INT_MAX (although we tend to *truncate* it to INT_MAX rather than return an error, but I think the simpler patch here is preferable unless somebody complains). Comments? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote: By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 (ipc: don't allocate a copy larger than max). That should limit the size passed in to prepare_copy - load_copy to msg_ctlmax. Now, I think it's possibly still a good idea to limit bufsz to INT_MAX regardless, but as far as I can see that prepare_copy - load_copy path is the only place that can get confused. Everybody else uses size_t (or long in the case of r_maxsize) as far as I can tell. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 04/02/2013 01:52 PM, Linus Torvalds wrote: On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote: By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 (ipc: don't allocate a copy larger than max). That should limit the size passed in to prepare_copy - load_copy to msg_ctlmax. That commit has a revert in the -next trees, do we need a revert of the revert? commit ff6577a3e714ccae02d4400e989762c19c37b0b3 Author: Andrew Morton a...@linux-foundation.org Date: Wed Mar 27 10:24:02 2013 +1100 revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley pe...@hurleysoftware.com Cc: Stanislav Kinsbursky skinsbur...@parallels.com Reported-by: Dave Jones da...@redhat.com Cc: sta...@vger.kernel.org Signed-off-by: Andrew Morton a...@linux-foundation.org Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote: On 04/02/2013 01:52 PM, Linus Torvalds wrote: On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote: By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 (ipc: don't allocate a copy larger than max). That should limit the size passed in to prepare_copy - load_copy to msg_ctlmax. That commit has a revert in the -next trees, do we need a revert of the revert? Yeah, I told Andrew to drop that, but I think he's travelling. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
29.03.2013 22:43, Linus Torvalds пишет: On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > >Now that I have that reverted, I'm not seeing msgrcv traces any more, but >I've started seeing this.. > >general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the "copy" allocation if mode == SEARCH_LESSEQUAL Hello, Linus. Sorry, but I don't see copy allocation leak. Dummy message is allocated always in msgflg has MSG_COPY flag being set. Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0. but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? I don't see, how this patch can help. And we should not release it until copy is done in msg_handler, because msg is equal to copy. Dummy copy message is release either by free_copy() (if msg is error) or free_msg(). But there are two issues here definitely: 1) Poor SELinux support for message copying. This issue was addressed by Stephen Smalley here: https://lkml.org/lkml/2013/2/6/663 But look like he didn't sent the patch to Andrew. 2) Copy leak and queue corruption in case of copy message wasn't found (this was mentioned by Peter in another thread; thanks for catching this, Peter!), because msg will be a valid pointer and all the "message copy" clean up logic doesn't work. I like Peter's cleanup and fix series. But if it looks like too much changes for this old code, I have another small patch, which should fix the issue: ipc: set msg back to -EAGAIN if copy wasn't performed Make sure, that msg pointer is set back to error value in case of MSG_COPY flag is set and desired message to copy wasn't found. This garantees, that msg is either a error pointer or a copy address. Otherwise last message in queue will be freed without unlinking from queue (which leads to memory corruption) plus dummy allocated copy won't be released. Signed-off-by: Stanislav Kinsbursky diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf..fede1d0 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, goto out_unlock; break; } + msg = ERR_PTR(-EAGAIN); } else break; msg_counter++; Linus patch.diff ipc/msg.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..b841508556cb 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, msg = copy_msg(msg, copy); if (IS_ERR(msg)) goto out_unlock; + copy = NULL; break; } } else @@ -976,10 +977,9 @@ out_unlock: break; } } - if (IS_ERR(msg)) { - free_copy(copy); + free_copy(copy); + if (IS_ERR(msg)) return PTR_ERR(msg); - } bufsz = msg_handler(buf, msg, bufsz); free_msg(msg); -- Best regards, Stanislav Kinsbursky -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
29.03.2013 22:43, Linus Torvalds пишет: On Fri, Mar 29, 2013 at 9:17 AM, Dave Jonesda...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the copy allocation if mode == SEARCH_LESSEQUAL Hello, Linus. Sorry, but I don't see copy allocation leak. Dummy message is allocated always in msgflg has MSG_COPY flag being set. Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0. but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? I don't see, how this patch can help. And we should not release it until copy is done in msg_handler, because msg is equal to copy. Dummy copy message is release either by free_copy() (if msg is error) or free_msg(). But there are two issues here definitely: 1) Poor SELinux support for message copying. This issue was addressed by Stephen Smalley here: https://lkml.org/lkml/2013/2/6/663 But look like he didn't sent the patch to Andrew. 2) Copy leak and queue corruption in case of copy message wasn't found (this was mentioned by Peter in another thread; thanks for catching this, Peter!), because msg will be a valid pointer and all the message copy clean up logic doesn't work. I like Peter's cleanup and fix series. But if it looks like too much changes for this old code, I have another small patch, which should fix the issue: ipc: set msg back to -EAGAIN if copy wasn't performed Make sure, that msg pointer is set back to error value in case of MSG_COPY flag is set and desired message to copy wasn't found. This garantees, that msg is either a error pointer or a copy address. Otherwise last message in queue will be freed without unlinking from queue (which leads to memory corruption) plus dummy allocated copy won't be released. Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf..fede1d0 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, goto out_unlock; break; } + msg = ERR_PTR(-EAGAIN); } else break; msg_counter++; Linus patch.diff ipc/msg.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..b841508556cb 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, msg = copy_msg(msg, copy); if (IS_ERR(msg)) goto out_unlock; + copy = NULL; break; } } else @@ -976,10 +977,9 @@ out_unlock: break; } } - if (IS_ERR(msg)) { - free_copy(copy); + free_copy(copy); + if (IS_ERR(msg)) return PTR_ERR(msg); - } bufsz = msg_handler(buf, msg, bufsz); free_msg(msg); -- Best regards, Stanislav Kinsbursky -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel wrote: > > Should we use "semid" here, like Linus suggested, instead of "un->semid"? As Davidlohr noted, in linux-next the rcu read-lock is held over the whole thing, so no, un->semid should be stable once "un" has been re-looked-up under the semaphore lock. In mainline, the problem is that the "sem_lock_check()" is done with "un->semid" *after* we've dropped the RCU read-lock, so "un" at that point is not reliable (it could be free'd at any time underneath us). That said, I really *really* hate what both mainline and linux-next do with the RCU read lock, and linux-next is arguably worse. The whole "take the RCU lock in one place, and release it in another" is confusing and bug-prone as hell. And linux-next made it worse: now sem_lock() no longer takes the read-lock (it expects the caller to take it), but sem_unlock() still drops the read-lock. This is all just f*cking crazy. The rule should be that the rcu read-lock is always and released at the same "level". For example, find_alloc_undo() should just be called with (and unconditionaly return with) the rcu read-lock held, and if it needs to actually do an allocation, it can drop the rcu lock for the duration of the allocation. This whole "conditional locking" depending on error returns and on whether we have undo's etc is bug-prone and confusing. And when you have totally different locking rules for "sem_lock()" vs "sem_unlock()", you know you're confused. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso wrote: > Specially dropping the rcu read lock before the continue statement > (sorry for not mentioning this in the last email). I was missing this indeed, thanks. Still the same issues however... I'll do some more testing on the same machine but with a totally different environment, within tomorrow hopefully. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/31/2013 01:01 AM, Davidlohr Bueso wrote: diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp->list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); Should we use "semid" here, like Linus suggested, instead of "un->semid"? - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/31/2013 01:01 AM, Davidlohr Bueso wrote: diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp-list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid); Should we use semid here, like Linus suggested, instead of un-semid? - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- All rights reversed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso davidlohr.bu...@hp.com wrote: Specially dropping the rcu read lock before the continue statement (sorry for not mentioning this in the last email). I was missing this indeed, thanks. Still the same issues however... I'll do some more testing on the same machine but with a totally different environment, within tomorrow hopefully. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel r...@surriel.com wrote: Should we use semid here, like Linus suggested, instead of un-semid? As Davidlohr noted, in linux-next the rcu read-lock is held over the whole thing, so no, un-semid should be stable once un has been re-looked-up under the semaphore lock. In mainline, the problem is that the sem_lock_check() is done with un-semid *after* we've dropped the RCU read-lock, so un at that point is not reliable (it could be free'd at any time underneath us). That said, I really *really* hate what both mainline and linux-next do with the RCU read lock, and linux-next is arguably worse. The whole take the RCU lock in one place, and release it in another is confusing and bug-prone as hell. And linux-next made it worse: now sem_lock() no longer takes the read-lock (it expects the caller to take it), but sem_unlock() still drops the read-lock. This is all just f*cking crazy. The rule should be that the rcu read-lock is always and released at the same level. For example, find_alloc_undo() should just be called with (and unconditionaly return with) the rcu read-lock held, and if it needs to actually do an allocation, it can drop the rcu lock for the duration of the allocation. This whole conditional locking depending on error returns and on whether we have undo's etc is bug-prone and confusing. And when you have totally different locking rules for sem_lock() vs sem_unlock(), you know you're confused. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote: > On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds > wrote: > > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty > > wrote: > >> > >> Then I start building a random package and the problems start. They > >> may also happen without compiling but this seems to trigger the bug > >> quite quickly. > > > > I suspect it's about preemption, and the build just results in enough > > scheduling load that you start hitting whatever race there is. > > > >> Anyway, some progress here, I hope: dmesg seems to be > >> willing to reveal some secrets (using some pastebin service since this > >> is pretty big): > >> > >> https://gist.github.com/anonymous/5275120 > > > > That looks like exactly the exit_sem() bug that Davidlohr was talking > > about, where the > > > > /* exit_sem raced with IPC_RMID, nothing to do */ > > if (IS_ERR(sma)) > > continue; > > > > should be moved to *before* the > > > > sem_lock(sma, NULL, -1); > > > > call. And apparently the bug I had found is already fixed in -next. > > I just tried the 7 original patches + the 2 one liners from -next + > modified Linus' patch (attached) on the top of 3.9-rc4 using > PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained > above. I was building two packages at the same time, went away for 30 > seconds, came back and everything froze as soon as I touched the > laptop's touchpad. Maybe a coincidence but anyway... Another shot in > the dark, I had this weird message when trying to build gcc: > semop(2): encountered an error: Identifier removed *sigh*. I had high hopes for this being the bug triggering your issue, specially after seeing exit_sem() in the trace. Emmanuel, just to be sure, does your changes reflect the patch below? Specially dropping the rcu read lock before the continue statement (sorry for not mentioning this in the last email). Anyway, this is still a bug. Andrew, the patch below applies to linux-next, please queue this up if you don't have any objections. Thanks, Davidlohr ---8<--- From: Davidlohr Bueso Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling sem_lock() immediately after obtaining sma. However, if sma isn't valid, then calling sem_lock() will tend to do bad things. Move the sma error check right after the sem_obtain_object_check() call instead. Signed-off-by: Davidlohr Bueso --- ipc/sem.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp->list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty > wrote: >> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds >>> >>> This came from the gcc build? >> >> yes, very early in the build process, IIRC this line was repeated a >> few times and the build just stalled. > > Ok, we're bringing out the crazy hacks now. > > The attached patch is just insane, doesn't really even work in > general, and only even compiles on 64-bit. But it should work in > *practice* to find if somebody adds the same RCU head to the RCU lists > twice, and ignore the second time it happens (and give a warning that > hopefully pinpoints the backtrace). > > It's ugly. It's broken. It may not work. In other words, I'm not proud > of it. But you seem to be the only one able to trigger the issue > easily, willing to try crazy crap, so "tag, you're it". Maybe this > gives us more information. And maybe it doesn't, and I'm totally wrong > about the whole "rcu head added twice" theory. That's all I could get so far: https://gist.github.com/anonymous/5279255 Losing wireless is generally the start signal of controlled demolition of the machine. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty wrote: > On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds >> >> This came from the gcc build? > > yes, very early in the build process, IIRC this line was repeated a > few times and the build just stalled. Ok, we're bringing out the crazy hacks now. The attached patch is just insane, doesn't really even work in general, and only even compiles on 64-bit. But it should work in *practice* to find if somebody adds the same RCU head to the RCU lists twice, and ignore the second time it happens (and give a warning that hopefully pinpoints the backtrace). It's ugly. It's broken. It may not work. In other words, I'm not proud of it. But you seem to be the only one able to trigger the issue easily, willing to try crazy crap, so "tag, you're it". Maybe this gives us more information. And maybe it doesn't, and I'm totally wrong about the whole "rcu head added twice" theory. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty benist...@gmail.com wrote: On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds This came from the gcc build? yes, very early in the build process, IIRC this line was repeated a few times and the build just stalled. Ok, we're bringing out the crazy hacks now. The attached patch is just insane, doesn't really even work in general, and only even compiles on 64-bit. But it should work in *practice* to find if somebody adds the same RCU head to the RCU lists twice, and ignore the second time it happens (and give a warning that hopefully pinpoints the backtrace). It's ugly. It's broken. It may not work. In other words, I'm not proud of it. But you seem to be the only one able to trigger the issue easily, willing to try crazy crap, so tag, you're it. Maybe this gives us more information. And maybe it doesn't, and I'm totally wrong about the whole rcu head added twice theory. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty benist...@gmail.com wrote: On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds This came from the gcc build? yes, very early in the build process, IIRC this line was repeated a few times and the build just stalled. Ok, we're bringing out the crazy hacks now. The attached patch is just insane, doesn't really even work in general, and only even compiles on 64-bit. But it should work in *practice* to find if somebody adds the same RCU head to the RCU lists twice, and ignore the second time it happens (and give a warning that hopefully pinpoints the backtrace). It's ugly. It's broken. It may not work. In other words, I'm not proud of it. But you seem to be the only one able to trigger the issue easily, willing to try crazy crap, so tag, you're it. Maybe this gives us more information. And maybe it doesn't, and I'm totally wrong about the whole rcu head added twice theory. That's all I could get so far: https://gist.github.com/anonymous/5279255 Losing wireless is generally the start signal of controlled demolition of the machine. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote: On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com wrote: Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. I suspect it's about preemption, and the build just results in enough scheduling load that you start hitting whatever race there is. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 That looks like exactly the exit_sem() bug that Davidlohr was talking about, where the /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; should be moved to *before* the sem_lock(sma, NULL, -1); call. And apparently the bug I had found is already fixed in -next. I just tried the 7 original patches + the 2 one liners from -next + modified Linus' patch (attached) on the top of 3.9-rc4 using PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained above. I was building two packages at the same time, went away for 30 seconds, came back and everything froze as soon as I touched the laptop's touchpad. Maybe a coincidence but anyway... Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed *sigh*. I had high hopes for this being the bug triggering your issue, specially after seeing exit_sem() in the trace. Emmanuel, just to be sure, does your changes reflect the patch below? Specially dropping the rcu read lock before the continue statement (sorry for not mentioning this in the last email). Anyway, this is still a bug. Andrew, the patch below applies to linux-next, please queue this up if you don't have any objections. Thanks, Davidlohr ---8--- From: Davidlohr Bueso davidlohr.bu...@hp.com Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma In exit_sem() we attempt to acquire the sma-sem_perm.lock by calling sem_lock() immediately after obtaining sma. However, if sma isn't valid, then calling sem_lock() will tend to do bad things. Move the sma error check right after the sem_obtain_object_check() call instead. Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com --- ipc/sem.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp-list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid); - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds wrote: >> Another shot in >> the dark, I had this weird message when trying to build gcc: >> semop(2): encountered an error: Identifier removed > > This came from the gcc build? yes, very early in the build process, IIRC this line was repeated a few times and the build just stalled. > That's just crazy. No normal app uses sysv semaphores. I have an older > gcc build environment, and some grepping shows it has some ipc > semaphore use in the libstdc++ testsuite, and some libmudflap hooks, > but that should be very very minor. > I notice that your dmesg says that your kernel is compiled by > gcc-4.8.1 prerelease. Is there any chance that you could try to > install a known-stable gcc, like 4.7.2 or something. It's entirely > possible that it's a kernel bug even if it's triggered by some more > aggressive compiler optimization or something, but it would be really > good to try to see if this might be gcc-specific. I could build a kernel on another machine on which 4.7.2 is installed, kernel oops'd as well: http://i.imgur.com/uk6gmq1.jpg FWIW, I have a few things disabled in my config so here is the one I used, maybe I'm missing something (but again, everything works perfectly with your tree): https://gist.github.com/anonymous/5275566 Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same issues, unsurprisingly I guess. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty wrote: > > I just tried the 7 original patches + the 2 one liners from -next + > modified Linus' patch (attached) .. that patch looks fine. > on the top of 3.9-rc4 using > PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained > above. I was building two packages at the same time, went away for 30 > seconds, came back and everything froze as soon as I touched the > laptop's touchpad. Maybe a coincidence but anyway... Another shot in > the dark, I had this weird message when trying to build gcc: > semop(2): encountered an error: Identifier removed This came from the gcc build? That's just crazy. No normal app uses sysv semaphores. I have an older gcc build environment, and some grepping shows it has some ipc semaphore use in the libstdc++ testsuite, and some libmudflap hooks, but that should be very very minor. You seem to trigger errors really trivially easily, which is really odd. It's sounding less and less like some subtle race, and more like the error just happens all the time. If you can make even the gcc build generate errors, I don't think they can be some rare blue-moon thing. I notice that your dmesg says that your kernel is compiled by gcc-4.8.1 prerelease. Is there any chance that you could try to install a known-stable gcc, like 4.7.2 or something. It's entirely possible that it's a kernel bug even if it's triggered by some more aggressive compiler optimization or something, but it would be really good to try to see if this might be gcc-specific. For example, I wonder if your gcc might miscompile idr_alloc() or something, so that we get the same ID for different ipc objects. That would certainly potentially cause chaos. Hmm? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty wrote: >> >> Then I start building a random package and the problems start. They >> may also happen without compiling but this seems to trigger the bug >> quite quickly. > > I suspect it's about preemption, and the build just results in enough > scheduling load that you start hitting whatever race there is. > >> Anyway, some progress here, I hope: dmesg seems to be >> willing to reveal some secrets (using some pastebin service since this >> is pretty big): >> >> https://gist.github.com/anonymous/5275120 > > That looks like exactly the exit_sem() bug that Davidlohr was talking > about, where the > > /* exit_sem raced with IPC_RMID, nothing to do */ > if (IS_ERR(sma)) > continue; > > should be moved to *before* the > > sem_lock(sma, NULL, -1); > > call. And apparently the bug I had found is already fixed in -next. I just tried the 7 original patches + the 2 one liners from -next + modified Linus' patch (attached) on the top of 3.9-rc4 using PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained above. I was building two packages at the same time, went away for 30 seconds, came back and everything froze as soon as I touched the laptop's touchpad. Maybe a coincidence but anyway... Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty wrote: > > Then I start building a random package and the problems start. They > may also happen without compiling but this seems to trigger the bug > quite quickly. I suspect it's about preemption, and the build just results in enough scheduling load that you start hitting whatever race there is. > Anyway, some progress here, I hope: dmesg seems to be > willing to reveal some secrets (using some pastebin service since this > is pretty big): > > https://gist.github.com/anonymous/5275120 That looks like exactly the exit_sem() bug that Davidlohr was talking about, where the /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; should be moved to *before* the sem_lock(sma, NULL, -1); call. And apparently the bug I had found is already fixed in -next. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso wrote: > Not sure which one liner you refer to, but, if you haven't already done > so, please try with these fixes (queued in linux-next): > > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 > > I've been trying to reproduce your twilight zone problem on five > different machines now without any luck. Is there anything you're doing > to trigger the issue? Does the machine boot ok and then do weird things, > say after X starts, open some program, etc? I was missing a9cead0, thanks. What I usually do is starting a standard session, which looks like this: init-+-5*[agetty] |-bash---startx---xinit-+-X---2*[{X}] | `-dwm---sh---sleep |-bash---chromium-+-chromium | |-chromium-+-chromium | | `-2*[{chromium}] | |-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]] | `-65*[{chromium}] |-crond |-dbus-daemon |-klogd |-syslogd |-tmux-+-alsamixer | |-bash---bash | |-bash | |-htop | `-newsbeuter---{newsbeuter} |-udevd |-urxvtd-+-bash---pstree |`-bash---tmux `-wpa_supplicant Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty wrote: > > > > I had to slightly modify the patch since it wouldn't match the changes > > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > > hope that was the right thing to do. So, what I tried was: original 7 > > patches + the one liner + your patch blindly modified by me on the top > > of 3.9-rc4 and I'm still having twilight zone issues. > > Ok, please send your patch so that I can double-check what you did, > but it was simple enough that you probably did the right thing. > > Sad. Your case definitely looks like a double rcu-free, as shown by > the fact that when you enabled SLUB debugging the oops happened with > the use-after-free pattern (it's __rcu_reclaim() doing the > "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head" > has already been free'd once). > > So ipc_rcu_putref() and a refcounting error looked very promising.as a > potential explanation. > > The 'un' undo structure is also free'd with rcu, but the locking > around that seems much more robust. The undo entry is on two lists > (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under > ulp->lock). But those locks are actually tested with > assert_spin_locked() in all the relevant places, and the code actually > looks sane. So I had high hopes for ipc_rcu_putref()... > > Hmm. Except for exit_sem() that does odd things. You have preemption > enabled, don't you? exit_sem() does a lookup of the first list_proc > entry under tcy_read_lock to lookup un->semid, and then it drops the > rcu read lock. At which point "un" is no longer reliable, I think. But > then it still uses "un->semid", rather than the stable value it looked > up under the rcu read lock. Which looks bogus. > > So I'd like you to test a few more things: > > (a) In exit_sem(), can you change the > > sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid); > > to use just "semid" rather than "un->semid", because I don't > think "un" is stable here. Well that's not really the case in the new code. We don't drop the rcu read lock until the end of the loop, in sem_unlock(). However, I just noticed that we're checking sma for error after trying to acquire sma->sem_perm.lock: sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); sem_lock(sma, NULL, -1); /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; The IS_ERR(sma) check should be right after the sem_obtain_object_check() call instead. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty wrote: > > I had to slightly modify the patch since it wouldn't match the changes > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > hope that was the right thing to do. So, what I tried was: original 7 > patches + the one liner + your patch blindly modified by me on the top > of 3.9-rc4 and I'm still having twilight zone issues. Ok, please send your patch so that I can double-check what you did, but it was simple enough that you probably did the right thing. Sad. Your case definitely looks like a double rcu-free, as shown by the fact that when you enabled SLUB debugging the oops happened with the use-after-free pattern (it's __rcu_reclaim() doing the "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head" has already been free'd once). So ipc_rcu_putref() and a refcounting error looked very promising.as a potential explanation. The 'un' undo structure is also free'd with rcu, but the locking around that seems much more robust. The undo entry is on two lists (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under ulp->lock). But those locks are actually tested with assert_spin_locked() in all the relevant places, and the code actually looks sane. So I had high hopes for ipc_rcu_putref()... Hmm. Except for exit_sem() that does odd things. You have preemption enabled, don't you? exit_sem() does a lookup of the first list_proc entry under tcy_read_lock to lookup un->semid, and then it drops the rcu read lock. At which point "un" is no longer reliable, I think. But then it still uses "un->semid", rather than the stable value it looked up under the rcu read lock. Which looks bogus. So I'd like you to test a few more things: (a) In exit_sem(), can you change the sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid); to use just "semid" rather than "un->semid", because I don't think "un" is stable here. (b) does the problem go away if you change disable CONFIG_PREEMPT (perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?) Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote: > Hi Linus, > > On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds > wrote: > > Emmanuel, can you try the attached patch? I think it applies cleanly > > on top of the scalability series too without any changes, but I didn't > > check if the patches perhaps changed some of the naming or something. > > I had to slightly modify the patch since it wouldn't match the changes > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > hope that was the right thing to do. So, what I tried was: original 7 > patches + the one liner + your patch blindly modified by me on the top > of 3.9-rc4 and I'm still having twilight zone issues. Not sure which one liner you refer to, but, if you haven't already done so, please try with these fixes (queued in linux-next): http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 I've been trying to reproduce your twilight zone problem on five different machines now without any luck. Is there anything you're doing to trigger the issue? Does the machine boot ok and then do weird things, say after X starts, open some program, etc? Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds wrote: > Emmanuel, can you try the attached patch? I think it applies cleanly > on top of the scalability series too without any changes, but I didn't > check if the patches perhaps changed some of the naming or something. I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds wrote: > > I dunno. I'm still not sure this is triggerable, but it looks bad. But > both the semaphore case and the msg cases seem to be solvable by > moving the unlock down, and shm seem to have no getref/putref users to > race with, so this (whitespace-damaged) patch *may* be sufficient: Well, the patch doesn't seem to cause any problems, at least neither lockdep nor spinlock sleep debugging complains. I have no idea whether it actually fixes any problems, though. I do wonder if this might explain the problem Emmanuel saw. A double free of a RCU-freeable object would possibly result in exactly the kind of mess that Emmanuel reported with the semaphore scalability patches. Emmanuel, can you try the attached patch? I think it applies cleanly on top of the scalability series too without any changes, but I didn't check if the patches perhaps changed some of the naming or something. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds wrote: > > The alternative would be to make sure the thing is always locked (and > in a rcu-read-safe region) before putref/getref. The only place (apart > from the initial allocation, which doesn't matter, because nothing can > see if itf that path fails) seems to be that freeque(), but I didn't > check everything. > > Moving the > > msg_unlock(msq); > > to the end of freeque() might be the way to go. It's going to cause us > to hold the lock for longer, but I'm not sure we care for that path. Uhhuh. I think shm_destroy() has the same pattern. And I think that one has locking reasons why it has to do the shm_unlock() before tearing some things down, although I'm not sure.. The good news is that shm doesn't seem to have any users of ipc_rcu_get/putref(), so I don't see anything to race against. So it has the same buggy pattern, but I don't think it can trigger anything. And ipc/sem.c has the same bug-pattern in freeary(). It does "sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does* seem to have code that that can race against (sem_lock_and_putref()). The whole "sem_putref()" tries to be careful and gets the lock for the last ref, but getting the lock doesn't help when freeary() does the refcount access without it. The semaphore case seems to argue fairly strongly for an "atomic_t refcount", because right now it does a lot of "sem_putref()" stuff that just gets the lock for the putref. So not only is that insufficient (if it races against freeary(), but it's also more expensive than just an atomic refcount would have been. I dunno. I'm still not sure this is triggerable, but it looks bad. But both the semaphore case and the msg cases seem to be solvable by moving the unlock down, and shm seem to have no getref/putref users to race with, so this (whitespace-damaged) patch *may* be sufficient: diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..338d8e2b589b 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) expunge_all(msq, -EIDRM); ss_wakeup(>q_senders, 1); msg_rmid(ns, msq); - msg_unlock(msq); tmp = msq->q_messages.next; while (tmp != >q_messages) { @@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) atomic_sub(msq->q_cbytes, >msg_bytes); security_msg_queue_free(msq); ipc_rcu_putref(msq); + msg_unlock(msq); } /* diff --git a/ipc/sem.c b/ipc/sem.c index 58d31f1c1eb5..1cf024b9eac0 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) /* Remove the semaphore set from the IDR */ sem_rmid(ns, sma); - sem_unlock(sma); wake_up_sem_queue_do(); ns->used_sems -= sma->sem_nsems; security_sem_free(sma); ipc_rcu_putref(sma); + sem_unlock(sma); } static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in, int version) (I didn't check very carefully, it's possible that we end up having some locking problem if we move the unlocks down later, but it *looks* fine) Anybody? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > RIP: 0010:[] [] free_msg+0x2b/0x40 > Call Trace: > [] freeque+0xcf/0x140 > [] msgctl_down.constprop.9+0x183/0x200 > [] sys_msgctl+0x139/0x400 > [] system_call_fastpath+0x16/0x1b > > Looks like seg was already kfree'd. Hmm. I have a suspicion. The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy. In particular, the refcount is not an atomic variable, so we absolutely *depend* on the spinlock for it. However, looking at "freeque()", that's not actually the case. It releases the message queue spinlock *before* it does the ipc_rcu_putref(), and it does that because the thing has become unreachable (it did a msg_rmid(), which will set ->deleted, which in turn will mean that nobody should successfully look it up any more). HOWEVER. While the "deleted" flag is serialized, the actual refcount is not. So in *parallel* with the freeque() call, we may have some other user that does something like ... ipc_rcu_getref(msq); msg_unlock(msq); schedule(); ipc_lock_by_ptr(>q_perm); ipc_rcu_putref(msq); if (msq->q_perm.deleted) { err = -EIDRM; goto out_unlock_free; } ... which got the lock for the "deleted" test, so that part is all fine, but notice the "ipc_rcu_putref()". It can happen at the same time that freeque() also does its own ipc_rcu_putref(). So now refcount may get buggered, resulting in random early reuse, double free's or leaking of the msq. There may be some reason I don't see why this cannot happen, but it does look suspicious. I wonder if the refcount should be an atomic value. The alternative would be to make sure the thing is always locked (and in a rcu-read-safe region) before putref/getref. The only place (apart from the initial allocation, which doesn't matter, because nothing can see if itf that path fails) seems to be that freeque(), but I didn't check everything. Moving the msg_unlock(msq); to the end of freeque() might be the way to go. It's going to cause us to hold the lock for longer, but I'm not sure we care for that path. Guys, am I missing something? This kind of refcounting problem might well explain the rcu-freeing-time bugs reported with the scalability patches: long-time race that just got *much* easier to expose with the higher level of parallelism? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley wrote: > On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: >> I think I foud at least one bug in the MSG_COPY stuff: it leaks the >> "copy" allocation if >> >> mode == SEARCH_LESSEQUAL >> >> but maybe I'm misreading it. > > Yes, you're misreading it. copy_msg() returns the 'copy' address when > copying is successful. So this patch double-frees 'copy'. No it doesn't. The case where "copy_msg()" *is* called and is successful, msg gets set to "copy", my patch sets "copy" to NULL, so no, it doesn't double-free. That said, it looks like if MSG_COPY is set, we cannot trigger the "mode == SEARCH_LESSEQUAL" case, because it forces msgtyp to 0, which in turn forces mode = SEARCH_ANY. So the actual leak case seems to not be possible, although that's *really* hard to see from looking at the code. The code is just an unreadable mess. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > > > Here's an oops I just hit.. > > > > BUG: unable to handle kernel NULL pointer dereference at 000f > > IP: [] testmsg.isra.5+0x1a/0x60 > > Btw, looking at the code leading up to this, what the f*ck is wrong > with the IPC stuff? > > It's using the generic list stuff for most of the lists, but then it > open-codes the accesses. > > So instead of using > >for_each_entry(walk_msg, >q_messages, m_list) { > .. >} > > the ipc/msg.c code does all that by hand, with > >tmp = msq->q_messages.next; >while (tmp != >q_messages) { > struct msg_msg *walk_msg; > > walk_msg = list_entry(tmp, struct msg_msg, m_list); > ... > tmp = tmp->next; >} > > Ugh. The code is near unreadable. And then it has magic memory > barriers etc, implying that it doesn't lock the data structures, but > no comments about them. See expunge_all() and pipelined_send(). > > The code seems entirely random, and it's badly set up (annoyance of > the day: crazy helper functions in ipc/msgutil.c to make sure that (a) > you have to spend more effort looking for them, and (b) they won't get > inlined). > > Clearly nobody has cared for the crazy IPC message code in a long time. Exactly that's what my patch series does; clean this mess up. This is what I wrote to Andrew a couple of days ago. On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: I just figured out how the queue is being corrupted and why my series > fixes it. > > > With MSG_COPY set, the queue scan can exit with the local variable 'msg' > pointing to a real msg if the msg_counter never reaches the copy_number. > > The remaining execution looks like this: > > if (!IS_ERR(msg)) { > > if (msgflg & MSG_COPY) > goto out_unlock; > > > out_unlock: > msg_unlock(msq); > break; > } > } > if (IS_ERR(msg)) > > > bufsz = msg_handler(); > free_msg(msg); << msg never unlinked > > > Since the msg should not have been found (because it failed the match > criteria), the if (!IS_ERR(msg)) clause should never have executed. > > That's why my refactor fixes resolve this; because msg is not > inadvertently treated as a found msg. > > But let's be honest; the real bug here is the poor structure of this > function that even made this possible. The deepest nesting executes a > goto to a label in the middle of an if clause. Yuck! No wonder this > thing's fragile. > > So my recommendation still stands. The series that fixes this has been > getting tested in linux-next for a month. Fixing this some other way is > just asking for more trouble. > > But why not just revert MSG_COPY altogether for 3.9? Regards, Peter Hurley -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: > I think I foud at least one bug in the MSG_COPY stuff: it leaks the > "copy" allocation if > > mode == SEARCH_LESSEQUAL > > but maybe I'm misreading it. Yes, you're misreading it. copy_msg() returns the 'copy' address when copying is successful. So this patch double-frees 'copy'. Andrew has had the patches that fix this for a month but because they're fairly extensive he didn't want to apply them for 3.9. Regards, Peter -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > Here's an oops I just hit.. > > BUG: unable to handle kernel NULL pointer dereference at 000f > IP: [] testmsg.isra.5+0x1a/0x60 Btw, looking at the code leading up to this, what the f*ck is wrong with the IPC stuff? It's using the generic list stuff for most of the lists, but then it open-codes the accesses. So instead of using for_each_entry(walk_msg, >q_messages, m_list) { .. } the ipc/msg.c code does all that by hand, with tmp = msq->q_messages.next; while (tmp != >q_messages) { struct msg_msg *walk_msg; walk_msg = list_entry(tmp, struct msg_msg, m_list); ... tmp = tmp->next; } Ugh. The code is near unreadable. And then it has magic memory barriers etc, implying that it doesn't lock the data structures, but no comments about them. See expunge_all() and pipelined_send(). The code seems entirely random, and it's badly set up (annoyance of the day: crazy helper functions in ipc/msgutil.c to make sure that (a) you have to spend more effort looking for them, and (b) they won't get inlined). Clearly nobody has cared for the crazy IPC message code in a long time. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > Btw, something that's really bothering me is just how much bogus > 'follow-on' spew we have lately. I'm not sure what changed, but it > seems to have gotten worse. .. we have many more sanity checks, and they tend to trigger in the case of us killing somebody holding locks etc. > Related: is there any value in printing out all the ? symbols on > kernels with frame pointers enabled ? I've found them useful. You can often see what happened just before. Of course, I'd still prefer gcc not generate unnecessarily big stack frames (which is one of the bigger reasons for old stuff shining through) but as things are, they often give hints about what happened just before. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > > I've started seeing this.. > > > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you > disable it? > > I think I foud at least one bug in the MSG_COPY stuff: it leaks the > "copy" allocation if > > mode == SEARCH_LESSEQUAL > > but maybe I'm misreading it. And that shouldn't cause the problem you > see, but it's indicative of how badly tested and thought through the > MSG_COPY code is. > > Totally UNTESTED leak fix appended. Stanislav? I'll give it a shot. Btw, something that's really bothering me is just how much bogus 'follow-on' spew we have lately. I'm not sure what changed, but it seems to have gotten worse. Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [] testmsg.isra.5+0x1a/0x60 RSP CR2: 000f ---[ end trace 8f0713d2aacb6aa3 ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:20 in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20 INFO: lockdep is turned off. Pid: 958, comm: trinity-child20 Tainted: G D 3.9.0-rc4+ #7 Call Trace: [] __might_sleep+0x145/0x200 [] down_read+0x2a/0xa0 [] exit_signals+0x24/0x130 [] do_exit+0xbc/0xd10 [] ? kmsg_dump+0x1b5/0x230 [] ? kmsg_dump+0x25/0x230 [] oops_end+0x9c/0xe0 [] no_context+0x268/0x275 [] __bad_area_nosemaphore+0x78/0x1d1 [] bad_area_nosemaphore+0x13/0x15 [] __do_page_fault+0x366/0x5b0 [] ? __lock_acquire+0x2e5/0x1a00 [] ? trace_hardirqs_off_caller+0x28/0xc0 [] ? trace_hardirqs_off_thunk+0x3a/0x3c [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 [] ? testmsg.isra.5+0x1a/0x60 [] ? security_msg_queue_msgrcv+0x16/0x20 [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b note: trinity-child20[958] exited with preempt_count 1 BUG: scheduling while atomic: trinity-child20/958/0x1002 INFO: lockdep is turned off. Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without > > incident. > > Normally I see the oops within a minute or so. > > > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [] testmsg.isra.5+0x1a/0x60 I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the "copy" allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones wrote: > > mainline. Your current tree. Ok, that's what I thought you were generally testing, just wanted to verify. The subject kind of implied otherwise.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > > I've started seeing this.. > > > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > > > Looks like seg was already kfree'd. > > Just to clarify: is this you testing -mm that has Davidlohr's and > Rik's scalability patches? Or mainline? mainline. Your current tree. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > Looks like seg was already kfree'd. Just to clarify: is this you testing -mm that has Davidlohr's and Rik's scalability patches? Or mainline? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without > > incident. > > Normally I see the oops within a minute or so. > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 3 Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] free_msg+0x2b/0x40 RSP: 0018:8800a1d3bdd0 EFLAGS: 00010202 RAX: RBX: 6b6b6b6b6b6b6b6b RCX: RDX: RSI: RDI: 810b6ced RBP: 8800a1d3bde0 R08: R09: 0001 R10: 0001 R11: 0001 R12: 88009997e620 R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024 FS: 7f2d7be64740() GS:88012ac0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 8800a1e62490) Stack: 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f 81c7ace0 8800caf35928 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40 Call Trace: [] freeque+0xcf/0x140 [] msgctl_down.constprop.9+0x183/0x200 [] ? up_read+0x1f/0x40 [] ? __do_page_fault+0x214/0x5b0 [] ? lock_release_non_nested+0x23e/0x320 [] sys_msgctl+0x139/0x400 [] ? retint_swapgs+0xe/0x13 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] system_call_fastpath+0x16/0x1b Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 <4c> 8b 23 48 89 df e8 7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 (Taint is from an ext4 double-free I just reported in a separate thread) decoded.. 0: 66 66 66 66 90 data32 data32 data32 xchg %ax,%ax 5: 55 push %rbp 6: 48 89 e5mov%rsp,%rbp 9: 41 54 push %r12 b: 49 89 fcmov%rdi,%r12 e: 53 push %rbx f: e8 fc 5e 01 00 callq 0x15f10 14: 49 8b 5c 24 20 mov0x20(%r12),%rbx 19: 4c 89 e7mov%r12,%rdi 1c: e8 8f af ed ff callq 0xffedafb0 21: 48 85 dbtest %rbx,%rbx 24: 75 05 jne0x2b 26: eb 13 jmp0x3b 28: 4c 89 e3mov%r12,%rbx 2b:* 4c 8b 23mov(%rbx),%r12 <-- trapping instruction 2e: 48 89 dfmov%rbx,%rdi 31: e8 7a af ed ff callq 0xffedafb0 36: 4d 85 e4test %r12,%r12 39: 75 ed jne0x28 3b: 5b pop%rbx 3c: 41 5c pop%r12 3e: 5d pop%rbp 3f: c3 retq Disassembly of free_msg shows.. seg = msg->next; kfree(msg); while (seg != NULL) { struct msg_msgseg *tmp = seg->next; 30b: 4c 8b 23mov(%rbx),%r12 kfree(seg); 30e: 48 89 dfmov%rbx,%rdi 311: e8 00 00 00 00 callq 316 Looks like seg was
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote: On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: Whichever way we go, we should get a wiggle on - this has been hanging around for too long. Dave, do you have time to determine whether reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger than max) fixes things up? Ok, with that reverted it's been grinding away for a few hours without incident. Normally I see the oops within a minute or so. OK, thanks, I queued a revert: From: Andrew Morton a...@linux-foundation.org Subject: revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 3 Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[812c20fb] [812c20fb] free_msg+0x2b/0x40 RSP: 0018:8800a1d3bdd0 EFLAGS: 00010202 RAX: RBX: 6b6b6b6b6b6b6b6b RCX: RDX: RSI: RDI: 810b6ced RBP: 8800a1d3bde0 R08: R09: 0001 R10: 0001 R11: 0001 R12: 88009997e620 R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024 FS: 7f2d7be64740() GS:88012ac0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 8800a1e62490) Stack: 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f 81c7ace0 8800caf35928 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40 Call Trace: [812c289f] freeque+0xcf/0x140 [812c2a93] msgctl_down.constprop.9+0x183/0x200 [810767cf] ? up_read+0x1f/0x40 [816c8f94] ? __do_page_fault+0x214/0x5b0 [810b94be] ? lock_release_non_nested+0x23e/0x320 [812c2da9] sys_msgctl+0x139/0x400 [816c5d4d] ? retint_swapgs+0xe/0x13 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f [816cd942] system_call_fastpath+0x16/0x1b Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 4c 8b 23 48 89 df e8 7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 (Taint is from an ext4 double-free I just reported in a separate thread) decoded.. 0: 66 66 66 66 90 data32 data32 data32 xchg %ax,%ax 5: 55 push %rbp 6: 48 89 e5mov%rsp,%rbp 9: 41 54 push %r12 b: 49 89 fcmov%rdi,%r12 e: 53 push %rbx f: e8 fc 5e 01 00 callq 0x15f10 14: 49 8b 5c 24 20 mov0x20(%r12),%rbx 19: 4c 89 e7mov%r12,%rdi 1c: e8 8f af ed ff callq 0xffedafb0 21: 48 85 dbtest %rbx,%rbx 24: 75 05 jne0x2b 26: eb 13 jmp0x3b 28: 4c 89 e3mov%r12,%rbx 2b:* 4c 8b 23mov(%rbx),%r12 -- trapping instruction 2e: 48 89 dfmov%rbx,%rdi 31: e8 7a af ed ff callq 0xffedafb0 36: 4d 85 e4test %r12,%r12 39: 75 ed jne0x28 3b: 5b pop%rbx 3c: 41 5c pop%r12 3e: 5d pop%rbp 3f: c3 retq Disassembly of free_msg shows.. seg = msg-next; kfree(msg); while (seg != NULL) { struct msg_msgseg *tmp = seg-next;
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Looks like seg was already kfree'd. Just to clarify: is this you testing -mm that has Davidlohr's and Rik's scalability patches? Or mainline? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote: On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Looks like seg was already kfree'd. Just to clarify: is this you testing -mm that has Davidlohr's and Rik's scalability patches? Or mainline? mainline. Your current tree. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones da...@redhat.com wrote: mainline. Your current tree. Ok, that's what I thought you were generally testing, just wanted to verify. The subject kind of implied otherwise.. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the copy allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote: On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: Whichever way we go, we should get a wiggle on - this has been hanging around for too long. Dave, do you have time to determine whether reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger than max) fixes things up? Ok, with that reverted it's been grinding away for a few hours without incident. Normally I see the oops within a minute or so. OK, thanks, I queued a revert: From: Andrew Morton a...@linux-foundation.org Subject: revert ipc: don't allocate a copy larger than max Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[812c24ca] [812c24ca] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [812c3746] do_msgrcv+0x1a6/0x5f0 [812c2430] ? msg_security+0x10/0x10 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f [812c3ba5] sys_msgrcv+0x15/0x20 [816cd942] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a 48 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [812c24ca] testmsg.isra.5+0x1a/0x60 I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote: On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the copy allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? I'll give it a shot. Btw, something that's really bothering me is just how much bogus 'follow-on' spew we have lately. I'm not sure what changed, but it seems to have gotten worse. Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[812c24ca] [812c24ca] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [812c3746] do_msgrcv+0x1a6/0x5f0 [812c2430] ? msg_security+0x10/0x10 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f [812c3ba5] sys_msgrcv+0x15/0x20 [816cd942] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a 48 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [812c24ca] testmsg.isra.5+0x1a/0x60 RSP 880117bb5e88 CR2: 000f ---[ end trace 8f0713d2aacb6aa3 ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:20 in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20 INFO: lockdep is turned off. Pid: 958, comm: trinity-child20 Tainted: G D 3.9.0-rc4+ #7 Call Trace: [8107dba5] __might_sleep+0x145/0x200 [816c215a] down_read+0x2a/0xa0 [8105fa14] exit_signals+0x24/0x130 [8104b80c] do_exit+0xbc/0xd10 [81048b25] ? kmsg_dump+0x1b5/0x230 [81048995] ? kmsg_dump+0x25/0x230 [816c6b6c] oops_end+0x9c/0xe0 [816b7a40] no_context+0x268/0x275 [816b7ac5] __bad_area_nosemaphore+0x78/0x1d1 [816b7c31] bad_area_nosemaphore+0x13/0x15 [816c90e6] __do_page_fault+0x366/0x5b0 [810b72b5] ? __lock_acquire+0x2e5/0x1a00 [810b3348] ? trace_hardirqs_off_caller+0x28/0xc0 [8134b3dd] ? trace_hardirqs_off_thunk+0x3a/0x3c [816c933e] do_page_fault+0xe/0x10 [816c5fa2] page_fault+0x22/0x30 [812c24ca] ? testmsg.isra.5+0x1a/0x60 [812d80b6] ? security_msg_queue_msgrcv+0x16/0x20 [812c3746] do_msgrcv+0x1a6/0x5f0 [812c2430] ? msg_security+0x10/0x10 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f [812c3ba5] sys_msgrcv+0x15/0x20 [816cd942]
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote: Btw, something that's really bothering me is just how much bogus 'follow-on' spew we have lately. I'm not sure what changed, but it seems to have gotten worse. .. we have many more sanity checks, and they tend to trigger in the case of us killing somebody holding locks etc. Related: is there any value in printing out all the ? symbols on kernels with frame pointers enabled ? I've found them useful. You can often see what happened just before. Of course, I'd still prefer gcc not generate unnecessarily big stack frames (which is one of the bigger reasons for old stuff shining through) but as things are, they often give hints about what happened just before. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote: Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 Btw, looking at the code leading up to this, what the f*ck is wrong with the IPC stuff? It's using the generic list stuff for most of the lists, but then it open-codes the accesses. So instead of using for_each_entry(walk_msg, msq-q_messages, m_list) { .. } the ipc/msg.c code does all that by hand, with tmp = msq-q_messages.next; while (tmp != msq-q_messages) { struct msg_msg *walk_msg; walk_msg = list_entry(tmp, struct msg_msg, m_list); ... tmp = tmp-next; } Ugh. The code is near unreadable. And then it has magic memory barriers etc, implying that it doesn't lock the data structures, but no comments about them. See expunge_all() and pipelined_send(). The code seems entirely random, and it's badly set up (annoyance of the day: crazy helper functions in ipc/msgutil.c to make sure that (a) you have to spend more effort looking for them, and (b) they won't get inlined). Clearly nobody has cared for the crazy IPC message code in a long time. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: I think I foud at least one bug in the MSG_COPY stuff: it leaks the copy allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. Yes, you're misreading it. copy_msg() returns the 'copy' address when copying is successful. So this patch double-frees 'copy'. Andrew has had the patches that fix this for a month but because they're fairly extensive he didn't want to apply them for 3.9. Regards, Peter -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote: Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [812c24ca] testmsg.isra.5+0x1a/0x60 Btw, looking at the code leading up to this, what the f*ck is wrong with the IPC stuff? It's using the generic list stuff for most of the lists, but then it open-codes the accesses. So instead of using for_each_entry(walk_msg, msq-q_messages, m_list) { .. } the ipc/msg.c code does all that by hand, with tmp = msq-q_messages.next; while (tmp != msq-q_messages) { struct msg_msg *walk_msg; walk_msg = list_entry(tmp, struct msg_msg, m_list); ... tmp = tmp-next; } Ugh. The code is near unreadable. And then it has magic memory barriers etc, implying that it doesn't lock the data structures, but no comments about them. See expunge_all() and pipelined_send(). The code seems entirely random, and it's badly set up (annoyance of the day: crazy helper functions in ipc/msgutil.c to make sure that (a) you have to spend more effort looking for them, and (b) they won't get inlined). Clearly nobody has cared for the crazy IPC message code in a long time. Exactly that's what my patch series does; clean this mess up. This is what I wrote to Andrew a couple of days ago. On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: I just figured out how the queue is being corrupted and why my series fixes it. With MSG_COPY set, the queue scan can exit with the local variable 'msg' pointing to a real msg if the msg_counter never reaches the copy_number. The remaining execution looks like this: if (!IS_ERR(msg)) { if (msgflg MSG_COPY) goto out_unlock; out_unlock: msg_unlock(msq); break; } } if (IS_ERR(msg)) bufsz = msg_handler(); free_msg(msg); msg never unlinked Since the msg should not have been found (because it failed the match criteria), the if (!IS_ERR(msg)) clause should never have executed. That's why my refactor fixes resolve this; because msg is not inadvertently treated as a found msg. But let's be honest; the real bug here is the poor structure of this function that even made this possible. The deepest nesting executes a goto to a label in the middle of an if clause. Yuck! No wonder this thing's fragile. So my recommendation still stands. The series that fixes this has been getting tested in linux-next for a month. Fixing this some other way is just asking for more trouble. But why not just revert MSG_COPY altogether for 3.9? Regards, Peter Hurley -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley pe...@hurleysoftware.com wrote: On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: I think I foud at least one bug in the MSG_COPY stuff: it leaks the copy allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. Yes, you're misreading it. copy_msg() returns the 'copy' address when copying is successful. So this patch double-frees 'copy'. No it doesn't. The case where copy_msg() *is* called and is successful, msg gets set to copy, my patch sets copy to NULL, so no, it doesn't double-free. That said, it looks like if MSG_COPY is set, we cannot trigger the mode == SEARCH_LESSEQUAL case, because it forces msgtyp to 0, which in turn forces mode = SEARCH_ANY. So the actual leak case seems to not be possible, although that's *really* hard to see from looking at the code. The code is just an unreadable mess. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote: Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC RIP: 0010:[812c20fb] [812c20fb] free_msg+0x2b/0x40 Call Trace: [812c289f] freeque+0xcf/0x140 [812c2a93] msgctl_down.constprop.9+0x183/0x200 [812c2da9] sys_msgctl+0x139/0x400 [816cd942] system_call_fastpath+0x16/0x1b Looks like seg was already kfree'd. Hmm. I have a suspicion. The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy. In particular, the refcount is not an atomic variable, so we absolutely *depend* on the spinlock for it. However, looking at freeque(), that's not actually the case. It releases the message queue spinlock *before* it does the ipc_rcu_putref(), and it does that because the thing has become unreachable (it did a msg_rmid(), which will set -deleted, which in turn will mean that nobody should successfully look it up any more). HOWEVER. While the deleted flag is serialized, the actual refcount is not. So in *parallel* with the freeque() call, we may have some other user that does something like ... ipc_rcu_getref(msq); msg_unlock(msq); schedule(); ipc_lock_by_ptr(msq-q_perm); ipc_rcu_putref(msq); if (msq-q_perm.deleted) { err = -EIDRM; goto out_unlock_free; } ... which got the lock for the deleted test, so that part is all fine, but notice the ipc_rcu_putref(). It can happen at the same time that freeque() also does its own ipc_rcu_putref(). So now refcount may get buggered, resulting in random early reuse, double free's or leaking of the msq. There may be some reason I don't see why this cannot happen, but it does look suspicious. I wonder if the refcount should be an atomic value. The alternative would be to make sure the thing is always locked (and in a rcu-read-safe region) before putref/getref. The only place (apart from the initial allocation, which doesn't matter, because nothing can see if itf that path fails) seems to be that freeque(), but I didn't check everything. Moving the msg_unlock(msq); to the end of freeque() might be the way to go. It's going to cause us to hold the lock for longer, but I'm not sure we care for that path. Guys, am I missing something? This kind of refcounting problem might well explain the rcu-freeing-time bugs reported with the scalability patches: long-time race that just got *much* easier to expose with the higher level of parallelism? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: The alternative would be to make sure the thing is always locked (and in a rcu-read-safe region) before putref/getref. The only place (apart from the initial allocation, which doesn't matter, because nothing can see if itf that path fails) seems to be that freeque(), but I didn't check everything. Moving the msg_unlock(msq); to the end of freeque() might be the way to go. It's going to cause us to hold the lock for longer, but I'm not sure we care for that path. Uhhuh. I think shm_destroy() has the same pattern. And I think that one has locking reasons why it has to do the shm_unlock() before tearing some things down, although I'm not sure.. The good news is that shm doesn't seem to have any users of ipc_rcu_get/putref(), so I don't see anything to race against. So it has the same buggy pattern, but I don't think it can trigger anything. And ipc/sem.c has the same bug-pattern in freeary(). It does sem_unlock(sma) followed by ipc_rcu_putref(sma);, and it *does* seem to have code that that can race against (sem_lock_and_putref()). The whole sem_putref() tries to be careful and gets the lock for the last ref, but getting the lock doesn't help when freeary() does the refcount access without it. The semaphore case seems to argue fairly strongly for an atomic_t refcount, because right now it does a lot of sem_putref() stuff that just gets the lock for the putref. So not only is that insufficient (if it races against freeary(), but it's also more expensive than just an atomic refcount would have been. I dunno. I'm still not sure this is triggerable, but it looks bad. But both the semaphore case and the msg cases seem to be solvable by moving the unlock down, and shm seem to have no getref/putref users to race with, so this (whitespace-damaged) patch *may* be sufficient: diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..338d8e2b589b 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) expunge_all(msq, -EIDRM); ss_wakeup(msq-q_senders, 1); msg_rmid(ns, msq); - msg_unlock(msq); tmp = msq-q_messages.next; while (tmp != msq-q_messages) { @@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) atomic_sub(msq-q_cbytes, ns-msg_bytes); security_msg_queue_free(msq); ipc_rcu_putref(msq); + msg_unlock(msq); } /* diff --git a/ipc/sem.c b/ipc/sem.c index 58d31f1c1eb5..1cf024b9eac0 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) /* Remove the semaphore set from the IDR */ sem_rmid(ns, sma); - sem_unlock(sma); wake_up_sem_queue_do(tasks); ns-used_sems -= sma-sem_nsems; security_sem_free(sma); ipc_rcu_putref(sma); + sem_unlock(sma); } static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in, int version) (I didn't check very carefully, it's possible that we end up having some locking problem if we move the unlocks down later, but it *looks* fine) Anybody? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds torva...@linux-foundation.org wrote: I dunno. I'm still not sure this is triggerable, but it looks bad. But both the semaphore case and the msg cases seem to be solvable by moving the unlock down, and shm seem to have no getref/putref users to race with, so this (whitespace-damaged) patch *may* be sufficient: Well, the patch doesn't seem to cause any problems, at least neither lockdep nor spinlock sleep debugging complains. I have no idea whether it actually fixes any problems, though. I do wonder if this might explain the problem Emmanuel saw. A double free of a RCU-freeable object would possibly result in exactly the kind of mess that Emmanuel reported with the semaphore scalability patches. Emmanuel, can you try the attached patch? I think it applies cleanly on top of the scalability series too without any changes, but I didn't check if the patches perhaps changed some of the naming or something. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds torva...@linux-foundation.org wrote: Emmanuel, can you try the attached patch? I think it applies cleanly on top of the scalability series too without any changes, but I didn't check if the patches perhaps changed some of the naming or something. I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote: Hi Linus, On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds torva...@linux-foundation.org wrote: Emmanuel, can you try the attached patch? I think it applies cleanly on top of the scalability series too without any changes, but I didn't check if the patches perhaps changed some of the naming or something. I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. Not sure which one liner you refer to, but, if you haven't already done so, please try with these fixes (queued in linux-next): http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 I've been trying to reproduce your twilight zone problem on five different machines now without any luck. Is there anything you're doing to trigger the issue? Does the machine boot ok and then do weird things, say after X starts, open some program, etc? Thanks, Davidlohr -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty benist...@gmail.com wrote: I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. Ok, please send your patch so that I can double-check what you did, but it was simple enough that you probably did the right thing. Sad. Your case definitely looks like a double rcu-free, as shown by the fact that when you enabled SLUB debugging the oops happened with the use-after-free pattern (it's __rcu_reclaim() doing the head-func(head); thing, and func is 0x6b6b6b6b6b6b6b6b, so head has already been free'd once). So ipc_rcu_putref() and a refcounting error looked very promising.as a potential explanation. The 'un' undo structure is also free'd with rcu, but the locking around that seems much more robust. The undo entry is on two lists (sma-list_id, under sma-sem_perm.lock and ulp-list_proc, under ulp-lock). But those locks are actually tested with assert_spin_locked() in all the relevant places, and the code actually looks sane. So I had high hopes for ipc_rcu_putref()... Hmm. Except for exit_sem() that does odd things. You have preemption enabled, don't you? exit_sem() does a lookup of the first list_proc entry under tcy_read_lock to lookup un-semid, and then it drops the rcu read lock. At which point un is no longer reliable, I think. But then it still uses un-semid, rather than the stable value it looked up under the rcu read lock. Which looks bogus. So I'd like you to test a few more things: (a) In exit_sem(), can you change the sma = sem_lock_check(tsk-nsproxy-ipc_ns, un-semid); to use just semid rather than un-semid, because I don't think un is stable here. (b) does the problem go away if you change disable CONFIG_PREEMPT (perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?) Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote: On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty benist...@gmail.com wrote: I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. Ok, please send your patch so that I can double-check what you did, but it was simple enough that you probably did the right thing. Sad. Your case definitely looks like a double rcu-free, as shown by the fact that when you enabled SLUB debugging the oops happened with the use-after-free pattern (it's __rcu_reclaim() doing the head-func(head); thing, and func is 0x6b6b6b6b6b6b6b6b, so head has already been free'd once). So ipc_rcu_putref() and a refcounting error looked very promising.as a potential explanation. The 'un' undo structure is also free'd with rcu, but the locking around that seems much more robust. The undo entry is on two lists (sma-list_id, under sma-sem_perm.lock and ulp-list_proc, under ulp-lock). But those locks are actually tested with assert_spin_locked() in all the relevant places, and the code actually looks sane. So I had high hopes for ipc_rcu_putref()... Hmm. Except for exit_sem() that does odd things. You have preemption enabled, don't you? exit_sem() does a lookup of the first list_proc entry under tcy_read_lock to lookup un-semid, and then it drops the rcu read lock. At which point un is no longer reliable, I think. But then it still uses un-semid, rather than the stable value it looked up under the rcu read lock. Which looks bogus. So I'd like you to test a few more things: (a) In exit_sem(), can you change the sma = sem_lock_check(tsk-nsproxy-ipc_ns, un-semid); to use just semid rather than un-semid, because I don't think un is stable here. Well that's not really the case in the new code. We don't drop the rcu read lock until the end of the loop, in sem_unlock(). However, I just noticed that we're checking sma for error after trying to acquire sma-sem_perm.lock: sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid); sem_lock(sma, NULL, -1); /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; The IS_ERR(sma) check should be right after the sem_obtain_object_check() call instead. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso davidlohr.bu...@hp.com wrote: Not sure which one liner you refer to, but, if you haven't already done so, please try with these fixes (queued in linux-next): http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 I've been trying to reproduce your twilight zone problem on five different machines now without any luck. Is there anything you're doing to trigger the issue? Does the machine boot ok and then do weird things, say after X starts, open some program, etc? I was missing a9cead0, thanks. What I usually do is starting a standard session, which looks like this: init-+-5*[agetty] |-bash---startx---xinit-+-X---2*[{X}] | `-dwm---sh---sleep |-bash---chromium-+-chromium | |-chromium-+-chromium | | `-2*[{chromium}] | |-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]] | `-65*[{chromium}] |-crond |-dbus-daemon |-klogd |-syslogd |-tmux-+-alsamixer | |-bash---bash | |-bash | |-htop | `-newsbeuter---{newsbeuter} |-udevd |-urxvtd-+-bash---pstree |`-bash---tmux `-wpa_supplicant Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com wrote: Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. I suspect it's about preemption, and the build just results in enough scheduling load that you start hitting whatever race there is. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 That looks like exactly the exit_sem() bug that Davidlohr was talking about, where the /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; should be moved to *before* the sem_lock(sma, NULL, -1); call. And apparently the bug I had found is already fixed in -next. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com wrote: Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. I suspect it's about preemption, and the build just results in enough scheduling load that you start hitting whatever race there is. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 That looks like exactly the exit_sem() bug that Davidlohr was talking about, where the /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; should be moved to *before* the sem_lock(sma, NULL, -1); call. And apparently the bug I had found is already fixed in -next. I just tried the 7 original patches + the 2 one liners from -next + modified Linus' patch (attached) on the top of 3.9-rc4 using PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained above. I was building two packages at the same time, went away for 30 seconds, came back and everything froze as soon as I touched the laptop's touchpad. Maybe a coincidence but anyway... Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty benist...@gmail.com wrote: I just tried the 7 original patches + the 2 one liners from -next + modified Linus' patch (attached) .. that patch looks fine. on the top of 3.9-rc4 using PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained above. I was building two packages at the same time, went away for 30 seconds, came back and everything froze as soon as I touched the laptop's touchpad. Maybe a coincidence but anyway... Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed This came from the gcc build? That's just crazy. No normal app uses sysv semaphores. I have an older gcc build environment, and some grepping shows it has some ipc semaphore use in the libstdc++ testsuite, and some libmudflap hooks, but that should be very very minor. You seem to trigger errors really trivially easily, which is really odd. It's sounding less and less like some subtle race, and more like the error just happens all the time. If you can make even the gcc build generate errors, I don't think they can be some rare blue-moon thing. I notice that your dmesg says that your kernel is compiled by gcc-4.8.1 prerelease. Is there any chance that you could try to install a known-stable gcc, like 4.7.2 or something. It's entirely possible that it's a kernel bug even if it's triggered by some more aggressive compiler optimization or something, but it would be really good to try to see if this might be gcc-specific. For example, I wonder if your gcc might miscompile idr_alloc() or something, so that we get the same ID for different ipc objects. That would certainly potentially cause chaos. Hmm? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds torva...@linux-foundation.org wrote: Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed This came from the gcc build? yes, very early in the build process, IIRC this line was repeated a few times and the build just stalled. That's just crazy. No normal app uses sysv semaphores. I have an older gcc build environment, and some grepping shows it has some ipc semaphore use in the libstdc++ testsuite, and some libmudflap hooks, but that should be very very minor. I notice that your dmesg says that your kernel is compiled by gcc-4.8.1 prerelease. Is there any chance that you could try to install a known-stable gcc, like 4.7.2 or something. It's entirely possible that it's a kernel bug even if it's triggered by some more aggressive compiler optimization or something, but it would be really good to try to see if this might be gcc-specific. I could build a kernel on another machine on which 4.7.2 is installed, kernel oops'd as well: http://i.imgur.com/uk6gmq1.jpg FWIW, I have a few things disabled in my config so here is the one I used, maybe I'm missing something (but again, everything works perfectly with your tree): https://gist.github.com/anonymous/5275566 Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same issues, unsurprisingly I guess. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > Whichever way we go, we should get a wiggle on - this has been hanging > > around for too long. Dave, do you have time to determine whether > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > than max") fixes things up? > > Ok, with that reverted it's been grinding away for a few hours without > incident. > Normally I see the oops within a minute or so. > OK, thanks, I queued a revert: From: Andrew Morton Subject: revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton --- ipc/msg.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max ipc/msg.c --- a/ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max +++ a/ipc/msg.c @@ -820,17 +820,15 @@ long do_msgrcv(int msqid, void __user *b struct msg_msg *copy = NULL; unsigned long copy_number = 0; - ns = current->nsproxy->ipc_ns; - if (msqid < 0 || (long) bufsz < 0) return -EINVAL; if (msgflg & MSG_COPY) { - copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax), - msgflg, , _number); + copy = prepare_copy(buf, bufsz, msgflg, , _number); if (IS_ERR(copy)) return PTR_ERR(copy); } mode = convert_mode(, msgflg); + ns = current->nsproxy->ipc_ns; msq = msg_lock_check(ns, msqid); if (IS_ERR(msq)) { _ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > Whichever way we go, we should get a wiggle on - this has been hanging > around for too long. Dave, do you have time to determine whether > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > than max") fixes things up? Ok, with that reverted it's been grinding away for a few hours without incident. Normally I see the oops within a minute or so. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso wrote: > On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: > > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds > > wrote: > > > And you never see this problem without Rik's patches? > > > > No, never. > > > > > Could you bisect > > > *which* patch it starts with? Are the first four ones ok (the moving > > > of the locking around, but without the fine-grained ones), for > > > example? > > > > With the first four patches only, I got some X server freeze (just tried > > once). > > Going over the code again, I found a potential recursive spinlock scenario. > Andrew, if you have no objections, please queue this up. > > Thanks. > > ---8<--- > > From: Davidlohr Bueso > Subject: [PATCH] ipc, sem: prevent possible deadlock > > In semctl_main(), when cmd == GETALL, we're locking > sma->sem_perm.lock (through sem_lock_and_putref), yet > after the conditional, we lock it again. > Unlock sma right after exiting the conditional. > > Signed-off-by: Davidlohr Bueso > --- > ipc/sem.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/ipc/sem.c b/ipc/sem.c > index 1a2913d..f257afe 100644 > --- a/ipc/sem.c > +++ b/ipc/sem.c > @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int > semid, int semnum, > err = -EIDRM; > goto out_free; > } > + sem_unlock(sma, -1); > } > > sem_lock(sma, NULL, -1); Looks right. Do we need the locking at all? What does it actually do? sem_lock_and_putref(sma); if (sma->sem_perm.deleted) { sem_unlock(sma, -1); err = -EIDRM; goto out_free; } sem_unlock(sma, -1); We're taking the lock, testing an int and then dropping the lock. What's the point in that? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 02:07 PM, Sasha Levin wrote: On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. This logic was from the original code, which I also found to be quite confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? It is uglier than you think... On success, find_alloc_undo will call rcu_read_lock, so we have the rcu_read_lock held twice :( Some of the ipc code is quite ugly, but making too many large changes at once is just asking for trouble. I suspect we're going to have to untangle this one bit at a time... -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:59 PM, Davidlohr Bueso wrote: From: Davidlohr Bueso Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma->sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/20/2013 03:55 PM, Rik van Riel wrote: > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: > On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: >> On 03/20/2013 03:55 PM, Rik van Riel wrote: >>> This series makes the sysv semaphore code more scalable, >>> by reducing the time the semaphore lock is held, and making >>> the locking more scalable for semaphore arrays with multiple >>> semaphores. >> >> Hi Rik, >> >> Another issue that came up is: >> >> [ 96.347341] >> [ 96.348085] [ BUG: lock held when returning to user space! ] >> [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G >> W >> [ 96.360300] >> [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still >> held! >> [ 96.362019] 1 lock held by trinity-child9/7583: >> [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] >> SYSC_semtimedop+0x1fb/0xec0 >> >> It seems that we can leave semtimedop without releasing the rcu read lock. >> >> I'm a bit confused by what's going on in semtimedop with regards to rcu read >> lock, it >> seems that this behaviour is actually intentional? >> >> rcu_read_lock(); >> sma = sem_obtain_object_check(ns, semid); >> if (IS_ERR(sma)) { >> if (un) >> rcu_read_unlock(); >> error = PTR_ERR(sma); >> goto out_free; >> } >> >> When I've looked at that it seems that not releasing the read lock was (very) >> intentional. > > This logic was from the original code, which I also found to be quite > confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. >> >> After that, the only code path that would release the lock starts with: >> >> if (un) { >> ... >> >> So we won't release the lock at all if un is NULL? >> > > Not necessarily, we do release everything at the end of the function: > > out_unlock_free: > sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? if (un->semid == -1) { rcu_read_unlock(); goto out_unlock_free; } [...] out_unlock_free: sem_unlock(sma, locknum); Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds > wrote: > > And you never see this problem without Rik's patches? > > No, never. > > > Could you bisect > > *which* patch it starts with? Are the first four ones ok (the moving > > of the locking around, but without the fine-grained ones), for > > example? > > With the first four patches only, I got some X server freeze (just tried > once). Going over the code again, I found a potential recursive spinlock scenario. Andrew, if you have no objections, please queue this up. Thanks. ---8<--- From: Davidlohr Bueso Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma->sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso --- ipc/sem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/ipc/sem.c b/ipc/sem.c index 1a2913d..f257afe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum, err = -EIDRM; goto out_free; } + sem_unlock(sma, -1); } sem_lock(sma, NULL, -1); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote: > On 03/20/2013 03:55 PM, Rik van Riel wrote: > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > Hi Rik, > > Another issue that came up is: > > [ 96.347341] > [ 96.348085] [ BUG: lock held when returning to user space! ] > [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G > W > [ 96.360300] > [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still > held! > [ 96.362019] 1 lock held by trinity-child9/7583: > [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] > SYSC_semtimedop+0x1fb/0xec0 > > It seems that we can leave semtimedop without releasing the rcu read lock. > > I'm a bit confused by what's going on in semtimedop with regards to rcu read > lock, it > seems that this behaviour is actually intentional? > > rcu_read_lock(); > sma = sem_obtain_object_check(ns, semid); > if (IS_ERR(sma)) { > if (un) > rcu_read_unlock(); > error = PTR_ERR(sma); > goto out_free; > } > > When I've looked at that it seems that not releasing the read lock was (very) > intentional. > > After that, the only code path that would release the lock starts with: > > if (un) { > ... > > So we won't release the lock at all if un is NULL? Intentions notwithstanding, it is absolutely required to exit any and all RCU read-side critical sections prior to going into user mode. I suggest removing the "if (un)". Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: > On 03/20/2013 03:55 PM, Rik van Riel wrote: > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > Hi Rik, > > Another issue that came up is: > > [ 96.347341] > [ 96.348085] [ BUG: lock held when returning to user space! ] > [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G > W > [ 96.360300] > [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still > held! > [ 96.362019] 1 lock held by trinity-child9/7583: > [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] > SYSC_semtimedop+0x1fb/0xec0 > > It seems that we can leave semtimedop without releasing the rcu read lock. > > I'm a bit confused by what's going on in semtimedop with regards to rcu read > lock, it > seems that this behaviour is actually intentional? > > rcu_read_lock(); > sma = sem_obtain_object_check(ns, semid); > if (IS_ERR(sma)) { > if (un) > rcu_read_unlock(); > error = PTR_ERR(sma); > goto out_free; > } > > When I've looked at that it seems that not releasing the read lock was (very) > intentional. This logic was from the original code, which I also found to be quite confusing. > > After that, the only code path that would release the lock starts with: > > if (un) { > ... > > So we won't release the lock at all if un is NULL? > Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. This logic was from the original code, which I also found to be quite confusing. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Thanks, Davidlohr -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Intentions notwithstanding, it is absolutely required to exit any and all RCU read-side critical sections prior to going into user mode. I suggest removing the if (un). Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds torva...@linux-foundation.org wrote: And you never see this problem without Rik's patches? No, never. Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? With the first four patches only, I got some X server freeze (just tried once). Going over the code again, I found a potential recursive spinlock scenario. Andrew, if you have no objections, please queue this up. Thanks. ---8--- From: Davidlohr Bueso davidlohr.bu...@hp.com Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma-sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com --- ipc/sem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/ipc/sem.c b/ipc/sem.c index 1a2913d..f257afe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum, err = -EIDRM; goto out_free; } + sem_unlock(sma, -1); } sem_lock(sma, NULL, -1); -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. This logic was from the original code, which I also found to be quite confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? if (un-semid == -1) { rcu_read_unlock(); goto out_unlock_free; } [...] out_unlock_free: sem_unlock(sma, locknum); Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Thanks, Sasha -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:59 PM, Davidlohr Bueso wrote: From: Davidlohr Bueso davidlohr.bu...@hp.com Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma-sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com Reviewed-by: Rik van Riel r...@redhat.com -- All rights reversed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 02:07 PM, Sasha Levin wrote: On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. This logic was from the original code, which I also found to be quite confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? It is uglier than you think... On success, find_alloc_undo will call rcu_read_lock, so we have the rcu_read_lock held twice :( Some of the ipc code is quite ugly, but making too many large changes at once is just asking for trouble. I suspect we're going to have to untangle this one bit at a time... -- All rights reversed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso davidlohr.bu...@hp.com wrote: On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds torva...@linux-foundation.org wrote: And you never see this problem without Rik's patches? No, never. Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? With the first four patches only, I got some X server freeze (just tried once). Going over the code again, I found a potential recursive spinlock scenario. Andrew, if you have no objections, please queue this up. Thanks. ---8--- From: Davidlohr Bueso davidlohr.bu...@hp.com Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma-sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com --- ipc/sem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/ipc/sem.c b/ipc/sem.c index 1a2913d..f257afe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum, err = -EIDRM; goto out_free; } + sem_unlock(sma, -1); } sem_lock(sma, NULL, -1); Looks right. Do we need the locking at all? What does it actually do? sem_lock_and_putref(sma); if (sma-sem_perm.deleted) { sem_unlock(sma, -1); err = -EIDRM; goto out_free; } sem_unlock(sma, -1); We're taking the lock, testing an int and then dropping the lock. What's the point in that? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/