subject:"Re\: ipc,sem\: sysv semaphore scalability"

Re: ipc,sem: sysv semaphore scalability

2013-05-06 Thread Jörn Engel

On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote:
> On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> > Blockconsole currently lives here:
> > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/
> 
> Tja, if only that were upstream...

Linus has a pull request.  If he prefers to change code until the
relevant bits fit into 80x25, that is his choice. ;)

Jörn

--
The only good bug is a dead bug.
-- Starship Troopers
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-06 Thread Jörn Engel

On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote:
 On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
  Blockconsole currently lives here:
  https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/
 
 Tja, if only that were upstream...

Linus has a pull request.  If he prefers to change code until the
relevant bits fit into 80x25, that is his choice. ;)

Jörn

--
The only good bug is a dead bug.
-- Starship Troopers
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Borislav Petkov

On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> Blockconsole currently lives here:
> https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Tja, if only that were upstream...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Jörn Engel

On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote:
> 
> I could reproduce it but could you please let me know what would be
> the right tools I should use to catch the original oops?
> This is what I got but I doubt it will be helpful:
> http://i.imgur.com/Mewi1hC.jpg

You could use either netconsole or blockconsole.  Netconsole requires
a second machine to capture the information, blockconsole requires a
usb key or something similar to write to.  In both cases you will get
the entire console output in a file, often including very helpful
messages leading up to the crash.

Blockconsole currently lives here:
https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Jörn Engel

On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote:
 
 I could reproduce it but could you please let me know what would be
 the right tools I should use to catch the original oops?
 This is what I got but I doubt it will be helpful:
 http://i.imgur.com/Mewi1hC.jpg

You could use either netconsole or blockconsole.  Netconsole requires
a second machine to capture the information, blockconsole requires a
usb key or something similar to write to.  In both cases you will get
the entire console output in a file, often including very helpful
messages leading up to the crash.

Blockconsole currently lives here:
https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Borislav Petkov

On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
 Blockconsole currently lives here:
 https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Tja, if only that were upstream...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-03 Thread Peter Hurley


On 03/29/2013 03:01 PM, Dave Jones wrote:

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
  > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
  >
  > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
  > >
  > >  > Whichever way we go, we should get a wiggle on - this has been hanging
  > >  > around for too long.  Dave, do you have time to determine whether
  > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
  > >  > than max") fixes things up?
  > >
  > > Ok, with that reverted it's been grinding away for a few hours without 
incident.
  > > Normally I see the oops within a minute or so.
  > >
  >
  > OK, thanks, I queued a revert:
  >
  > From: Andrew Morton 
  > Subject: revert "ipc: don't allocate a copy larger than max"
  >
  > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
  > during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
[...snip...]

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave


Andrew,

I just realized you're still carrying

commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17
Author: Andrew Morton 
Date:   Fri Apr 26 10:55:12 2013 +1000

revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 

Please drop.

As quoted above, the testing on mainline that attributed
the observed oops to the reverted patch was due to a config
error.

As Linus pointed out below, this patch fixes real bugs.

On 04/02/2013 03:53 PM, Sasha Levin wrote:
> On 04/02/2013 01:52 PM, Linus Torvalds wrote:
>> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>>>
>>> By just playing with the 'msgsz' parameter with MSG_COPY set.
>>
>> Hmm. Looking closer, I suspect you're testing without commit
>> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
>> should limit the size passed in to prepare_copy -> load_copy to
>> msg_ctlmax.
>
> That commit has a revert in the -next trees, do we need a revert
> of the revert?
>
>commit ff6577a3e714ccae02d4400e989762c19c37b0b3
>Author: Andrew Morton 
>Date:   Wed Mar 27 10:24:02 2013 +1100
>
>revert "ipc: don't allocate a copy larger than max"
>
>Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
>during trinity testing.
>
>Cc: Peter Hurley 
>Cc: Stanislav Kinsbursky 
>Reported-by: Dave Jones 
>Cc: 
>Signed-off-by: Andrew Morton 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-05-03 Thread Peter Hurley


On 03/29/2013 03:01 PM, Dave Jones wrote:

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
   On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote:
  
On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
   
  Whichever way we go, we should get a wiggle on - this has been hanging
  around for too long.  Dave, do you have time to determine whether
  reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger
  than max) fixes things up?
   
Ok, with that reverted it's been grinding away for a few hours without 
incident.
Normally I see the oops within a minute or so.
   
  
   OK, thanks, I queued a revert:
  
   From: Andrew Morton a...@linux-foundation.org
   Subject: revert ipc: don't allocate a copy larger than max
  
   Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
   during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [812c24ca] testmsg.isra.5+0x1a/0x60
[...snip...]

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave


Andrew,

I just realized you're still carrying

commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17
Author: Andrew Morton a...@linux-foundation.org
Date:   Fri Apr 26 10:55:12 2013 +1000

revert ipc: don't allocate a copy larger than max

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley pe...@hurleysoftware.com
Cc: Stanislav Kinsbursky skinsbur...@parallels.com
Reported-by: Dave Jones da...@redhat.com
Cc: sta...@vger.kernel.org
Signed-off-by: Andrew Morton a...@linux-foundation.org

Please drop.

As quoted above, the testing on mainline that attributed
the observed oops to the reverted patch was due to a config
error.

As Linus pointed out below, this patch fixes real bugs.

On 04/02/2013 03:53 PM, Sasha Levin wrote:
 On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote:

 By just playing with the 'msgsz' parameter with MSG_COPY set.

 Hmm. Looking closer, I suspect you're testing without commit
 88b9e456b164 (ipc: don't allocate a copy larger than max). That
 should limit the size passed in to prepare_copy - load_copy to
 msg_ctlmax.

 That commit has a revert in the -next trees, do we need a revert
 of the revert?

commit ff6577a3e714ccae02d4400e989762c19c37b0b3
Author: Andrew Morton a...@linux-foundation.org
Date:   Wed Mar 27 10:24:02 2013 +1100

revert ipc: don't allocate a copy larger than max

Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
during trinity testing.

Cc: Peter Hurley pe...@hurleysoftware.com
Cc: Stanislav Kinsbursky skinsbur...@parallels.com
Reported-by: Dave Jones da...@redhat.com
Cc: sta...@vger.kernel.org
Signed-off-by: Andrew Morton a...@linux-foundation.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-16 Thread Andrew Morton

On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton  
wrote:

> Do we need the locking at all?  What does it actually do?
> 
>   sem_lock_and_putref(sma);
>   if (sma->sem_perm.deleted) {
>   sem_unlock(sma, -1);
>   err = -EIDRM;
>   goto out_free;
>   }
>   sem_unlock(sma, -1);
> 
> We're taking the lock, testing an int and then dropping the lock. 
> What's the point in that?

Rikpoke.

The new semctl_main() is now taking a lock, testing
sma->sem_perm.deleted then dropping that lock.  It looks wrong.  What
is that lock testing against?  What prevents .deleted from changing
value 1ns after we dropped that lock?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-16 Thread Andrew Morton

On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton a...@linux-foundation.org 
wrote:

 Do we need the locking at all?  What does it actually do?
 
   sem_lock_and_putref(sma);
   if (sma-sem_perm.deleted) {
   sem_unlock(sma, -1);
   err = -EIDRM;
   goto out_free;
   }
   sem_unlock(sma, -1);
 
 We're taking the lock, testing an int and then dropping the lock. 
 What's the point in that?

Rikpoke.

The new semctl_main() is now taking a lock, testing
sma-sem_perm.deleted then dropping that lock.  It looks wrong.  What
is that lock testing against?  What prevents .deleted from changing
value 1ns after we dropped that lock?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Dave Jones

On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote:
 > On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 > > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
 > >>
 > >> By just playing with the 'msgsz' parameter with MSG_COPY set.
 > > 
 > > Hmm. Looking closer, I suspect you're testing without commit
 > > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
 > > should limit the size passed in to prepare_copy -> load_copy to
 > > msg_ctlmax.
 > 
 > That commit has a revert in the -next trees, do we need a revert
 > of the revert?

Yeah, I told Andrew to drop that, but I think he's travelling.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin

On 04/02/2013 01:52 PM, Linus Torvalds wrote:
> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>>
>> By just playing with the 'msgsz' parameter with MSG_COPY set.
> 
> Hmm. Looking closer, I suspect you're testing without commit
> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
> should limit the size passed in to prepare_copy -> load_copy to
> msg_ctlmax.

That commit has a revert in the -next trees, do we need a revert
of the revert?

commit ff6577a3e714ccae02d4400e989762c19c37b0b3
Author: Andrew Morton 
Date:   Wed Mar 27 10:24:02 2013 +1100

revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 



Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> By just playing with the 'msgsz' parameter with MSG_COPY set.

Hmm. Looking closer, I suspect you're testing without commit
88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
should limit the size passed in to prepare_copy -> load_copy to
msg_ctlmax.

Now, I think it's possibly still a good idea to limit bufsz to INT_MAX
regardless, but as far as I can see that prepare_copy -> load_copy
path is the only place that can get confused. Everybody else uses
size_t (or "long" in the case of r_maxsize) as far as I can tell.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> If you guys are already looking at this, the conversions between size_t,
> long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
> trigger anything from:

Good catch.

Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX".

I suspect we should change some of the "int" arguments to "size_t" too
so that we don't have these kinds of odd "different routines see
different values due to subtle casting errors", but in the end we
don't really want to ever help people have these kinds of potential
overflow issues. We already limit normal read/write/sendmsg etc to
INT_MAX (although we tend to *truncate* it to INT_MAX rather than
return an error, but I think the simpler patch here is preferable
unless somebody complains).

Comments?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin

On 03/29/2013 03:36 PM, Peter Hurley wrote:
> On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
>> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>>>
>>> Here's an oops I just hit..
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 000f
>>> IP: [] testmsg.isra.5+0x1a/0x60
>>
>> Btw, looking at the code leading up to this, what the f*ck is wrong
>> with the IPC stuff?
>>
>> It's using the generic list stuff for most of the lists, but then it
>> open-codes the accesses.
>>
>> So instead of using
>>
>>for_each_entry(walk_msg, >q_messages, m_list) {
>>   ..
>>}
>>
>> the ipc/msg.c code does all that by hand, with
>>
>>tmp = msq->q_messages.next;
>>while (tmp != >q_messages) {
>>   struct msg_msg *walk_msg;
>>
>>   walk_msg = list_entry(tmp, struct msg_msg, m_list);
>>   ...
>>   tmp = tmp->next;
>>}
>>
>> Ugh. The code is near unreadable. And then it has magic memory
>> barriers etc, implying that it doesn't lock the data structures, but
>> no comments about them. See expunge_all() and pipelined_send().
>>
>> The code seems entirely random, and it's badly set up (annoyance of
>> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
>> you have to spend more effort looking for them, and (b) they won't get
>> inlined).
>>
>> Clearly nobody has cared for the crazy IPC message code in a long time.
> 
> Exactly that's what my patch series does; clean this mess up.
> 
> This is what I wrote to Andrew a couple of days ago.
> 
> On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
> I just figured out how the queue is being corrupted and why my series
>> fixes it.
>>
>>
>> With MSG_COPY set, the queue scan can exit with the local variable
> 'msg'
>> pointing to a real msg if the msg_counter never reaches the
> copy_number.
>>
>> The remaining execution looks like this:
>>
>>  if (!IS_ERR(msg)) {
>>  
>>  if (msgflg & MSG_COPY)
>>  goto out_unlock;
>>  
>>
>> out_unlock:
>>  msg_unlock(msq);
>>  break;
>>  }
>>  }
>>  if (IS_ERR(msg))
>>  
>>
>>  bufsz = msg_handler();
>>  free_msg(msg);  << msg never unlinked
>>
>>
>> Since the msg should not have been found (because it failed the match
>> criteria), the if (!IS_ERR(msg)) clause should never have executed.
>>
>> That's why my refactor fixes resolve this; because msg is not
>> inadvertently treated as a found msg.
>>
>> But let's be honest; the real bug here is the poor structure of this
>> function that even made this possible. The deepest nesting executes a
>> goto to a label in the middle of an if clause. Yuck! No wonder this
>> thing's fragile.
>>
>> So my recommendation still stands. The series that fixes this has been
>> getting tested in linux-next for a month. Fixing this some other way
> is
>> just asking for more trouble.
>>
>> But why not just revert MSG_COPY altogether for 3.9?

If you guys are already looking at this, the conversions between size_t,
long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
trigger anything from:

[   33.046572] BUG: unable to handle kernel paging request at 88003c2c7000
[   33.047721] IP: [] bad_from_user+0x4/0x6
[   33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060
[   33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   33.050029] Modules linked in:
[   33.050029] CPU 0
[   33.050029] Pid: 6885, comm: a.out Tainted: GW
3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[   33.050029] RIP: 0010:[]  [] 
bad_from_user+0x4/0x6
[   33.050029] RSP: 0018:88003462be40  EFLAGS: 00010246
[   33.050029] RAX:  RBX: fffb RCX: ff06ae2b
[   33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000
[   33.050029] RBP: 88003462be88 R08: 0280 R09: 
[   33.050029] R10:  R11:  R12: fffb
[   33.050029] R13: 7fffed36d2a0 R14:  R15: 
[   33.050029] FS:  7f6990044700() GS:88003dc0() 
knlGS:
[   33.050029] CS:  0010 DS:  ES:  CR0: 80050033
[   33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0
[   33.050029] DR0:  DR1:  DR2: 
[   33.050029] DR3:  DR6: 0ff0 DR7: 0400
[   33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 
880034cb3000)
[   33.050029] Stack:
[   33.050029]  8192a6a9 88003462be98 88003b331e00 
88003ddd01e0
[   33.050029]    0001 

[   33.050029]   88003462bf68 8192bb34 

[   33.050029] Call Trace:
[   33.050029]  [] ?

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin

On 03/29/2013 03:36 PM, Peter Hurley wrote:
 On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
 On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote:

 Here's an oops I just hit..

 BUG: unable to handle kernel NULL pointer dereference at 000f
 IP: [812c24ca] testmsg.isra.5+0x1a/0x60

 Btw, looking at the code leading up to this, what the f*ck is wrong
 with the IPC stuff?

 It's using the generic list stuff for most of the lists, but then it
 open-codes the accesses.

 So instead of using

for_each_entry(walk_msg, msq-q_messages, m_list) {
   ..
}

 the ipc/msg.c code does all that by hand, with

tmp = msq-q_messages.next;
while (tmp != msq-q_messages) {
   struct msg_msg *walk_msg;

   walk_msg = list_entry(tmp, struct msg_msg, m_list);
   ...
   tmp = tmp-next;
}

 Ugh. The code is near unreadable. And then it has magic memory
 barriers etc, implying that it doesn't lock the data structures, but
 no comments about them. See expunge_all() and pipelined_send().

 The code seems entirely random, and it's badly set up (annoyance of
 the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
 you have to spend more effort looking for them, and (b) they won't get
 inlined).

 Clearly nobody has cared for the crazy IPC message code in a long time.
 
 Exactly that's what my patch series does; clean this mess up.
 
 This is what I wrote to Andrew a couple of days ago.
 
 On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
 I just figured out how the queue is being corrupted and why my series
 fixes it.


 With MSG_COPY set, the queue scan can exit with the local variable
 'msg'
 pointing to a real msg if the msg_counter never reaches the
 copy_number.

 The remaining execution looks like this:

  if (!IS_ERR(msg)) {
  
  if (msgflg  MSG_COPY)
  goto out_unlock;
  

 out_unlock:
  msg_unlock(msq);
  break;
  }
  }
  if (IS_ERR(msg))
  

  bufsz = msg_handler();
  free_msg(msg);   msg never unlinked


 Since the msg should not have been found (because it failed the match
 criteria), the if (!IS_ERR(msg)) clause should never have executed.

 That's why my refactor fixes resolve this; because msg is not
 inadvertently treated as a found msg.

 But let's be honest; the real bug here is the poor structure of this
 function that even made this possible. The deepest nesting executes a
 goto to a label in the middle of an if clause. Yuck! No wonder this
 thing's fragile.

 So my recommendation still stands. The series that fixes this has been
 getting tested in linux-next for a month. Fixing this some other way
 is
 just asking for more trouble.

 But why not just revert MSG_COPY altogether for 3.9?

If you guys are already looking at this, the conversions between size_t,
long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
trigger anything from:

[   33.046572] BUG: unable to handle kernel paging request at 88003c2c7000
[   33.047721] IP: [83dbcb34] bad_from_user+0x4/0x6
[   33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060
[   33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   33.050029] Modules linked in:
[   33.050029] CPU 0
[   33.050029] Pid: 6885, comm: a.out Tainted: GW
3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[   33.050029] RIP: 0010:[83dbcb34]  [83dbcb34] 
bad_from_user+0x4/0x6
[   33.050029] RSP: 0018:88003462be40  EFLAGS: 00010246
[   33.050029] RAX:  RBX: fffb RCX: ff06ae2b
[   33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000
[   33.050029] RBP: 88003462be88 R08: 0280 R09: 
[   33.050029] R10:  R11:  R12: fffb
[   33.050029] R13: 7fffed36d2a0 R14:  R15: 
[   33.050029] FS:  7f6990044700() GS:88003dc0() 
knlGS:
[   33.050029] CS:  0010 DS:  ES:  CR0: 80050033
[   33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0
[   33.050029] DR0:  DR1:  DR2: 
[   33.050029] DR3:  DR6: 0ff0 DR7: 0400
[   33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 
880034cb3000)
[   33.050029] Stack:
[   33.050029]  8192a6a9 88003462be98 88003b331e00 
88003ddd01e0
[   33.050029]    0001 

[   33.050029]   88003462bf68 8192bb34 

[   33.050029] Call Trace:
[   33.050029]  [8192a6a9] ? load_msg+0x59/0x100
[   33.050029]  [8192bb34] do_msgrcv+0x74/0x5b0
[   33.050029]

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote:

 If you guys are already looking at this, the conversions between size_t,
 long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
 trigger anything from:

Good catch.

Let's just change the (long)bufsz  0 into bufsz  INT_MAX.

I suspect we should change some of the int arguments to size_t too
so that we don't have these kinds of odd different routines see
different values due to subtle casting errors, but in the end we
don't really want to ever help people have these kinds of potential
overflow issues. We already limit normal read/write/sendmsg etc to
INT_MAX (although we tend to *truncate* it to INT_MAX rather than
return an error, but I think the simpler patch here is preferable
unless somebody complains).

Comments?

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote:

 By just playing with the 'msgsz' parameter with MSG_COPY set.

Hmm. Looking closer, I suspect you're testing without commit
88b9e456b164 (ipc: don't allocate a copy larger than max). That
should limit the size passed in to prepare_copy - load_copy to
msg_ctlmax.

Now, I think it's possibly still a good idea to limit bufsz to INT_MAX
regardless, but as far as I can see that prepare_copy - load_copy
path is the only place that can get confused. Everybody else uses
size_t (or long in the case of r_maxsize) as far as I can tell.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin

On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote:

 By just playing with the 'msgsz' parameter with MSG_COPY set.
 
 Hmm. Looking closer, I suspect you're testing without commit
 88b9e456b164 (ipc: don't allocate a copy larger than max). That
 should limit the size passed in to prepare_copy - load_copy to
 msg_ctlmax.

That commit has a revert in the -next trees, do we need a revert
of the revert?

commit ff6577a3e714ccae02d4400e989762c19c37b0b3
Author: Andrew Morton a...@linux-foundation.org
Date:   Wed Mar 27 10:24:02 2013 +1100

revert ipc: don't allocate a copy larger than max

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley pe...@hurleysoftware.com
Cc: Stanislav Kinsbursky skinsbur...@parallels.com
Reported-by: Dave Jones da...@redhat.com
Cc: sta...@vger.kernel.org
Signed-off-by: Andrew Morton a...@linux-foundation.org



Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Dave Jones

On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote:
  On 04/02/2013 01:52 PM, Linus Torvalds wrote:
   On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin sasha.le...@oracle.com wrote:
  
   By just playing with the 'msgsz' parameter with MSG_COPY set.
   
   Hmm. Looking closer, I suspect you're testing without commit
   88b9e456b164 (ipc: don't allocate a copy larger than max). That
   should limit the size passed in to prepare_copy - load_copy to
   msg_ctlmax.
  
  That commit has a revert in the -next trees, do we need a revert
  of the revert?

Yeah, I told Andrew to drop that, but I think he's travelling.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-01 Thread Stanislav Kinsbursky

29.03.2013 22:43, Linus Torvalds пишет:

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:

>
>Now that I have that reverted, I'm not seeing msgrcv traces any more, but
>I've started seeing this..
>
>general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
"copy" allocation if

 mode == SEARCH_LESSEQUAL

Hello, Linus.
Sorry, but I don't see copy allocation leak.
Dummy message is allocated always in msgflg has MSG_COPY flag being set.
Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0.

but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?

I don't see, how this patch can help. And we should not release it until copy is
done in msg_handler, because msg is equal to copy.

Dummy copy message is release either by free_copy() (if msg is error) or
free_msg().

But there are two issues here definitely:

1) Poor SELinux support for message
copying. This issue was addressed by Stephen Smalley here:

https://lkml.org/lkml/2013/2/6/663

But look like he didn't sent the patch to Andrew.

2) Copy leak and queue corruption in case of
copy message wasn't found (this was mentioned by Peter in another thread; 
thanks for
catching this, Peter!), because msg will be a valid pointer and all the "message 
copy"
clean up logic doesn't work.

I like Peter's cleanup and fix series.  But if it looks like too much changes
for this old code, I have another small patch, which should fix the issue:

ipc: set msg back to -EAGAIN if copy wasn't performed

Make sure, that msg pointer is set back to error value in case of MSG_COPY
flag is set and desired message to copy wasn't found. This garantees, that 
msg
is either a error pointer or a copy address.
Otherwise last message in queue will be freed without unlinking from queue
(which leads to memory corruption) plus dummy allocated copy won't be 
released.

Signed-off-by: Stanislav Kinsbursky 

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf..fede1d0 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
goto out_unlock;
break;
}
+   msg = ERR_PTR(-EAGAIN);
} else
break;
msg_counter++;

  Linus

patch.diff

  ipc/msg.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..b841508556cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
msg = copy_msg(msg, copy);
if (IS_ERR(msg))
goto out_unlock;
+   copy = NULL;
break;
}
} else
@@ -976,10 +977,9 @@ out_unlock:
break;
}
}
-   if (IS_ERR(msg)) {
-   free_copy(copy);
+   free_copy(copy);
+   if (IS_ERR(msg))
return PTR_ERR(msg);
-   }

bufsz = msg_handler(buf, msg, bufsz);
free_msg(msg);

--
Best regards,
Stanislav Kinsbursky
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-04-01 Thread Stanislav Kinsbursky


29.03.2013 22:43, Linus Torvalds пишет:

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jonesda...@redhat.com  wrote:


Now that I have that reverted, I'm not seeing msgrcv traces any more, but
I've started seeing this..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
copy allocation if

 mode == SEARCH_LESSEQUAL



Hello, Linus.
Sorry, but I don't see copy allocation leak.
Dummy message is allocated always in msgflg has MSG_COPY flag being set.
Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0.


but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?



I don't see, how this patch can help. And we should not release it until copy is
done in msg_handler, because msg is equal to copy.

Dummy copy message is release either by free_copy() (if msg is error) or
free_msg().

But there are two issues here definitely:

1) Poor SELinux support for message
copying. This issue was addressed by Stephen Smalley here:

https://lkml.org/lkml/2013/2/6/663

But look like he didn't sent the patch to Andrew.

2) Copy leak and queue corruption in case of
copy message wasn't found (this was mentioned by Peter in another thread; 
thanks for
catching this, Peter!), because msg will be a valid pointer and all the message 
copy
clean up logic doesn't work.

I like Peter's cleanup and fix series.  But if it looks like too much changes
for this old code, I have another small patch, which should fix the issue:

ipc: set msg back to -EAGAIN if copy wasn't performed

Make sure, that msg pointer is set back to error value in case of MSG_COPY
flag is set and desired message to copy wasn't found. This garantees, that 
msg
is either a error pointer or a copy address.
Otherwise last message in queue will be freed without unlinking from queue
(which leads to memory corruption) plus dummy allocated copy won't be 
released.

Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf..fede1d0 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
goto out_unlock;
break;
}
+   msg = ERR_PTR(-EAGAIN);
} else
break;
msg_counter++;


  Linus


patch.diff


  ipc/msg.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..b841508556cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
msg = copy_msg(msg, copy);
if (IS_ERR(msg))
goto out_unlock;
+   copy = NULL;
break;
}
} else
@@ -976,10 +977,9 @@ out_unlock:
break;
}
}
-   if (IS_ERR(msg)) {
-   free_copy(copy);
+   free_copy(copy);
+   if (IS_ERR(msg))
return PTR_ERR(msg);
-   }

bufsz = msg_handler(buf, msg, bufsz);
free_msg(msg);




--
Best regards,
Stanislav Kinsbursky
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Linus Torvalds

On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel  wrote:
>
> Should we use "semid" here, like Linus suggested, instead of "un->semid"?

As Davidlohr noted, in linux-next the rcu read-lock is held over the
whole thing, so no, un->semid should be stable once "un" has been
re-looked-up under the semaphore lock.

In mainline, the problem is that the "sem_lock_check()" is done with
"un->semid" *after* we've dropped the RCU read-lock, so "un" at that
point is not reliable (it could be free'd at any time underneath us).

That said, I really *really* hate what both mainline and linux-next do
with the RCU read lock, and linux-next is arguably worse.

The whole "take the RCU lock in one place, and release it in another"
is confusing and bug-prone as hell. And linux-next made it worse: now
sem_lock() no longer takes the read-lock (it expects the caller to
take it), but sem_unlock() still drops the read-lock. This is all just
f*cking crazy.

The rule should be that the rcu read-lock is always and released at
the same "level". For example, find_alloc_undo() should just be called
with (and unconditionaly return with) the rcu read-lock held, and if
it needs to actually do an allocation, it can drop the rcu lock for
the duration of the allocation.

This whole "conditional locking" depending on error returns and on
whether we have undo's etc is bug-prone and confusing. And when you
have totally different locking rules for "sem_lock()" vs
"sem_unlock()", you know you're confused.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Emmanuel Benisty

Hi Davidlohr,

On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso
 wrote:
> Specially dropping the rcu read lock before the continue statement
> (sorry for not mentioning this in the last email).

I was missing this indeed, thanks. Still the same issues however...
I'll do some more testing on the same machine but with a totally
different environment, within tomorrow hopefully.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Rik van Riel


On 03/31/2013 01:01 AM, Davidlohr Bueso wrote:


diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;

rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}

sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);


Should we use "semid" here, like Linus suggested, instead of "un->semid"?


-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }

+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created



--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Rik van Riel


On 03/31/2013 01:01 AM, Davidlohr Bueso wrote:


diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;

rcu_read_lock();
un = list_entry_rcu(ulp-list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}

sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid);


Should we use semid here, like Linus suggested, instead of un-semid?


-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }

+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created



--
All rights reversed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Emmanuel Benisty

Hi Davidlohr,

On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso
davidlohr.bu...@hp.com wrote:
 Specially dropping the rcu read lock before the continue statement
 (sorry for not mentioning this in the last email).

I was missing this indeed, thanks. Still the same issues however...
I'll do some more testing on the same machine but with a totally
different environment, within tomorrow hopefully.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Linus Torvalds

On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel r...@surriel.com wrote:

 Should we use semid here, like Linus suggested, instead of un-semid?

As Davidlohr noted, in linux-next the rcu read-lock is held over the
whole thing, so no, un-semid should be stable once un has been
re-looked-up under the semaphore lock.

In mainline, the problem is that the sem_lock_check() is done with
un-semid *after* we've dropped the RCU read-lock, so un at that
point is not reliable (it could be free'd at any time underneath us).

That said, I really *really* hate what both mainline and linux-next do
with the RCU read lock, and linux-next is arguably worse.

The whole take the RCU lock in one place, and release it in another
is confusing and bug-prone as hell. And linux-next made it worse: now
sem_lock() no longer takes the read-lock (it expects the caller to
take it), but sem_unlock() still drops the read-lock. This is all just
f*cking crazy.

The rule should be that the rcu read-lock is always and released at
the same level. For example, find_alloc_undo() should just be called
with (and unconditionaly return with) the rcu read-lock held, and if
it needs to actually do an allocation, it can drop the rcu lock for
the duration of the allocation.

This whole conditional locking depending on error returns and on
whether we have undo's etc is bug-prone and confusing. And when you
have totally different locking rules for sem_lock() vs
sem_unlock(), you know you're confused.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Davidlohr Bueso

On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote:
> On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
>  wrote:
> > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  
> > wrote:
> >>
> >> Then I start building a random package and the problems start. They
> >> may also happen without compiling but this seems to trigger the bug
> >> quite quickly.
> >
> > I suspect it's about preemption, and the build just results in enough
> > scheduling load that you start hitting whatever race there is.
> >
> >> Anyway, some progress here, I hope: dmesg seems to be
> >> willing to reveal some secrets (using some pastebin service since this
> >> is pretty big):
> >>
> >> https://gist.github.com/anonymous/5275120
> >
> > That looks like exactly the exit_sem() bug that Davidlohr was talking
> > about, where the
> >
> > /* exit_sem raced with IPC_RMID, nothing to do */
> > if (IS_ERR(sma))
> > continue;
> >
> > should be moved to *before* the
> >
> > sem_lock(sma, NULL, -1);
> >
> > call. And apparently the bug I had found is already fixed in -next.
> 
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached) on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

*sigh*. I had high hopes for this being the bug triggering your issue,
specially after seeing exit_sem() in the trace. 

Emmanuel, just to be sure, does your changes reflect the patch below?
Specially dropping the rcu read lock before the continue statement
(sorry for not mentioning this in the last email).

Anyway, this is still a bug. Andrew, the patch below applies to
linux-next, please queue this up if you don't have any objections. 

Thanks,
Davidlohr

---8<---
From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma

In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling
sem_lock() immediately after obtaining sma. However, if sma isn't valid,
then calling sem_lock() will tend to do bad things.

Move the sma error check right after the sem_obtain_object_check() call instead.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;
 
rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}
 
sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }
 
+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created
-- 
1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Emmanuel Benisty

Hi Linus,

On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds
 wrote:
> On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty  
> wrote:
>> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>>
>>> This came from the gcc build?
>>
>> yes, very early in the build process, IIRC this line was repeated a
>> few times and the build just stalled.
>
> Ok, we're bringing out the crazy hacks now.
>
> The attached patch is just insane, doesn't really even work in
> general, and only even compiles on 64-bit. But it should work in
> *practice* to find if somebody adds the same RCU head to the RCU lists
> twice, and ignore the second time it happens (and give a warning that
> hopefully pinpoints the backtrace).
>
> It's ugly. It's broken. It may not work. In other words, I'm not proud
> of it. But you seem to be the only one able to trigger the issue
> easily, willing to try crazy crap, so "tag, you're it". Maybe this
> gives us more information. And maybe it doesn't, and I'm totally wrong
> about the whole "rcu head added twice" theory.

That's all I could get so far:
https://gist.github.com/anonymous/5279255

Losing wireless is generally the start signal of controlled demolition
of the machine.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty  wrote:
> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>
>> This came from the gcc build?
>
> yes, very early in the build process, IIRC this line was repeated a
> few times and the build just stalled.

Ok, we're bringing out the crazy hacks now.

The attached patch is just insane, doesn't really even work in
general, and only even compiles on 64-bit. But it should work in
*practice* to find if somebody adds the same RCU head to the RCU lists
twice, and ignore the second time it happens (and give a warning that
hopefully pinpoints the backtrace).

It's ugly. It's broken. It may not work. In other words, I'm not proud
of it. But you seem to be the only one able to trigger the issue
easily, willing to try crazy crap, so "tag, you're it". Maybe this
gives us more information. And maybe it doesn't, and I'm totally wrong
about the whole "rcu head added twice" theory.

Linus

patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty benist...@gmail.com wrote:
 On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds

 This came from the gcc build?

 yes, very early in the build process, IIRC this line was repeated a
 few times and the build just stalled.

Ok, we're bringing out the crazy hacks now.

The attached patch is just insane, doesn't really even work in
general, and only even compiles on 64-bit. But it should work in
*practice* to find if somebody adds the same RCU head to the RCU lists
twice, and ignore the second time it happens (and give a warning that
hopefully pinpoints the backtrace).

It's ugly. It's broken. It may not work. In other words, I'm not proud
of it. But you seem to be the only one able to trigger the issue
easily, willing to try crazy crap, so tag, you're it. Maybe this
gives us more information. And maybe it doesn't, and I'm totally wrong
about the whole rcu head added twice theory.

Linus


patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Emmanuel Benisty

Hi Linus,

On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty benist...@gmail.com 
 wrote:
 On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds

 This came from the gcc build?

 yes, very early in the build process, IIRC this line was repeated a
 few times and the build just stalled.

 Ok, we're bringing out the crazy hacks now.

 The attached patch is just insane, doesn't really even work in
 general, and only even compiles on 64-bit. But it should work in
 *practice* to find if somebody adds the same RCU head to the RCU lists
 twice, and ignore the second time it happens (and give a warning that
 hopefully pinpoints the backtrace).

 It's ugly. It's broken. It may not work. In other words, I'm not proud
 of it. But you seem to be the only one able to trigger the issue
 easily, willing to try crazy crap, so tag, you're it. Maybe this
 gives us more information. And maybe it doesn't, and I'm totally wrong
 about the whole rcu head added twice theory.

That's all I could get so far:
https://gist.github.com/anonymous/5279255

Losing wireless is generally the start signal of controlled demolition
of the machine.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Davidlohr Bueso

On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote:
 On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com 
  wrote:
 
  Then I start building a random package and the problems start. They
  may also happen without compiling but this seems to trigger the bug
  quite quickly.
 
  I suspect it's about preemption, and the build just results in enough
  scheduling load that you start hitting whatever race there is.
 
  Anyway, some progress here, I hope: dmesg seems to be
  willing to reveal some secrets (using some pastebin service since this
  is pretty big):
 
  https://gist.github.com/anonymous/5275120
 
  That looks like exactly the exit_sem() bug that Davidlohr was talking
  about, where the
 
  /* exit_sem raced with IPC_RMID, nothing to do */
  if (IS_ERR(sma))
  continue;
 
  should be moved to *before* the
 
  sem_lock(sma, NULL, -1);
 
  call. And apparently the bug I had found is already fixed in -next.
 
 I just tried the 7 original patches + the 2 one liners from -next +
 modified Linus' patch (attached) on the top of 3.9-rc4 using
 PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
 above. I was building two packages at the same time, went away for 30
 seconds, came back and everything froze as soon as I touched the
 laptop's touchpad. Maybe a coincidence but anyway... Another shot in
 the dark, I had this weird message when trying to build gcc:
 semop(2): encountered an error: Identifier removed

*sigh*. I had high hopes for this being the bug triggering your issue,
specially after seeing exit_sem() in the trace. 

Emmanuel, just to be sure, does your changes reflect the patch below?
Specially dropping the rcu read lock before the continue statement
(sorry for not mentioning this in the last email).

Anyway, this is still a bug. Andrew, the patch below applies to
linux-next, please queue this up if you don't have any objections. 

Thanks,
Davidlohr

---8---
From: Davidlohr Bueso davidlohr.bu...@hp.com
Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma

In exit_sem() we attempt to acquire the sma-sem_perm.lock by calling
sem_lock() immediately after obtaining sma. However, if sma isn't valid,
then calling sem_lock() will tend to do bad things.

Move the sma error check right after the sem_obtain_object_check() call instead.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/sem.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;
 
rcu_read_lock();
un = list_entry_rcu(ulp-list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}
 
sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid);
-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }
 
+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created
-- 
1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
 wrote:
>> Another shot in
>> the dark, I had this weird message when trying to build gcc:
>> semop(2): encountered an error: Identifier removed
>
> This came from the gcc build?

yes, very early in the build process, IIRC this line was repeated a
few times and the build just stalled.

> That's just crazy. No normal app uses sysv semaphores. I have an older
> gcc build environment, and some grepping shows it has some ipc
> semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
> but that should be very very minor.

> I notice that your dmesg says that your kernel is compiled by
> gcc-4.8.1 prerelease. Is there any chance that you could try to
> install a known-stable gcc, like 4.7.2 or something. It's entirely
> possible that it's a kernel bug even if it's triggered by some more
> aggressive compiler optimization or something, but it would be really
> good to try to see if this might be gcc-specific.

I could build a kernel on another machine on which 4.7.2 is installed,
kernel oops'd as well:
http://i.imgur.com/uk6gmq1.jpg

FWIW, I have a few things disabled in my config so here is the one I
used, maybe I'm missing something (but again, everything works
perfectly with your tree):
https://gist.github.com/anonymous/5275566

Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same
issues, unsurprisingly I guess.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty  wrote:
>
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached)

.. that patch looks fine.

> on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

This came from the gcc build?

That's just crazy. No normal app uses sysv semaphores. I have an older
gcc build environment, and some grepping shows it has some ipc
semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
but that should be very very minor.

You seem to trigger errors really trivially easily, which is really
odd. It's sounding less and less like some subtle race, and more like
the error just happens all the time. If you can make even the gcc
build generate errors, I don't think they can be some rare blue-moon
thing.

I notice that your dmesg says that your kernel is compiled by
gcc-4.8.1 prerelease. Is there any chance that you could try to
install a known-stable gcc, like 4.7.2 or something. It's entirely
possible that it's a kernel bug even if it's triggered by some more
aggressive compiler optimization or something, but it would be really
good to try to see if this might be gcc-specific.

For example, I wonder if your gcc might miscompile idr_alloc() or
something, so that we get the same ID for different ipc objects. That
would certainly potentially cause chaos.

Hmm?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
 wrote:
> On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  wrote:
>>
>> Then I start building a random package and the problems start. They
>> may also happen without compiling but this seems to trigger the bug
>> quite quickly.
>
> I suspect it's about preemption, and the build just results in enough
> scheduling load that you start hitting whatever race there is.
>
>> Anyway, some progress here, I hope: dmesg seems to be
>> willing to reveal some secrets (using some pastebin service since this
>> is pretty big):
>>
>> https://gist.github.com/anonymous/5275120
>
> That looks like exactly the exit_sem() bug that Davidlohr was talking
> about, where the
>
> /* exit_sem raced with IPC_RMID, nothing to do */
> if (IS_ERR(sma))
> continue;
>
> should be moved to *before* the
>
> sem_lock(sma, NULL, -1);
>
> call. And apparently the bug I had found is already fixed in -next.

I just tried the 7 original patches + the 2 one liners from -next +
modified Linus' patch (attached) on the top of 3.9-rc4 using
PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
above. I was building two packages at the same time, went away for 30
seconds, came back and everything froze as soon as I touched the
laptop's touchpad. Maybe a coincidence but anyway... Another shot in
the dark, I had this weird message when trying to build gcc:
semop(2): encountered an error: Identifier removed

patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  wrote:
>
> Then I start building a random package and the problems start. They
> may also happen without compiling but this seems to trigger the bug
> quite quickly.

I suspect it's about preemption, and the build just results in enough
scheduling load that you start hitting whatever race there is.

> Anyway, some progress here, I hope: dmesg seems to be
> willing to reveal some secrets (using some pastebin service since this
> is pretty big):
>
> https://gist.github.com/anonymous/5275120

That looks like exactly the exit_sem() bug that Davidlohr was talking
about, where the

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

should be moved to *before* the

sem_lock(sma, NULL, -1);

call. And apparently the bug I had found is already fixed in -next.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

Hi Davidlohr,

On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso  wrote:
> Not sure which one liner you refer to, but, if you haven't already done
> so, please try with these fixes (queued in linux-next):
>
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31
>
> I've been trying to reproduce your twilight zone problem on five
> different machines now without any luck. Is there anything you're doing
> to trigger the issue? Does the machine boot ok and then do weird things,
> say after X starts, open some program, etc?

I was missing a9cead0, thanks. What I usually do is starting a
standard session, which looks like this:

init-+-5*[agetty]
 |-bash---startx---xinit-+-X---2*[{X}]
 |   `-dwm---sh---sleep
 |-bash---chromium-+-chromium
 | |-chromium-+-chromium
 | |  `-2*[{chromium}]
 |
|-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]]
 | `-65*[{chromium}]
 |-crond
 |-dbus-daemon
 |-klogd
 |-syslogd
 |-tmux-+-alsamixer
 |  |-bash---bash
 |  |-bash
 |  |-htop
 |  `-newsbeuter---{newsbeuter}
 |-udevd
 |-urxvtd-+-bash---pstree
 |`-bash---tmux
 `-wpa_supplicant

Then I start building a random package and the problems start. They
may also happen without compiling but this seems to trigger the bug
quite quickly. Anyway, some progress here, I hope: dmesg seems to be
willing to reveal some secrets (using some pastebin service since this
is pretty big):

https://gist.github.com/anonymous/5275120

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso

On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty  wrote:
> >
> > I had to slightly modify the patch since it wouldn't match the changes
> > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> > hope that was the right thing to do. So, what I tried was: original 7
> > patches + the one liner + your patch blindly modified by me on the top
> > of 3.9-rc4 and I'm still having twilight zone issues.
> 
> Ok, please send your patch so that I can double-check what you did,
> but it was simple enough that you probably did the right thing.
> 
> Sad. Your case definitely looks like a double rcu-free, as shown by
> the fact that when you enabled SLUB debugging the oops happened with
> the use-after-free pattern (it's __rcu_reclaim() doing the
> "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
> has already been free'd once).
> 
> So ipc_rcu_putref() and a refcounting error looked very promising.as a
> potential explanation.
> 
> The 'un' undo structure is also free'd with rcu, but the locking
> around that seems much more robust. The undo entry is on two lists
> (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
> ulp->lock). But those locks are actually tested with
> assert_spin_locked() in all the relevant places, and the code actually
> looks sane. So I had high hopes for ipc_rcu_putref()...
> 
> Hmm. Except for exit_sem() that does odd things. You have preemption
> enabled, don't you? exit_sem() does a lookup of the first list_proc
> entry under tcy_read_lock to lookup un->semid, and then it drops the
> rcu read lock. At which point "un" is no longer reliable, I think. But
> then it still uses "un->semid", rather than the stable value it looked
> up under the rcu read lock. Which looks bogus.
> 
> So I'd like you to test a few more things:
> 
>  (a) In exit_sem(), can you change the
> 
>  sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);
> 
>  to use just "semid" rather than "un->semid", because I don't
> think "un" is stable here.

Well that's not really the case in the new code. We don't drop the rcu
read lock until the end of the loop, in sem_unlock(). However, I just
noticed that we're checking sma for error after trying to acquire
sma->sem_perm.lock:

sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
sem_lock(sma, NULL, -1);

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

The IS_ERR(sma) check should be right after the sem_obtain_object_check() call 
instead.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty  wrote:
>
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Ok, please send your patch so that I can double-check what you did,
but it was simple enough that you probably did the right thing.

Sad. Your case definitely looks like a double rcu-free, as shown by
the fact that when you enabled SLUB debugging the oops happened with
the use-after-free pattern (it's __rcu_reclaim() doing the
"head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
has already been free'd once).

So ipc_rcu_putref() and a refcounting error looked very promising.as a
potential explanation.

The 'un' undo structure is also free'd with rcu, but the locking
around that seems much more robust. The undo entry is on two lists
(sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
ulp->lock). But those locks are actually tested with
assert_spin_locked() in all the relevant places, and the code actually
looks sane. So I had high hopes for ipc_rcu_putref()...

Hmm. Except for exit_sem() that does odd things. You have preemption
enabled, don't you? exit_sem() does a lookup of the first list_proc
entry under tcy_read_lock to lookup un->semid, and then it drops the
rcu read lock. At which point "un" is no longer reliable, I think. But
then it still uses "un->semid", rather than the stable value it looked
up under the rcu read lock. Which looks bogus.

So I'd like you to test a few more things:

 (a) In exit_sem(), can you change the

 sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);

 to use just "semid" rather than "un->semid", because I don't
think "un" is stable here.

 (b) does the problem go away if you change disable CONFIG_PREEMPT
(perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?)

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso

On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote:
> Hi Linus,
> 
> On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
>  wrote:
> > Emmanuel, can you try the attached patch? I think it applies cleanly
> > on top of the scalability series too without any changes, but I didn't
> > check if the patches perhaps changed some of the naming or something.
> 
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Not sure which one liner you refer to, but, if you haven't already done
so, please try with these fixes (queued in linux-next):

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31

I've been trying to reproduce your twilight zone problem on five
different machines now without any luck. Is there anything you're doing
to trigger the issue? Does the machine boot ok and then do weird things,
say after X starts, open some program, etc?

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

Hi Linus,

On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
 wrote:
> Emmanuel, can you try the attached patch? I think it applies cleanly
> on top of the scalability series too without any changes, but I didn't
> check if the patches perhaps changed some of the naming or something.

I had to slightly modify the patch since it wouldn't match the changes
introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
hope that was the right thing to do. So, what I tried was: original 7
patches + the one liner + your patch blindly modified by me on the top
of 3.9-rc4 and I'm still having twilight zone issues.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds
 wrote:
>
> I dunno. I'm still not sure this is triggerable, but it looks bad. But
> both the semaphore case and the msg cases seem to be solvable by
> moving the unlock down, and shm seem to have no getref/putref users to
> race with, so this (whitespace-damaged) patch *may* be sufficient:

Well, the patch doesn't seem to cause any problems, at least neither
lockdep nor spinlock sleep debugging complains. I have no idea whether
it actually fixes any problems, though.

I do wonder if this might explain the problem Emmanuel saw. A double
free of a RCU-freeable object would possibly result in exactly the
kind of mess that Emmanuel reported with the semaphore scalability
patches.

Emmanuel, can you try the attached patch? I think it applies cleanly
on top of the scalability series too without any changes, but I didn't
check if the patches perhaps changed some of the naming or something.

  Linus

patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds
 wrote:
>
> The alternative would be to make sure the thing is always locked (and
> in a rcu-read-safe region) before putref/getref. The only place (apart
> from the initial allocation, which doesn't matter, because nothing can
> see if itf that path fails) seems to be that freeque(), but I didn't
> check everything.
>
> Moving the
>
> msg_unlock(msq);
>
> to the end of freeque() might be the way to go. It's going to cause us
> to hold the lock for longer, but I'm not sure we care for that path.

Uhhuh. I think shm_destroy() has the same pattern. And I think that
one has locking reasons why it has to do the shm_unlock() before
tearing some things down, although I'm not sure..

The good news is that shm doesn't seem to have any users of
ipc_rcu_get/putref(), so I don't see anything to race against. So it
has the same buggy pattern, but I don't think it can trigger anything.

And ipc/sem.c has the same bug-pattern in freeary(). It does
"sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does*
seem to have code that that can race against (sem_lock_and_putref()).
The whole "sem_putref()" tries to be careful and gets the lock for the
last ref, but getting the lock doesn't help when freeary() does the
refcount access without it.

The semaphore case seems to argue fairly strongly for an "atomic_t
refcount", because right now it does a lot of "sem_putref()" stuff
that just gets the lock for the putref. So not only is that
insufficient (if it races against freeary(), but it's also more
expensive than just an atomic refcount would have been.

I dunno. I'm still not sure this is triggerable, but it looks bad. But
both the semaphore case and the msg cases seem to be solvable by
moving the unlock down, and shm seem to have no getref/putref users to
race with, so this (whitespace-damaged) patch *may* be sufficient:

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
expunge_all(msq, -EIDRM);
ss_wakeup(>q_senders, 1);
msg_rmid(ns, msq);
-   msg_unlock(msq);

tmp = msq->q_messages.next;
while (tmp != >q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
atomic_sub(msq->q_cbytes, >msg_bytes);
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
+   msg_unlock(msq);
 }

 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)

/* Remove the semaphore set from the IDR */
sem_rmid(ns, sma);
-   sem_unlock(sma);

wake_up_sem_queue_do();
ns->used_sems -= sma->sem_nsems;
security_sem_free(sma);
ipc_rcu_putref(sma);
+   sem_unlock(sma);
 }

 static unsigned long copy_semid_to_user(void __user *buf, struct
semid64_ds *in, int version)

(I didn't check very carefully, it's possible that we end up having
some locking problem if we move the unlocks down later, but it *looks*
fine)

Anybody?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> RIP: 0010:[]  [] free_msg+0x2b/0x40
> Call Trace:
>  [] freeque+0xcf/0x140
>  [] msgctl_down.constprop.9+0x183/0x200
>  [] sys_msgctl+0x139/0x400
>  [] system_call_fastpath+0x16/0x1b
>
> Looks like seg was already kfree'd.

Hmm.

I have a suspicion.

The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy.

In particular, the refcount is not an atomic variable, so we
absolutely *depend* on the spinlock for it.

However, looking at "freeque()", that's not actually the case. It
releases the message queue spinlock *before* it does the
ipc_rcu_putref(), and it does that because the thing has become
unreachable (it did a msg_rmid(), which will set ->deleted, which in
turn will mean that nobody should successfully look it up any more).

HOWEVER.

While the "deleted" flag is serialized, the actual refcount is not. So
in *parallel* with the freeque() call, we may have some other user
that does something like

...
ipc_rcu_getref(msq);
msg_unlock(msq);
schedule();

ipc_lock_by_ptr(>q_perm);
ipc_rcu_putref(msq);
if (msq->q_perm.deleted) {
err = -EIDRM;
goto out_unlock_free;
}
...

which got the lock for the "deleted" test, so that part is all fine,
but notice the "ipc_rcu_putref()". It can happen at the same time that
freeque() also does its own ipc_rcu_putref().

So now refcount may get buggered, resulting in random early reuse,
double free's or leaking of the msq.

There may be some reason I don't see why this cannot happen, but it
does look suspicious. I wonder if the refcount should be an atomic
value.

The alternative would be to make sure the thing is always locked (and
in a rcu-read-safe region) before putref/getref. The only place (apart
from the initial allocation, which doesn't matter, because nothing can
see if itf that path fails) seems to be that freeque(), but I didn't
check everything.

Moving the

msg_unlock(msq);

to the end of freeque() might be the way to go. It's going to cause us
to hold the lock for longer, but I'm not sure we care for that path.

Guys, am I missing something? This kind of refcounting problem might
well explain the rcu-freeing-time bugs reported with the scalability
patches: long-time race that just got *much* easier to expose with the
higher level of parallelism?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley  wrote:
> On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
>> I think I foud at least one bug in the MSG_COPY stuff: it leaks the
>> "copy" allocation if
>>
>> mode == SEARCH_LESSEQUAL
>>
>> but maybe I'm misreading it.
>
> Yes, you're misreading it. copy_msg() returns the 'copy' address when
> copying is successful. So this patch double-frees 'copy'.

No it doesn't.

The case where "copy_msg()" *is* called and is successful, msg gets
set to "copy", my patch sets "copy" to NULL, so no, it doesn't
double-free.

That said, it looks like if MSG_COPY is set, we cannot trigger the
"mode == SEARCH_LESSEQUAL" case, because it forces msgtyp to 0, which
in turn forces mode = SEARCH_ANY.

So the actual leak case seems to not be possible, although that's
*really* hard to see from looking at the code. The code is just an
unreadable mess.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley

On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
> >
> > Here's an oops I just hit..
> >
> > BUG: unable to handle kernel NULL pointer dereference at 000f
> > IP: [] testmsg.isra.5+0x1a/0x60
> 
> Btw, looking at the code leading up to this, what the f*ck is wrong
> with the IPC stuff?
> 
> It's using the generic list stuff for most of the lists, but then it
> open-codes the accesses.
> 
> So instead of using
> 
>for_each_entry(walk_msg, >q_messages, m_list) {
>   ..
>}
> 
> the ipc/msg.c code does all that by hand, with
> 
>tmp = msq->q_messages.next;
>while (tmp != >q_messages) {
>   struct msg_msg *walk_msg;
> 
>   walk_msg = list_entry(tmp, struct msg_msg, m_list);
>   ...
>   tmp = tmp->next;
>}
> 
> Ugh. The code is near unreadable. And then it has magic memory
> barriers etc, implying that it doesn't lock the data structures, but
> no comments about them. See expunge_all() and pipelined_send().
> 
> The code seems entirely random, and it's badly set up (annoyance of
> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
> you have to spend more effort looking for them, and (b) they won't get
> inlined).
> 
> Clearly nobody has cared for the crazy IPC message code in a long time.

Exactly that's what my patch series does; clean this mess up.

This is what I wrote to Andrew a couple of days ago.

On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
I just figured out how the queue is being corrupted and why my series
> fixes it.
> 
> 
> With MSG_COPY set, the queue scan can exit with the local variable
'msg'
> pointing to a real msg if the msg_counter never reaches the
copy_number.
> 
> The remaining execution looks like this:
> 
>   if (!IS_ERR(msg)) {
>   
>   if (msgflg & MSG_COPY)
>   goto out_unlock;
>   
> 
> out_unlock:
>   msg_unlock(msq);
>   break;
>   }
>   }
>   if (IS_ERR(msg))
>   
> 
>   bufsz = msg_handler();
>   free_msg(msg);  << msg never unlinked
> 
> 
> Since the msg should not have been found (because it failed the match
> criteria), the if (!IS_ERR(msg)) clause should never have executed.
> 
> That's why my refactor fixes resolve this; because msg is not
> inadvertently treated as a found msg.
> 
> But let's be honest; the real bug here is the poor structure of this
> function that even made this possible. The deepest nesting executes a
> goto to a label in the middle of an if clause. Yuck! No wonder this
> thing's fragile.
> 
> So my recommendation still stands. The series that fixes this has been
> getting tested in linux-next for a month. Fixing this some other way
is
> just asking for more trouble.
> 
> But why not just revert MSG_COPY altogether for 3.9?

Regards,
Peter Hurley

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley

On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
> I think I foud at least one bug in the MSG_COPY stuff: it leaks the
> "copy" allocation if
> 
> mode == SEARCH_LESSEQUAL
> 
> but maybe I'm misreading it.

Yes, you're misreading it. copy_msg() returns the 'copy' address when
copying is successful. So this patch double-frees 'copy'.

Andrew has had the patches that fix this for a month but because they're
fairly extensive he didn't want to apply them for 3.9.

Regards,
Peter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>
> Here's an oops I just hit..
>
> BUG: unable to handle kernel NULL pointer dereference at 000f
> IP: [] testmsg.isra.5+0x1a/0x60

Btw, looking at the code leading up to this, what the f*ck is wrong
with the IPC stuff?

It's using the generic list stuff for most of the lists, but then it
open-codes the accesses.

So instead of using

   for_each_entry(walk_msg, >q_messages, m_list) {
  ..
   }

the ipc/msg.c code does all that by hand, with

   tmp = msq->q_messages.next;
   while (tmp != >q_messages) {
  struct msg_msg *walk_msg;

  walk_msg = list_entry(tmp, struct msg_msg, m_list);
  ...
  tmp = tmp->next;
   }

Ugh. The code is near unreadable. And then it has magic memory
barriers etc, implying that it doesn't lock the data structures, but
no comments about them. See expunge_all() and pipelined_send().

The code seems entirely random, and it's badly set up (annoyance of
the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
you have to spend more effort looking for them, and (b) they won't get
inlined).

Clearly nobody has cared for the crazy IPC message code in a long time.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>
> Btw, something that's really bothering me is just how much bogus
> 'follow-on' spew we have lately. I'm not sure what changed, but it
> seems to have gotten worse.

.. we have many more sanity checks, and they tend to trigger in the
case of us killing somebody holding locks etc.

> Related: is there any value in printing out all the ? symbols on
> kernels with frame pointers enabled ?

I've found them useful. You can often see what happened just before.

Of course, I'd still prefer gcc not generate unnecessarily big stack
frames (which is one of the bigger reasons for old stuff shining
through) but as things are, they often give hints about what happened
just before.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > 
 > Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
 > disable it?
 > 
 > I think I foud at least one bug in the MSG_COPY stuff: it leaks the
 > "copy" allocation if
 > 
 > mode == SEARCH_LESSEQUAL
 > 
 > but maybe I'm misreading it. And that shouldn't cause the problem you
 > see, but it's indicative of how badly tested and thought through the
 > MSG_COPY code is.
 > 
 > Totally UNTESTED leak fix appended. Stanislav?

I'll give it a shot.

Btw, something that's really bothering me is just how much bogus
'follow-on' spew we have lately. I'm not sure what changed, but it
seems to have gotten worse.

Here's an oops I just hit..

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [] testmsg.isra.5+0x1a/0x60
 RSP 
CR2: 000f
---[ end trace 8f0713d2aacb6aa3 ]---
BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20
INFO: lockdep is turned off.
Pid: 958, comm: trinity-child20 Tainted: G  D  3.9.0-rc4+ #7
Call Trace:
 [] __might_sleep+0x145/0x200
 [] down_read+0x2a/0xa0
 [] exit_signals+0x24/0x130
 [] do_exit+0xbc/0xd10
 [] ? kmsg_dump+0x1b5/0x230
 [] ? kmsg_dump+0x25/0x230
 [] oops_end+0x9c/0xe0
 [] no_context+0x268/0x275
 [] __bad_area_nosemaphore+0x78/0x1d1
 [] bad_area_nosemaphore+0x13/0x15
 [] __do_page_fault+0x366/0x5b0
 [] ? __lock_acquire+0x2e5/0x1a00
 [] ? trace_hardirqs_off_caller+0x28/0xc0
 [] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [] do_page_fault+0xe/0x10
 [] page_fault+0x22/0x30
 [] ? testmsg.isra.5+0x1a/0x60
 [] ? security_msg_queue_msgrcv+0x16/0x20
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
note: trinity-child20[958] exited with preempt_count 1
BUG: scheduling while atomic: trinity-child20/958/0x1002
INFO: lockdep is turned off.
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without 
 > > incident.
 > > Normally I see the oops within a minute or so.
 > > 
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton 
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [] testmsg.isra.5+0x1a/0x60

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
"copy" allocation if

mode == SEARCH_LESSEQUAL

but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?

 Linus

patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones  wrote:
>
> mainline. Your current tree.

Ok, that's what I thought you were generally testing, just wanted to
verify. The subject kind of implied otherwise..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > >
 > > Looks like seg was already kfree'd.
 > 
 > Just to clarify: is this you testing -mm that has Davidlohr's and
 > Rik's scalability patches? Or mainline?

mainline. Your current tree.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> Looks like seg was already kfree'd.

Just to clarify: is this you testing -mm that has Davidlohr's and
Rik's scalability patches? Or mainline?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without 
 > > incident.
 > > Normally I see the oops within a minute or so.
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton 
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

Now that I have that reverted, I'm not seeing msgrcv traces any more, but 
I've started seeing this..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose 
af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink 
ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda 
ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd 
sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack 
ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth 
snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc 
edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan 
kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 3 
Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte 
Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] free_msg+0x2b/0x40
RSP: 0018:8800a1d3bdd0  EFLAGS: 00010202
RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 
RDX:  RSI:  RDI: 810b6ced
RBP: 8800a1d3bde0 R08:  R09: 0001
R10: 0001 R11: 0001 R12: 88009997e620
R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024
FS:  7f2d7be64740() GS:88012ac0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 
8800a1e62490)
Stack:
 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f
  81c7ace0 8800caf35928 
 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40
Call Trace:
 [] freeque+0xcf/0x140
 [] msgctl_down.constprop.9+0x183/0x200
 [] ? up_read+0x1f/0x40
 [] ? __do_page_fault+0x214/0x5b0
 [] ? lock_release_non_nested+0x23e/0x320
 [] sys_msgctl+0x139/0x400
 [] ? retint_swapgs+0xe/0x13
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x16/0x1b
Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 
20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 <4c> 8b 23 48 89 df e8 
7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 


(Taint is from an ext4 double-free I just reported in a separate thread)

decoded..

   0:   66 66 66 66 90  data32 data32 data32 xchg %ax,%ax
   5:   55  push   %rbp
   6:   48 89 e5mov%rsp,%rbp
   9:   41 54   push   %r12
   b:   49 89 fcmov%rdi,%r12
   e:   53  push   %rbx
   f:   e8 fc 5e 01 00  callq  0x15f10
  14:   49 8b 5c 24 20  mov0x20(%r12),%rbx
  19:   4c 89 e7mov%r12,%rdi
  1c:   e8 8f af ed ff  callq  0xffedafb0
  21:   48 85 dbtest   %rbx,%rbx
  24:   75 05   jne0x2b
  26:   eb 13   jmp0x3b
  28:   4c 89 e3mov%r12,%rbx
  2b:*  4c 8b 23mov(%rbx),%r12 <-- trapping instruction
  2e:   48 89 dfmov%rbx,%rdi
  31:   e8 7a af ed ff  callq  0xffedafb0
  36:   4d 85 e4test   %r12,%r12
  39:   75 ed   jne0x28
  3b:   5b  pop%rbx
  3c:   41 5c   pop%r12
  3e:   5d  pop%rbp
  3f:   c3  retq   

Disassembly of free_msg shows..

seg = msg->next;
kfree(msg);
while (seg != NULL) {
struct msg_msgseg *tmp = seg->next;
 30b:   4c 8b 23mov(%rbx),%r12
kfree(seg);
 30e:   48 89 dfmov%rbx,%rdi
 311:   e8 00 00 00 00  callq  316 


Looks like seg was

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
  On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote:
  
   On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
   
 Whichever way we go, we should get a wiggle on - this has been hanging
 around for too long.  Dave, do you have time to determine whether
 reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger
 than max) fixes things up?
   
   Ok, with that reverted it's been grinding away for a few hours without 
   incident.
   Normally I see the oops within a minute or so.
  
  OK, thanks, I queued a revert:
  
  From: Andrew Morton a...@linux-foundation.org
  Subject: revert ipc: don't allocate a copy larger than max
  
  Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
  during trinity testing.

Now that I have that reverted, I'm not seeing msgrcv traces any more, but 
I've started seeing this..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose 
af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink 
ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda 
ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd 
sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack 
ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth 
snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc 
edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan 
kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 3 
Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte 
Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[812c20fb]  [812c20fb] free_msg+0x2b/0x40
RSP: 0018:8800a1d3bdd0  EFLAGS: 00010202
RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 
RDX:  RSI:  RDI: 810b6ced
RBP: 8800a1d3bde0 R08:  R09: 0001
R10: 0001 R11: 0001 R12: 88009997e620
R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024
FS:  7f2d7be64740() GS:88012ac0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 
8800a1e62490)
Stack:
 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f
  81c7ace0 8800caf35928 
 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40
Call Trace:
 [812c289f] freeque+0xcf/0x140
 [812c2a93] msgctl_down.constprop.9+0x183/0x200
 [810767cf] ? up_read+0x1f/0x40
 [816c8f94] ? __do_page_fault+0x214/0x5b0
 [810b94be] ? lock_release_non_nested+0x23e/0x320
 [812c2da9] sys_msgctl+0x139/0x400
 [816c5d4d] ? retint_swapgs+0xe/0x13
 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0
 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [816cd942] system_call_fastpath+0x16/0x1b
Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 
20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 4c 8b 23 48 89 df e8 
7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 


(Taint is from an ext4 double-free I just reported in a separate thread)

decoded..

   0:   66 66 66 66 90  data32 data32 data32 xchg %ax,%ax
   5:   55  push   %rbp
   6:   48 89 e5mov%rsp,%rbp
   9:   41 54   push   %r12
   b:   49 89 fcmov%rdi,%r12
   e:   53  push   %rbx
   f:   e8 fc 5e 01 00  callq  0x15f10
  14:   49 8b 5c 24 20  mov0x20(%r12),%rbx
  19:   4c 89 e7mov%r12,%rdi
  1c:   e8 8f af ed ff  callq  0xffedafb0
  21:   48 85 dbtest   %rbx,%rbx
  24:   75 05   jne0x2b
  26:   eb 13   jmp0x3b
  28:   4c 89 e3mov%r12,%rbx
  2b:*  4c 8b 23mov(%rbx),%r12 -- trapping instruction
  2e:   48 89 dfmov%rbx,%rdi
  31:   e8 7a af ed ff  callq  0xffedafb0
  36:   4d 85 e4test   %r12,%r12
  39:   75 ed   jne0x28
  3b:   5b  pop%rbx
  3c:   41 5c   pop%r12
  3e:   5d  pop%rbp
  3f:   c3  retq   

Disassembly of free_msg shows..

seg = msg-next;
kfree(msg);
while (seg != NULL) {
struct msg_msgseg *tmp = seg-next;

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote:

 Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 I've started seeing this..

 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

 Looks like seg was already kfree'd.

Just to clarify: is this you testing -mm that has Davidlohr's and
Rik's scalability patches? Or mainline?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote:
  On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote:
  
   Now that I have that reverted, I'm not seeing msgrcv traces any more, but
   I've started seeing this..
  
   general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
  
   Looks like seg was already kfree'd.
  
  Just to clarify: is this you testing -mm that has Davidlohr's and
  Rik's scalability patches? Or mainline?

mainline. Your current tree.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones da...@redhat.com wrote:

 mainline. Your current tree.

Ok, that's what I thought you were generally testing, just wanted to
verify. The subject kind of implied otherwise..

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote:

 Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 I've started seeing this..

 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
copy allocation if

mode == SEARCH_LESSEQUAL

but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?

 Linus


patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
  On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones da...@redhat.com wrote:
  
   On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
   
 Whichever way we go, we should get a wiggle on - this has been hanging
 around for too long.  Dave, do you have time to determine whether
 reverting 88b9e456b1649722673ff (ipc: don't allocate a copy larger
 than max) fixes things up?
   
   Ok, with that reverted it's been grinding away for a few hours without 
   incident.
   Normally I see the oops within a minute or so.
   
  
  OK, thanks, I queued a revert:
  
  From: Andrew Morton a...@linux-foundation.org
  Subject: revert ipc: don't allocate a copy larger than max
  
  Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
  during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [812c24ca] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[812c24ca]  [812c24ca] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [812c3746] do_msgrcv+0x1a6/0x5f0
 [812c2430] ? msg_security+0x10/0x10
 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0
 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [812c3ba5] sys_msgrcv+0x15/0x20
 [816cd942] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a 48 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [812c24ca] testmsg.isra.5+0x1a/0x60

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones

On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote:
  On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote:
  
   Now that I have that reverted, I'm not seeing msgrcv traces any more, but
   I've started seeing this..
  
   general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
  
  Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
  disable it?
  
  I think I foud at least one bug in the MSG_COPY stuff: it leaks the
  copy allocation if
  
  mode == SEARCH_LESSEQUAL
  
  but maybe I'm misreading it. And that shouldn't cause the problem you
  see, but it's indicative of how badly tested and thought through the
  MSG_COPY code is.
  
  Totally UNTESTED leak fix appended. Stanislav?

I'll give it a shot.

Btw, something that's really bothering me is just how much bogus
'follow-on' spew we have lately. I'm not sure what changed, but it
seems to have gotten worse.

Here's an oops I just hit..

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [812c24ca] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[812c24ca]  [812c24ca] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [812c3746] do_msgrcv+0x1a6/0x5f0
 [812c2430] ? msg_security+0x10/0x10
 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0
 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [812c3ba5] sys_msgrcv+0x15/0x20
 [816cd942] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a 48 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [812c24ca] testmsg.isra.5+0x1a/0x60
 RSP 880117bb5e88
CR2: 000f
---[ end trace 8f0713d2aacb6aa3 ]---
BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20
INFO: lockdep is turned off.
Pid: 958, comm: trinity-child20 Tainted: G  D  3.9.0-rc4+ #7
Call Trace:
 [8107dba5] __might_sleep+0x145/0x200
 [816c215a] down_read+0x2a/0xa0
 [8105fa14] exit_signals+0x24/0x130
 [8104b80c] do_exit+0xbc/0xd10
 [81048b25] ? kmsg_dump+0x1b5/0x230
 [81048995] ? kmsg_dump+0x25/0x230
 [816c6b6c] oops_end+0x9c/0xe0
 [816b7a40] no_context+0x268/0x275
 [816b7ac5] __bad_area_nosemaphore+0x78/0x1d1
 [816b7c31] bad_area_nosemaphore+0x13/0x15
 [816c90e6] __do_page_fault+0x366/0x5b0
 [810b72b5] ? __lock_acquire+0x2e5/0x1a00
 [810b3348] ? trace_hardirqs_off_caller+0x28/0xc0
 [8134b3dd] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [816c933e] do_page_fault+0xe/0x10
 [816c5fa2] page_fault+0x22/0x30
 [812c24ca] ? testmsg.isra.5+0x1a/0x60
 [812d80b6] ? security_msg_queue_msgrcv+0x16/0x20
 [812c3746] do_msgrcv+0x1a6/0x5f0
 [812c2430] ? msg_security+0x10/0x10
 [810b6c55] ? trace_hardirqs_on_caller+0x115/0x1a0
 [8134b39e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [812c3ba5] sys_msgrcv+0x15/0x20
 [816cd942]

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote:

 Btw, something that's really bothering me is just how much bogus
 'follow-on' spew we have lately. I'm not sure what changed, but it
 seems to have gotten worse.

.. we have many more sanity checks, and they tend to trigger in the
case of us killing somebody holding locks etc.

 Related: is there any value in printing out all the ? symbols on
 kernels with frame pointers enabled ?

I've found them useful. You can often see what happened just before.

Of course, I'd still prefer gcc not generate unnecessarily big stack
frames (which is one of the bigger reasons for old stuff shining
through) but as things are, they often give hints about what happened
just before.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote:

 Here's an oops I just hit..

 BUG: unable to handle kernel NULL pointer dereference at 000f
 IP: [812c24ca] testmsg.isra.5+0x1a/0x60

Btw, looking at the code leading up to this, what the f*ck is wrong
with the IPC stuff?

It's using the generic list stuff for most of the lists, but then it
open-codes the accesses.

So instead of using

   for_each_entry(walk_msg, msq-q_messages, m_list) {
  ..
   }

the ipc/msg.c code does all that by hand, with

   tmp = msq-q_messages.next;
   while (tmp != msq-q_messages) {
  struct msg_msg *walk_msg;

  walk_msg = list_entry(tmp, struct msg_msg, m_list);
  ...
  tmp = tmp-next;
   }

Ugh. The code is near unreadable. And then it has magic memory
barriers etc, implying that it doesn't lock the data structures, but
no comments about them. See expunge_all() and pipelined_send().

The code seems entirely random, and it's badly set up (annoyance of
the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
you have to spend more effort looking for them, and (b) they won't get
inlined).

Clearly nobody has cared for the crazy IPC message code in a long time.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley

On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
 I think I foud at least one bug in the MSG_COPY stuff: it leaks the
 copy allocation if
 
 mode == SEARCH_LESSEQUAL
 
 but maybe I'm misreading it.

Yes, you're misreading it. copy_msg() returns the 'copy' address when
copying is successful. So this patch double-frees 'copy'.

Andrew has had the patches that fix this for a month but because they're
fairly extensive he didn't want to apply them for 3.9.

Regards,
Peter


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley

On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
 On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones da...@redhat.com wrote:
 
  Here's an oops I just hit..
 
  BUG: unable to handle kernel NULL pointer dereference at 000f
  IP: [812c24ca] testmsg.isra.5+0x1a/0x60
 
 Btw, looking at the code leading up to this, what the f*ck is wrong
 with the IPC stuff?
 
 It's using the generic list stuff for most of the lists, but then it
 open-codes the accesses.
 
 So instead of using
 
for_each_entry(walk_msg, msq-q_messages, m_list) {
   ..
}
 
 the ipc/msg.c code does all that by hand, with
 
tmp = msq-q_messages.next;
while (tmp != msq-q_messages) {
   struct msg_msg *walk_msg;
 
   walk_msg = list_entry(tmp, struct msg_msg, m_list);
   ...
   tmp = tmp-next;
}
 
 Ugh. The code is near unreadable. And then it has magic memory
 barriers etc, implying that it doesn't lock the data structures, but
 no comments about them. See expunge_all() and pipelined_send().
 
 The code seems entirely random, and it's badly set up (annoyance of
 the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
 you have to spend more effort looking for them, and (b) they won't get
 inlined).
 
 Clearly nobody has cared for the crazy IPC message code in a long time.

Exactly that's what my patch series does; clean this mess up.

This is what I wrote to Andrew a couple of days ago.

On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
I just figured out how the queue is being corrupted and why my series
 fixes it.
 
 
 With MSG_COPY set, the queue scan can exit with the local variable
'msg'
 pointing to a real msg if the msg_counter never reaches the
copy_number.
 
 The remaining execution looks like this:
 
   if (!IS_ERR(msg)) {
   
   if (msgflg  MSG_COPY)
   goto out_unlock;
   
 
 out_unlock:
   msg_unlock(msq);
   break;
   }
   }
   if (IS_ERR(msg))
   
 
   bufsz = msg_handler();
   free_msg(msg);   msg never unlinked
 
 
 Since the msg should not have been found (because it failed the match
 criteria), the if (!IS_ERR(msg)) clause should never have executed.
 
 That's why my refactor fixes resolve this; because msg is not
 inadvertently treated as a found msg.
 
 But let's be honest; the real bug here is the poor structure of this
 function that even made this possible. The deepest nesting executes a
 goto to a label in the middle of an if clause. Yuck! No wonder this
 thing's fragile.
 
 So my recommendation still stands. The series that fixes this has been
 getting tested in linux-next for a month. Fixing this some other way
is
 just asking for more trouble.
 
 But why not just revert MSG_COPY altogether for 3.9?

Regards,
Peter Hurley

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley pe...@hurleysoftware.com wrote:
 On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
 I think I foud at least one bug in the MSG_COPY stuff: it leaks the
 copy allocation if

 mode == SEARCH_LESSEQUAL

 but maybe I'm misreading it.

 Yes, you're misreading it. copy_msg() returns the 'copy' address when
 copying is successful. So this patch double-frees 'copy'.

No it doesn't.

The case where copy_msg() *is* called and is successful, msg gets
set to copy, my patch sets copy to NULL, so no, it doesn't
double-free.

That said, it looks like if MSG_COPY is set, we cannot trigger the
mode == SEARCH_LESSEQUAL case, because it forces msgtyp to 0, which
in turn forces mode = SEARCH_ANY.

So the actual leak case seems to not be possible, although that's
*really* hard to see from looking at the code. The code is just an
unreadable mess.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones da...@redhat.com wrote:

 Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 I've started seeing this..

 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 RIP: 0010:[812c20fb]  [812c20fb] free_msg+0x2b/0x40
 Call Trace:
  [812c289f] freeque+0xcf/0x140
  [812c2a93] msgctl_down.constprop.9+0x183/0x200
  [812c2da9] sys_msgctl+0x139/0x400
  [816cd942] system_call_fastpath+0x16/0x1b

 Looks like seg was already kfree'd.

Hmm.

I have a suspicion.

The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy.

In particular, the refcount is not an atomic variable, so we
absolutely *depend* on the spinlock for it.

However, looking at freeque(), that's not actually the case. It
releases the message queue spinlock *before* it does the
ipc_rcu_putref(), and it does that because the thing has become
unreachable (it did a msg_rmid(), which will set -deleted, which in
turn will mean that nobody should successfully look it up any more).

HOWEVER.

While the deleted flag is serialized, the actual refcount is not. So
in *parallel* with the freeque() call, we may have some other user
that does something like

...
ipc_rcu_getref(msq);
msg_unlock(msq);
schedule();

ipc_lock_by_ptr(msq-q_perm);
ipc_rcu_putref(msq);
if (msq-q_perm.deleted) {
err = -EIDRM;
goto out_unlock_free;
}
...


which got the lock for the deleted test, so that part is all fine,
but notice the ipc_rcu_putref(). It can happen at the same time that
freeque() also does its own ipc_rcu_putref().

So now refcount may get buggered, resulting in random early reuse,
double free's or leaking of the msq.

There may be some reason I don't see why this cannot happen, but it
does look suspicious. I wonder if the refcount should be an atomic
value.

The alternative would be to make sure the thing is always locked (and
in a rcu-read-safe region) before putref/getref. The only place (apart
from the initial allocation, which doesn't matter, because nothing can
see if itf that path fails) seems to be that freeque(), but I didn't
check everything.

Moving the

msg_unlock(msq);

to the end of freeque() might be the way to go. It's going to cause us
to hold the lock for longer, but I'm not sure we care for that path.

Guys, am I missing something? This kind of refcounting problem might
well explain the rcu-freeing-time bugs reported with the scalability
patches: long-time race that just got *much* easier to expose with the
higher level of parallelism?

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 The alternative would be to make sure the thing is always locked (and
 in a rcu-read-safe region) before putref/getref. The only place (apart
 from the initial allocation, which doesn't matter, because nothing can
 see if itf that path fails) seems to be that freeque(), but I didn't
 check everything.

 Moving the

 msg_unlock(msq);

 to the end of freeque() might be the way to go. It's going to cause us
 to hold the lock for longer, but I'm not sure we care for that path.

Uhhuh. I think shm_destroy() has the same pattern. And I think that
one has locking reasons why it has to do the shm_unlock() before
tearing some things down, although I'm not sure..

The good news is that shm doesn't seem to have any users of
ipc_rcu_get/putref(), so I don't see anything to race against. So it
has the same buggy pattern, but I don't think it can trigger anything.

And ipc/sem.c has the same bug-pattern in freeary(). It does
sem_unlock(sma) followed by ipc_rcu_putref(sma);, and it *does*
seem to have code that that can race against (sem_lock_and_putref()).
The whole sem_putref() tries to be careful and gets the lock for the
last ref, but getting the lock doesn't help when freeary() does the
refcount access without it.

The semaphore case seems to argue fairly strongly for an atomic_t
refcount, because right now it does a lot of sem_putref() stuff
that just gets the lock for the putref. So not only is that
insufficient (if it races against freeary(), but it's also more
expensive than just an atomic refcount would have been.

I dunno. I'm still not sure this is triggerable, but it looks bad. But
both the semaphore case and the msg cases seem to be solvable by
moving the unlock down, and shm seem to have no getref/putref users to
race with, so this (whitespace-damaged) patch *may* be sufficient:

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
expunge_all(msq, -EIDRM);
ss_wakeup(msq-q_senders, 1);
msg_rmid(ns, msq);
-   msg_unlock(msq);

tmp = msq-q_messages.next;
while (tmp != msq-q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
atomic_sub(msq-q_cbytes, ns-msg_bytes);
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
+   msg_unlock(msq);
 }

 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)

/* Remove the semaphore set from the IDR */
sem_rmid(ns, sma);
-   sem_unlock(sma);

wake_up_sem_queue_do(tasks);
ns-used_sems -= sma-sem_nsems;
security_sem_free(sma);
ipc_rcu_putref(sma);
+   sem_unlock(sma);
 }

 static unsigned long copy_semid_to_user(void __user *buf, struct
semid64_ds *in, int version)

(I didn't check very carefully, it's possible that we end up having
some locking problem if we move the unlocks down later, but it *looks*
fine)

Anybody?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I dunno. I'm still not sure this is triggerable, but it looks bad. But
 both the semaphore case and the msg cases seem to be solvable by
 moving the unlock down, and shm seem to have no getref/putref users to
 race with, so this (whitespace-damaged) patch *may* be sufficient:

Well, the patch doesn't seem to cause any problems, at least neither
lockdep nor spinlock sleep debugging complains. I have no idea whether
it actually fixes any problems, though.

I do wonder if this might explain the problem Emmanuel saw. A double
free of a RCU-freeable object would possibly result in exactly the
kind of mess that Emmanuel reported with the semaphore scalability
patches.

Emmanuel, can you try the attached patch? I think it applies cleanly
on top of the scalability series too without any changes, but I didn't
check if the patches perhaps changed some of the naming or something.

  Linus


patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

Hi Linus,

On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 Emmanuel, can you try the attached patch? I think it applies cleanly
 on top of the scalability series too without any changes, but I didn't
 check if the patches perhaps changed some of the naming or something.

I had to slightly modify the patch since it wouldn't match the changes
introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
hope that was the right thing to do. So, what I tried was: original 7
patches + the one liner + your patch blindly modified by me on the top
of 3.9-rc4 and I'm still having twilight zone issues.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso

On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote:
 Hi Linus,
 
 On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  Emmanuel, can you try the attached patch? I think it applies cleanly
  on top of the scalability series too without any changes, but I didn't
  check if the patches perhaps changed some of the naming or something.
 
 I had to slightly modify the patch since it wouldn't match the changes
 introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
 hope that was the right thing to do. So, what I tried was: original 7
 patches + the one liner + your patch blindly modified by me on the top
 of 3.9-rc4 and I'm still having twilight zone issues.

Not sure which one liner you refer to, but, if you haven't already done
so, please try with these fixes (queued in linux-next):

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31

I've been trying to reproduce your twilight zone problem on five
different machines now without any luck. Is there anything you're doing
to trigger the issue? Does the machine boot ok and then do weird things,
say after X starts, open some program, etc?

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty benist...@gmail.com wrote:

 I had to slightly modify the patch since it wouldn't match the changes
 introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
 hope that was the right thing to do. So, what I tried was: original 7
 patches + the one liner + your patch blindly modified by me on the top
 of 3.9-rc4 and I'm still having twilight zone issues.

Ok, please send your patch so that I can double-check what you did,
but it was simple enough that you probably did the right thing.

Sad. Your case definitely looks like a double rcu-free, as shown by
the fact that when you enabled SLUB debugging the oops happened with
the use-after-free pattern (it's __rcu_reclaim() doing the
head-func(head); thing, and func is 0x6b6b6b6b6b6b6b6b, so head
has already been free'd once).

So ipc_rcu_putref() and a refcounting error looked very promising.as a
potential explanation.

The 'un' undo structure is also free'd with rcu, but the locking
around that seems much more robust. The undo entry is on two lists
(sma-list_id, under sma-sem_perm.lock and ulp-list_proc, under
ulp-lock). But those locks are actually tested with
assert_spin_locked() in all the relevant places, and the code actually
looks sane. So I had high hopes for ipc_rcu_putref()...

Hmm. Except for exit_sem() that does odd things. You have preemption
enabled, don't you? exit_sem() does a lookup of the first list_proc
entry under tcy_read_lock to lookup un-semid, and then it drops the
rcu read lock. At which point un is no longer reliable, I think. But
then it still uses un-semid, rather than the stable value it looked
up under the rcu read lock. Which looks bogus.

So I'd like you to test a few more things:

 (a) In exit_sem(), can you change the

 sma = sem_lock_check(tsk-nsproxy-ipc_ns, un-semid);

 to use just semid rather than un-semid, because I don't
think un is stable here.

 (b) does the problem go away if you change disable CONFIG_PREEMPT
(perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?)

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso

On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote:
 On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty benist...@gmail.com wrote:
 
  I had to slightly modify the patch since it wouldn't match the changes
  introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
  hope that was the right thing to do. So, what I tried was: original 7
  patches + the one liner + your patch blindly modified by me on the top
  of 3.9-rc4 and I'm still having twilight zone issues.
 
 Ok, please send your patch so that I can double-check what you did,
 but it was simple enough that you probably did the right thing.
 
 Sad. Your case definitely looks like a double rcu-free, as shown by
 the fact that when you enabled SLUB debugging the oops happened with
 the use-after-free pattern (it's __rcu_reclaim() doing the
 head-func(head); thing, and func is 0x6b6b6b6b6b6b6b6b, so head
 has already been free'd once).
 
 So ipc_rcu_putref() and a refcounting error looked very promising.as a
 potential explanation.
 
 The 'un' undo structure is also free'd with rcu, but the locking
 around that seems much more robust. The undo entry is on two lists
 (sma-list_id, under sma-sem_perm.lock and ulp-list_proc, under
 ulp-lock). But those locks are actually tested with
 assert_spin_locked() in all the relevant places, and the code actually
 looks sane. So I had high hopes for ipc_rcu_putref()...
 
 Hmm. Except for exit_sem() that does odd things. You have preemption
 enabled, don't you? exit_sem() does a lookup of the first list_proc
 entry under tcy_read_lock to lookup un-semid, and then it drops the
 rcu read lock. At which point un is no longer reliable, I think. But
 then it still uses un-semid, rather than the stable value it looked
 up under the rcu read lock. Which looks bogus.
 
 So I'd like you to test a few more things:
 
  (a) In exit_sem(), can you change the
 
  sma = sem_lock_check(tsk-nsproxy-ipc_ns, un-semid);
 
  to use just semid rather than un-semid, because I don't
 think un is stable here.

Well that's not really the case in the new code. We don't drop the rcu
read lock until the end of the loop, in sem_unlock(). However, I just
noticed that we're checking sma for error after trying to acquire
sma-sem_perm.lock:

sma = sem_obtain_object_check(tsk-nsproxy-ipc_ns, un-semid);
sem_lock(sma, NULL, -1);

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

The IS_ERR(sma) check should be right after the sem_obtain_object_check() call 
instead.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

Hi Davidlohr,

On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
 Not sure which one liner you refer to, but, if you haven't already done
 so, please try with these fixes (queued in linux-next):

 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31

 I've been trying to reproduce your twilight zone problem on five
 different machines now without any luck. Is there anything you're doing
 to trigger the issue? Does the machine boot ok and then do weird things,
 say after X starts, open some program, etc?

I was missing a9cead0, thanks. What I usually do is starting a
standard session, which looks like this:

init-+-5*[agetty]
 |-bash---startx---xinit-+-X---2*[{X}]
 |   `-dwm---sh---sleep
 |-bash---chromium-+-chromium
 | |-chromium-+-chromium
 | |  `-2*[{chromium}]
 |
|-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]]
 | `-65*[{chromium}]
 |-crond
 |-dbus-daemon
 |-klogd
 |-syslogd
 |-tmux-+-alsamixer
 |  |-bash---bash
 |  |-bash
 |  |-htop
 |  `-newsbeuter---{newsbeuter}
 |-udevd
 |-urxvtd-+-bash---pstree
 |`-bash---tmux
 `-wpa_supplicant

Then I start building a random package and the problems start. They
may also happen without compiling but this seems to trigger the bug
quite quickly. Anyway, some progress here, I hope: dmesg seems to be
willing to reveal some secrets (using some pastebin service since this
is pretty big):

https://gist.github.com/anonymous/5275120

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com wrote:

 Then I start building a random package and the problems start. They
 may also happen without compiling but this seems to trigger the bug
 quite quickly.

I suspect it's about preemption, and the build just results in enough
scheduling load that you start hitting whatever race there is.

 Anyway, some progress here, I hope: dmesg seems to be
 willing to reveal some secrets (using some pastebin service since this
 is pretty big):

 https://gist.github.com/anonymous/5275120

That looks like exactly the exit_sem() bug that Davidlohr was talking
about, where the

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

should be moved to *before* the

sem_lock(sma, NULL, -1);

call. And apparently the bug I had found is already fixed in -next.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com wrote:

 Then I start building a random package and the problems start. They
 may also happen without compiling but this seems to trigger the bug
 quite quickly.

 I suspect it's about preemption, and the build just results in enough
 scheduling load that you start hitting whatever race there is.

 Anyway, some progress here, I hope: dmesg seems to be
 willing to reveal some secrets (using some pastebin service since this
 is pretty big):

 https://gist.github.com/anonymous/5275120

 That looks like exactly the exit_sem() bug that Davidlohr was talking
 about, where the

 /* exit_sem raced with IPC_RMID, nothing to do */
 if (IS_ERR(sma))
 continue;

 should be moved to *before* the

 sem_lock(sma, NULL, -1);

 call. And apparently the bug I had found is already fixed in -next.

I just tried the 7 original patches + the 2 one liners from -next +
modified Linus' patch (attached) on the top of 3.9-rc4 using
PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
above. I was building two packages at the same time, went away for 30
seconds, came back and everything froze as soon as I touched the
laptop's touchpad. Maybe a coincidence but anyway... Another shot in
the dark, I had this weird message when trying to build gcc:
semop(2): encountered an error: Identifier removed


patch.diff
Description: Binary data

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds

On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty benist...@gmail.com wrote:

 I just tried the 7 original patches + the 2 one liners from -next +
 modified Linus' patch (attached)

.. that patch looks fine.

 on the top of 3.9-rc4 using
 PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
 above. I was building two packages at the same time, went away for 30
 seconds, came back and everything froze as soon as I touched the
 laptop's touchpad. Maybe a coincidence but anyway... Another shot in
 the dark, I had this weird message when trying to build gcc:
 semop(2): encountered an error: Identifier removed

This came from the gcc build?

That's just crazy. No normal app uses sysv semaphores. I have an older
gcc build environment, and some grepping shows it has some ipc
semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
but that should be very very minor.

You seem to trigger errors really trivially easily, which is really
odd. It's sounding less and less like some subtle race, and more like
the error just happens all the time. If you can make even the gcc
build generate errors, I don't think they can be some rare blue-moon
thing.

I notice that your dmesg says that your kernel is compiled by
gcc-4.8.1 prerelease. Is there any chance that you could try to
install a known-stable gcc, like 4.7.2 or something. It's entirely
possible that it's a kernel bug even if it's triggered by some more
aggressive compiler optimization or something, but it would be really
good to try to see if this might be gcc-specific.

For example, I wonder if your gcc might miscompile idr_alloc() or
something, so that we get the same ID for different ipc objects. That
would certainly potentially cause chaos.

Hmm?

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty

On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 Another shot in
 the dark, I had this weird message when trying to build gcc:
 semop(2): encountered an error: Identifier removed

 This came from the gcc build?

yes, very early in the build process, IIRC this line was repeated a
few times and the build just stalled.

 That's just crazy. No normal app uses sysv semaphores. I have an older
 gcc build environment, and some grepping shows it has some ipc
 semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
 but that should be very very minor.

 I notice that your dmesg says that your kernel is compiled by
 gcc-4.8.1 prerelease. Is there any chance that you could try to
 install a known-stable gcc, like 4.7.2 or something. It's entirely
 possible that it's a kernel bug even if it's triggered by some more
 aggressive compiler optimization or something, but it would be really
 good to try to see if this might be gcc-specific.

I could build a kernel on another machine on which 4.7.2 is installed,
kernel oops'd as well:
http://i.imgur.com/uk6gmq1.jpg

FWIW, I have a few things disabled in my config so here is the one I
used, maybe I'm missing something (but again, everything works
perfectly with your tree):
https://gist.github.com/anonymous/5275566

Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same
issues, unsurprisingly I guess.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Andrew Morton

On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:

> On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
> 
>  > Whichever way we go, we should get a wiggle on - this has been hanging
>  > around for too long.  Dave, do you have time to determine whether
>  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
>  > than max") fixes things up?
> 
> Ok, with that reverted it's been grinding away for a few hours without 
> incident.
> Normally I see the oops within a minute or so.
> 

OK, thanks, I queued a revert:

From: Andrew Morton 
Subject: revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 
---

 ipc/msg.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max ipc/msg.c
--- a/ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max
+++ a/ipc/msg.c
@@ -820,17 +820,15 @@ long do_msgrcv(int msqid, void __user *b
struct msg_msg *copy = NULL;
unsigned long copy_number = 0;
 
-   ns = current->nsproxy->ipc_ns;
-
if (msqid < 0 || (long) bufsz < 0)
return -EINVAL;
if (msgflg & MSG_COPY) {
-   copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax),
-   msgflg, , _number);
+   copy = prepare_copy(buf, bufsz, msgflg, , _number);
if (IS_ERR(copy))
return PTR_ERR(copy);
}
mode = convert_mode(, msgflg);
+   ns = current->nsproxy->ipc_ns;
 
msq = msg_lock_check(ns, msqid);
if (IS_ERR(msq)) {
_

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Dave Jones

On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:

 > Whichever way we go, we should get a wiggle on - this has been hanging
 > around for too long.  Dave, do you have time to determine whether
 > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > than max") fixes things up?

Ok, with that reverted it's been grinding away for a few hours without incident.
Normally I see the oops within a minute or so.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Andrew Morton

On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso  
wrote:

> On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> >  wrote:
> > > And you never see this problem without Rik's patches?
> > 
> > No, never.
> > 
> > > Could you bisect
> > > *which* patch it starts with? Are the first four ones ok (the moving
> > > of the locking around, but without the fine-grained ones), for
> > > example?
> > 
> > With the first four patches only, I got some X server freeze (just tried 
> > once).
> 
> Going over the code again, I found a potential recursive spinlock scenario. 
> Andrew, if you have no objections, please queue this up.
> 
> Thanks.
> 
> ---8<---
> 
> From: Davidlohr Bueso 
> Subject: [PATCH] ipc, sem: prevent possible deadlock
> 
> In semctl_main(), when cmd == GETALL, we're locking
> sma->sem_perm.lock (through sem_lock_and_putref), yet
> after the conditional, we lock it again.
> Unlock sma right after exiting the conditional.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
>  ipc/sem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 1a2913d..f257afe 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
> semid, int semnum,
>   err = -EIDRM;
>   goto out_free;
>   }
> + sem_unlock(sma, -1);
>   }
>  
>   sem_lock(sma, NULL, -1);

Looks right. 

Do we need the locking at all?  What does it actually do?

sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma, -1);
err = -EIDRM;
goto out_free;
}
sem_unlock(sma, -1);

We're taking the lock, testing an int and then dropping the lock. 
What's the point in that?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel


On 03/26/2013 02:07 PM, Sasha Levin wrote:

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:

On 03/20/2013 03:55 PM, Rik van Riel wrote:

This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.


Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }

When I've looked at that it seems that not releasing the read lock was (very)
intentional.


This logic was from the original code, which I also found to be quite
confusing.


I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.



After that, the only code path that would release the lock starts with:

 if (un) {
...

So we won't release the lock at all if un is NULL?



Not necessarily, we do release everything at the end of the function:

out_unlock_free:
sem_unlock(sma, locknum);


Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?


It is uglier than you think...

On success, find_alloc_undo will call rcu_read_lock, so we have the
rcu_read_lock held twice :(

Some of the ipc code is quite ugly, but making too many large changes
at once is just asking for trouble. I suspect we're going to have to
untangle this one bit at a time...


--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel

On 03/26/2013 01:59 PM, Davidlohr Bueso wrote:

From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso 

Reviewed-by: Rik van Riel 

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin

On 03/20/2013 03:55 PM, Rik van Riel wrote:
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

rcu_read_lock();
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
if (un)
rcu_read_unlock();
error = PTR_ERR(sma);
goto out_free;
}

When I've looked at that it seems that not releasing the read lock was (very)
intentional.

After that, the only code path that would release the lock starts with:

if (un) {
...

So we won't release the lock at all if un is NULL?

Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:
> On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
>> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>>> This series makes the sysv semaphore code more scalable,
>>> by reducing the time the semaphore lock is held, and making
>>> the locking more scalable for semaphore arrays with multiple
>>> semaphores.
>>
>> Hi Rik,
>>
>> Another issue that came up is:
>>
>> [   96.347341] 
>> [   96.348085] [ BUG: lock held when returning to user space! ]
>> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G  
>>   W
>> [   96.360300] 
>> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
>> held!
>> [   96.362019] 1 lock held by trinity-child9/7583:
>> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
>> SYSC_semtimedop+0x1fb/0xec0
>>
>> It seems that we can leave semtimedop without releasing the rcu read lock.
>>
>> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
>> lock, it
>> seems that this behaviour is actually intentional?
>>
>> rcu_read_lock();
>> sma = sem_obtain_object_check(ns, semid);
>> if (IS_ERR(sma)) {
>> if (un)
>> rcu_read_unlock();
>> error = PTR_ERR(sma);
>> goto out_free;
>> }
>>
>> When I've looked at that it seems that not releasing the read lock was (very)
>> intentional.
> 
> This logic was from the original code, which I also found to be quite
> confusing.

I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.

>>
>> After that, the only code path that would release the lock starts with:
>>
>> if (un) {
>>  ...
>>
>> So we won't release the lock at all if un is NULL?
>>
> 
> Not necessarily, we do release everything at the end of the function: 
> 
> out_unlock_free:
>   sem_unlock(sma, locknum);

Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?

if (un->semid == -1) {
rcu_read_unlock();
goto out_unlock_free;
}
[...]
out_unlock_free:
sem_unlock(sma, locknum);


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso

On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
>  wrote:
> > And you never see this problem without Rik's patches?
> 
> No, never.
> 
> > Could you bisect
> > *which* patch it starts with? Are the first four ones ok (the moving
> > of the locking around, but without the fine-grained ones), for
> > example?
> 
> With the first four patches only, I got some X server freeze (just tried 
> once).

Going over the code again, I found a potential recursive spinlock scenario. 
Andrew, if you have no objections, please queue this up.

Thanks.

---8<---

From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 1a2913d..f257afe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
err = -EIDRM;
goto out_free;
}
+   sem_unlock(sma, -1);
}
 
sem_lock(sma, NULL, -1);
-- 
1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Paul E. McKenney

On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] 
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
>  W
> [   96.360300] 
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
> held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
> SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
> lock, it
> seems that this behaviour is actually intentional?
> 
> rcu_read_lock();
> sma = sem_obtain_object_check(ns, semid);
> if (IS_ERR(sma)) {
> if (un)
> rcu_read_unlock();
> error = PTR_ERR(sma);
> goto out_free;
> }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.
> 
> After that, the only code path that would release the lock starts with:
> 
> if (un) {
>   ...
> 
> So we won't release the lock at all if un is NULL?

Intentions notwithstanding, it is absolutely required to exit any and
all RCU read-side critical sections prior to going into user mode.

I suggest removing the "if (un)".

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] 
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
>  W
> [   96.360300] 
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
> held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
> SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
> lock, it
> seems that this behaviour is actually intentional?
> 
> rcu_read_lock();
> sma = sem_obtain_object_check(ns, semid);
> if (IS_ERR(sma)) {
> if (un)
> rcu_read_unlock();
> error = PTR_ERR(sma);
> goto out_free;
> }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.

This logic was from the original code, which I also found to be quite
confusing.

> 
> After that, the only code path that would release the lock starts with:
> 
> if (un) {
>   ...
> 
> So we won't release the lock at all if un is NULL?
> 

Not necessarily, we do release everything at the end of the function: 

out_unlock_free:
sem_unlock(sma, locknum);

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
 On 03/20/2013 03:55 PM, Rik van Riel wrote:
  This series makes the sysv semaphore code more scalable,
  by reducing the time the semaphore lock is held, and making
  the locking more scalable for semaphore arrays with multiple
  semaphores.
 
 Hi Rik,
 
 Another issue that came up is:
 
 [   96.347341] 
 [   96.348085] [ BUG: lock held when returning to user space! ]
 [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
  W
 [   96.360300] 
 [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
 held!
 [   96.362019] 1 lock held by trinity-child9/7583:
 [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
 SYSC_semtimedop+0x1fb/0xec0
 
 It seems that we can leave semtimedop without releasing the rcu read lock.
 
 I'm a bit confused by what's going on in semtimedop with regards to rcu read 
 lock, it
 seems that this behaviour is actually intentional?
 
 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }
 
 When I've looked at that it seems that not releasing the read lock was (very)
 intentional.

This logic was from the original code, which I also found to be quite
confusing.

 
 After that, the only code path that would release the lock starts with:
 
 if (un) {
   ...
 
 So we won't release the lock at all if un is NULL?
 

Not necessarily, we do release everything at the end of the function: 

out_unlock_free:
sem_unlock(sma, locknum);

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Paul E. McKenney

On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote:
 On 03/20/2013 03:55 PM, Rik van Riel wrote:
  This series makes the sysv semaphore code more scalable,
  by reducing the time the semaphore lock is held, and making
  the locking more scalable for semaphore arrays with multiple
  semaphores.
 
 Hi Rik,
 
 Another issue that came up is:
 
 [   96.347341] 
 [   96.348085] [ BUG: lock held when returning to user space! ]
 [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
  W
 [   96.360300] 
 [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
 held!
 [   96.362019] 1 lock held by trinity-child9/7583:
 [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
 SYSC_semtimedop+0x1fb/0xec0
 
 It seems that we can leave semtimedop without releasing the rcu read lock.
 
 I'm a bit confused by what's going on in semtimedop with regards to rcu read 
 lock, it
 seems that this behaviour is actually intentional?
 
 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }
 
 When I've looked at that it seems that not releasing the read lock was (very)
 intentional.
 
 After that, the only code path that would release the lock starts with:
 
 if (un) {
   ...
 
 So we won't release the lock at all if un is NULL?

Intentions notwithstanding, it is absolutely required to exit any and
all RCU read-side critical sections prior to going into user mode.

I suggest removing the if (un).

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso

On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
 On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  And you never see this problem without Rik's patches?
 
 No, never.
 
  Could you bisect
  *which* patch it starts with? Are the first four ones ok (the moving
  of the locking around, but without the fine-grained ones), for
  example?
 
 With the first four patches only, I got some X server freeze (just tried 
 once).

Going over the code again, I found a potential recursive spinlock scenario. 
Andrew, if you have no objections, please queue this up.

Thanks.

---8---

From: Davidlohr Bueso davidlohr.bu...@hp.com
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma-sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/sem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 1a2913d..f257afe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
err = -EIDRM;
goto out_free;
}
+   sem_unlock(sma, -1);
}
 
sem_lock(sma, NULL, -1);
-- 
1.7.11.7



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:
 On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
 On 03/20/2013 03:55 PM, Rik van Riel wrote:
 This series makes the sysv semaphore code more scalable,
 by reducing the time the semaphore lock is held, and making
 the locking more scalable for semaphore arrays with multiple
 semaphores.

 Hi Rik,

 Another issue that came up is:

 [   96.347341] 
 [   96.348085] [ BUG: lock held when returning to user space! ]
 [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G  
   W
 [   96.360300] 
 [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
 held!
 [   96.362019] 1 lock held by trinity-child9/7583:
 [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
 SYSC_semtimedop+0x1fb/0xec0

 It seems that we can leave semtimedop without releasing the rcu read lock.

 I'm a bit confused by what's going on in semtimedop with regards to rcu read 
 lock, it
 seems that this behaviour is actually intentional?

 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }

 When I've looked at that it seems that not releasing the read lock was (very)
 intentional.
 
 This logic was from the original code, which I also found to be quite
 confusing.

I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.


 After that, the only code path that would release the lock starts with:

 if (un) {
  ...

 So we won't release the lock at all if un is NULL?

 
 Not necessarily, we do release everything at the end of the function: 
 
 out_unlock_free:
   sem_unlock(sma, locknum);

Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?

if (un-semid == -1) {
rcu_read_unlock();
goto out_unlock_free;
}
[...]
out_unlock_free:
sem_unlock(sma, locknum);


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin

On 03/20/2013 03:55 PM, Rik van Riel wrote:
 This series makes the sysv semaphore code more scalable,
 by reducing the time the semaphore lock is held, and making
 the locking more scalable for semaphore arrays with multiple
 semaphores.

Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

rcu_read_lock();
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
if (un)
rcu_read_unlock();
error = PTR_ERR(sma);
goto out_free;
}

When I've looked at that it seems that not releasing the read lock was (very)
intentional.

After that, the only code path that would release the lock starts with:

if (un) {
...

So we won't release the lock at all if un is NULL?


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel

On 03/26/2013 01:59 PM, Davidlohr Bueso wrote:

From: Davidlohr Bueso davidlohr.bu...@hp.com
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma-sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com

Reviewed-by: Rik van Riel r...@redhat.com

--
All rights reversed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel


On 03/26/2013 02:07 PM, Sasha Levin wrote:

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:

On 03/20/2013 03:55 PM, Rik van Riel wrote:

This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.


Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }

When I've looked at that it seems that not releasing the read lock was (very)
intentional.


This logic was from the original code, which I also found to be quite
confusing.


I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.



After that, the only code path that would release the lock starts with:

 if (un) {
...

So we won't release the lock at all if un is NULL?



Not necessarily, we do release everything at the end of the function:

out_unlock_free:
sem_unlock(sma, locknum);


Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?


It is uglier than you think...

On success, find_alloc_undo will call rcu_read_lock, so we have the
rcu_read_lock held twice :(

Some of the ipc code is quite ugly, but making too many large changes
at once is just asking for trouble. I suspect we're going to have to
untangle this one bit at a time...


--
All rights reversed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Andrew Morton

On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso davidlohr.bu...@hp.com 
wrote:

 On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
  On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
  torva...@linux-foundation.org wrote:
   And you never see this problem without Rik's patches?
  
  No, never.
  
   Could you bisect
   *which* patch it starts with? Are the first four ones ok (the moving
   of the locking around, but without the fine-grained ones), for
   example?
  
  With the first four patches only, I got some X server freeze (just tried 
  once).
 
 Going over the code again, I found a potential recursive spinlock scenario. 
 Andrew, if you have no objections, please queue this up.
 
 Thanks.
 
 ---8---
 
 From: Davidlohr Bueso davidlohr.bu...@hp.com
 Subject: [PATCH] ipc, sem: prevent possible deadlock
 
 In semctl_main(), when cmd == GETALL, we're locking
 sma-sem_perm.lock (through sem_lock_and_putref), yet
 after the conditional, we lock it again.
 Unlock sma right after exiting the conditional.
 
 Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
 ---
  ipc/sem.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/ipc/sem.c b/ipc/sem.c
 index 1a2913d..f257afe 100644
 --- a/ipc/sem.c
 +++ b/ipc/sem.c
 @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
 semid, int semnum,
   err = -EIDRM;
   goto out_free;
   }
 + sem_unlock(sma, -1);
   }
  
   sem_lock(sma, NULL, -1);

Looks right. 

Do we need the locking at all?  What does it actually do?

sem_lock_and_putref(sma);
if (sma-sem_perm.deleted) {
sem_unlock(sma, -1);
err = -EIDRM;
goto out_free;
}
sem_unlock(sma, -1);

We're taking the lock, testing an int and then dropping the lock. 
What's the point in that?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 158 matches

Mail list logo