Re: svn commit: r242014 - head/sys/kern

2012-10-25 Thread Andre Oppermann

On 25.10.2012 05:49, Bruce Evans wrote:

On Wed, 24 Oct 2012, Attilio Rao wrote:


On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann an...@freebsd.org wrote:

...
Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

 1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place


This is wrong because it doesn't give padding.


Unless it is sprinkled in struct declarations.


 2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
the future possibly change to a different compiler dependent
align attribute


What is this macro supposed to do? I don't understand that from your
description.


 3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
automatically gets aligned in all cases, even when dynamically
allocated.


This works but I think it is overkill for structures including sleep
mutexes, which are the vast majority.  So I certainly wouldn't be in
favor of such a patch.


This doesn't work either with fully dynamic (auto) allocations.  Stack
alignment is generally broken (limited, and pessimized for both space
and time) in gcc (it works better in clang).  On amd64, it is limited
by the default of -mpreferred-stack-boundary=4.  Since 2**4 is smaller
than the cache line size and stack alignments larger than it are broken
in gcc, __aligned(CACHE_LINE_SIZE) never works (except accidentally,
16/CACHE_LINE_SIZE of the time).  On i386, we reduce the space/time
pessimizations a little by overriding the default to
-mpreferred-stack-boundary=2.  2**2 is even smaller than the cache
line size.  (The pessimizations are for both space and time, since
time and code space is wasted for the code to keep the stack aligned,
and cache space and thus also time are wasted for padding.  Most
functions don't benefit from more than sizeof(register_t) alignment.)


I'm not aware of stack-allocated mutexes anywhere in the kernel.
Even if there is such a case, it's very special and unique.

I've verified that __aligned(CACHE_LINE_SIZE) on the definition of
struct mtx itself (in sys/_mutex.h) correctly aligns and pads the
global .bss resident mutexes for 64B and 128B cache line sizes.
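
Roughly what was tested, as a minimal sketch (struct mtx is simplified
here, CACHE_LINE_SIZE is assumed to be 64, and __aligned() is spelled
out so the sketch stands alone):

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64                  /* assumed for this sketch */
    #define __aligned(x)    __attribute__((__aligned__(x)))

    struct mtx {
            volatile uintptr_t mtx_lock;        /* owner + flag bits (simplified) */
    } __aligned(CACHE_LINE_SIZE);

    /* The attribute on the type rounds sizeof(struct mtx) up to a
     * multiple of the alignment, so global and array instances each
     * occupy whole cache lines. */
    _Static_assert(sizeof(struct mtx) % CACHE_LINE_SIZE == 0,
        "struct mtx padded to a full cache line");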


Dynamic allocations via malloc() get whatever alignment malloc() gives.
This is only required to be 4 or 8 or 16 or so (the maximum for a C
object declared in conforming C, with no __aligned()), but malloc() usually
gives more.  If it gives CACHE_LINE_SIZE, that is wasteful for most
small allocations.


Stand-alone mutexes are normally not malloc'ed.  They're always
embedded into some larger structure they protect.


__builtin_alloca() is broken in gcc-3.3.3, but works in gcc-4.2.1, at
least on i386.  In gcc-3.3.3, it assumes that the stack is the default
16-byte aligned even if -mpreferred-stack-boundary=2 is in CFLAGS to
say otherwise, and just subtracts from the stack pointer.  In gcc-4.2.1,
it does the necessary andl of the stack pointer, but only 16-byte
alignment.

It is another bug that there are no alignment extensions of malloc() or alloca().
Since malloc() is in the library and may give CACHE_LINE_SIZE but
__builtin_alloca() is in the compiler and only gives 16, these functions
are not even as compatible as they should be.

I don't know of any mutexes allocated on the stack, but there are stack
frames with mcontexts in them that need special alignment so they cause
problems on i386.  They can't just be put on the stack due to the above
bugs. They are laboriously allocated using malloc().  Since they are
quite large (1 mcontext barely fits on the kernel stack), kib didn't
like my alloca() method for allocating them.


You lost me here.

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-25 Thread Alan Cox

On 10/25/2012 11:23, Andre Oppermann wrote:

On 25.10.2012 05:49, Bruce Evans wrote:

On Wed, 24 Oct 2012, Attilio Rao wrote:

On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann an...@freebsd.org 
wrote:

...
Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

 1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place


This is wrong because it doesn't give padding.


Unless it is sprinkled in struct declarations.


 2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
the future possibly change to a different compiler dependent
align attribute


What is this macro supposed to do? I don't understand that from your
description.


 3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
automatically gets aligned in all cases, even when dynamically
allocated.


This works but I think it is overkill for structures including sleep
mutexes, which are the vast majority.  So I certainly wouldn't be in
favor of such a patch.


This doesn't work either with fully dynamic (auto) allocations.  Stack
alignment is generally broken (limited, and pessimized for both space
and time) in gcc (it works better in clang).  On amd64, it is limited
by the default of -mpreferred-stack-boundary=4.  Since 2**4 is smaller
than the cache line size and stack alignments larger than it are broken
in gcc, __aligned(CACHE_LINE_SIZE) never works (except accidentally,
16/CACHE_LINE_SIZE of the time).  On i386, we reduce the space/time
pessimizations a little by overriding the default to
-mpreferred-stack-boundary=2.  2**2 is even smaller than the cache
line size.  (The pessimizations are for both space and time, since
time and code space is wasted for the code to keep the stack aligned,
and cache space and thus also time are wasted for padding.  Most
functions don't benefit from more than sizeof(register_t) alignment.)


I'm not aware of stack-allocated mutexes anywhere in the kernel.
Even if there is such a case, it's very special and unique.

I've verified that __aligned(CACHE_LINE_SIZE) on the definition of
struct mtx itself (in sys/_mutex.h) correctly aligns and pads the
global .bss resident mutexes for 64B and 128B cache line sizes.



Padding every mutex is going to have a non-trivial effect on the size of 
some dynamically allocated structures containing locks, like the vm 
object and the vnode.  Moreover, the effect of this padding will be the 
greatest on address-space-limited systems, like i386, where the size of 
a vm object is only about 130 bytes.
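
(Rough arithmetic, with assumed sizes: if struct mtx is about 20 bytes
on i386 and gets padded out to the current 128-byte CACHE_LINE_SIZE, a
~130-byte vm object grows by roughly 108 bytes to about 240 bytes --
nearly doubling its memory and cache footprint.)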



Dynamic allocations via malloc() get whatever alignment malloc() gives.
This is only required to be 4 or 8 or 16 or so (the maximum for a C
object declared in conforming C, with no __aligned()), but malloc() usually
gives more.  If it gives CACHE_LINE_SIZE, that is wasteful for most
small allocations.


Stand-alone mutexes are normally not malloc'ed.  They're always
embedded into some larger structure they protect.


__builtin_alloca() is broken in gcc-3.3.3, but works in gcc-4.2.1, at
least on i386.  In gcc-3.3.3, it assumes that the stack is the default
16-byte aligned even if -mpreferred-stack-boundary=2 is in CFLAGS to
say otherwise, and just subtracts from the stack pointer.  In gcc-4.2.1,
it does the necessary andl of the stack pointer, but only 16-byte
alignment.

It is another bug that there are no alignment extensions of malloc() or alloca().
Since malloc() is in the library and may give CACHE_LINE_SIZE but
__builtin_alloca() is in the compiler and only gives 16, these functions
are not even as compatible as they should be.

I don't know of any mutexes allocated on the stack, but there are stack
frames with mcontexts in them that need special alignment so they cause
problems on i386.  They can't just be put on the stack due to the above
bugs. They are laboriously allocated using malloc().  Since they are
quite large (1 mcontext barely fits on the kernel stack), kib didn't
like my alloca() method for allocating them.


You lost me here.





svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jim Harris
Author: jimharris
Date: Wed Oct 24 18:36:41 2012
New Revision: 242014
URL: http://svn.freebsd.org/changeset/base/242014

Log:
  Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
  
  This enables CPU searches (which read tdq_load) to operate independently
  of any contention on the spinlock.  Some scheduler-intensive workloads
  running on an 8C single-socket SNB Xeon show considerable improvement with
  this change (2-3% perf improvement, 5-6% decrease in CPU util).
  
  Sponsored by: Intel
  Reviewed by:  jeff

Modified:
  head/sys/kern/sched_ule.c

Modified: head/sys/kern/sched_ule.c
==
--- head/sys/kern/sched_ule.c   Wed Oct 24 18:33:44 2012(r242013)
+++ head/sys/kern/sched_ule.c   Wed Oct 24 18:36:41 2012(r242014)
@@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
  * locking in sched_pickcpu();
  */
 struct tdq {
-   /* Ordered to improve efficiency of cpu_search() and switch(). */
+   /* 
+* Ordered to improve efficiency of cpu_search() and switch().
+* tdq_lock is padded to avoid false sharing with tdq_load and
+* tdq_cpu_idle.
+*/
struct mtx  tdq_lock;   /* run queue lock. */
+   char        pad[64 - sizeof(struct mtx)];
struct cpu_group *tdq_cg;   /* Pointer to cpu topology. */
volatile int    tdq_load;   /* Aggregate load. */
volatile int    tdq_cpu_idle;   /* cpu_idle() is active. */


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Adrian Chadd
On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

Ok, but..


 struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];

.. don't we have an existing compile time macro for the cache line
size, which can be used here?



Adrian


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread John Baldwin
On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:
 Author: jimharris
 Date: Wed Oct 24 18:36:41 2012
 New Revision: 242014
 URL: http://svn.freebsd.org/changeset/base/242014
 
 Log:
   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
   
   This enables CPU searches (which read tdq_load) to operate independently
   of any contention on the spinlock.  Some scheduler-intensive workloads
   running on an 8C single-socket SNB Xeon show considerable improvement with
   this change (2-3% perf improvement, 5-6% decrease in CPU util).
   
   Sponsored by:   Intel
   Reviewed by:jeff
 
 Modified:
   head/sys/kern/sched_ule.c
 
 Modified: head/sys/kern/sched_ule.c
 
==
 --- head/sys/kern/sched_ule.c Wed Oct 24 18:33:44 2012(r242013)
 +++ head/sys/kern/sched_ule.c Wed Oct 24 18:36:41 2012(r242014)
 @@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
   * locking in sched_pickcpu();
   */
  struct tdq {
 - /* Ordered to improve efficiency of cpu_search() and switch(). */
 + /* 
 +  * Ordered to improve efficiency of cpu_search() and switch().
 +  * tdq_lock is padded to avoid false sharing with tdq_load and
 +  * tdq_cpu_idle.
 +  */
   struct mtx  tdq_lock;   /* run queue lock. */
 + char        pad[64 - sizeof(struct mtx)];

Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?

-- 
John Baldwin


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jim Harris
On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:
 On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

 Ok, but..


 struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];

 .. don't we have an existing compile time macro for the cache line
 size, which can be used here?

Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to
keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
NetBurst-based processors having 128-byte cache sectors a while back.
I had planned to start a separate thread on arch@ about this today on
whether this was still appropriate.



 Adrian


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jim Harris
On Wed, Oct 24, 2012 at 11:43 AM, John Baldwin j...@freebsd.org wrote:
 On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:
 Author: jimharris
 Date: Wed Oct 24 18:36:41 2012
 New Revision: 242014
 URL: http://svn.freebsd.org/changeset/base/242014

 Log:
   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

   This enables CPU searches (which read tdq_load) to operate independently
   of any contention on the spinlock.  Some scheduler-intensive workloads
   running on an 8C single-socket SNB Xeon show considerable improvement with
   this change (2-3% perf improvement, 5-6% decrease in CPU util).

   Sponsored by:   Intel
   Reviewed by:jeff

 Modified:
   head/sys/kern/sched_ule.c

 Modified: head/sys/kern/sched_ule.c

 ==
 --- head/sys/kern/sched_ule.c Wed Oct 24 18:33:44 2012(r242013)
 +++ head/sys/kern/sched_ule.c Wed Oct 24 18:36:41 2012(r242014)
 @@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
   * locking in sched_pickcpu();
   */
  struct tdq {
 - /* Ordered to improve efficiency of cpu_search() and switch(). */
 + /*
 +  * Ordered to improve efficiency of cpu_search() and switch().
 +  * tdq_lock is padded to avoid false sharing with tdq_load and
 +  * tdq_cpu_idle.
 +  */
   struct mtx  tdq_lock;   /* run queue lock. */
 + char        pad[64 - sizeof(struct mtx)];

 Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?


No - that doesn't pad it.  I believe that only works if it's global,
i.e. not part of a data structure.

I could have aligned tdq_cg instead of using the char array to get the
same effect.  I chose the padding only because there was precedent
for it in sys/vm/vm_page.h - struct vpglocks.
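
The idiom looks roughly like this (an illustrative sketch of the
explicit-pad approach, not a verbatim copy of struct vpglocks):

    struct padded_lock {
            struct mtx lock;
            /* Push the next member onto a fresh 64-byte line.  As a
             * bonus, this fails to compile (negative array size) if
             * struct mtx ever outgrows 64 bytes. */
            char       pad[64 - sizeof(struct mtx)];
    };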


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Attilio Rao
On Wed, Oct 24, 2012 at 8:00 PM, Jim Harris jim.har...@gmail.com wrote:
 On Wed, Oct 24, 2012 at 11:43 AM, John Baldwin j...@freebsd.org wrote:
 On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:
 Author: jimharris
 Date: Wed Oct 24 18:36:41 2012
 New Revision: 242014
 URL: http://svn.freebsd.org/changeset/base/242014

 Log:
   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

   This enables CPU searches (which read tdq_load) to operate independently
   of any contention on the spinlock.  Some scheduler-intensive workloads
   running on an 8C single-socket SNB Xeon show considerable improvement with
   this change (2-3% perf improvement, 5-6% decrease in CPU util).

   Sponsored by:   Intel
   Reviewed by:jeff

 Modified:
   head/sys/kern/sched_ule.c

 Modified: head/sys/kern/sched_ule.c

 ==
 --- head/sys/kern/sched_ule.c Wed Oct 24 18:33:44 2012(r242013)
 +++ head/sys/kern/sched_ule.c Wed Oct 24 18:36:41 2012(r242014)
 @@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
   * locking in sched_pickcpu();
   */
  struct tdq {
 - /* Ordered to improve efficiency of cpu_search() and switch(). */
 + /*
 +  * Ordered to improve efficiency of cpu_search() and switch().
 +  * tdq_lock is padded to avoid false sharing with tdq_load and
 +  * tdq_cpu_idle.
 +  */
   struct mtx  tdq_lock;   /* run queue lock. */
 + char        pad[64 - sizeof(struct mtx)];

 Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?


 No - that doesn't pad it.  I believe that only works if it's global,
 i.e. not part of a data structure.

As I've already said in another thread, __aligned() doesn't work on
an object declaration, so it won't pad the object either, whether it
is global or part of a struct.
It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html
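
The difference is easy to see side by side (illustrative, with the
attribute written out):

    struct a { int x; } __attribute__((aligned(64)));  /* on the type */
    /* sizeof(struct a) == 64: the size is rounded up, padding is free. */

    struct b { int x; };
    struct b obj __attribute__((aligned(64)));         /* on the object */
    /* Only obj's start address is (supposed to be) aligned;
     * sizeof(obj) == 4, so nothing is padded. */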

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Attilio Rao
On Wed, Oct 24, 2012 at 7:56 PM, Jim Harris jim.har...@gmail.com wrote:
 On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:
 On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

 Ok, but..


 struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];

 .. don't we have an existing compile time macro for the cache line
 size, which can be used here?

 Yes, but I didn't use it for a couple of reasons:

 1) struct tdq itself is currently using __aligned(64), so I wanted to
 keep it consistent.
 2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
 NetBurst-based processors having 128-byte cache sectors a while back.
 I had planned to start a separate thread on arch@ about this today on
 whether this was still appropriate.

While you may want to bring CACHE_LINE_SIZE to 64, code in sched_ule
should still use CACHE_LINE_SIZE, as MI code cannot make assumptions
about cache line sizes across all the architectures.
Your reasons for using 64 are valid, but a further step to clean up
CACHE_LINE_SIZE and then use it properly in sched_ule is due, IMHO.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jim Harris
On Wed, Oct 24, 2012 at 12:08 PM, Attilio Rao atti...@freebsd.org wrote:
 On Wed, Oct 24, 2012 at 7:56 PM, Jim Harris jim.har...@gmail.com wrote:
 On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:
 On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

 Ok, but..


 struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];

 .. don't we have an existing compile time macro for the cache line
 size, which can be used here?

 Yes, but I didn't use it for a couple of reasons:

 1) struct tdq itself is currently using __aligned(64), so I wanted to
 keep it consistent.
 2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
 NetBurst-based processors having 128-byte cache sectors a while back.
 I had planned to start a separate thread on arch@ about this today on
 whether this was still appropriate.

 While you may want to bring CACHE_LINE_SIZE to 64, code in sched_ule
 should still use CACHE_LINE_SIZE, as MI code cannot make assumptions
 about cache line sizes across all the architectures.
 Your reasons for using 64 are valid, but a further step to clean up
 CACHE_LINE_SIZE and then use it properly in sched_ule is due, IMHO.

Yes - I agree.  I wanted to get some sort of consensus on using
CACHE_LINE_SIZE=64 on x86 before moving forward with this change.

 Attilio


 --
 Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 20:56, Jim Harris wrote:

On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:

On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:


   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.


Ok, but..



 struct mtx  tdq_lock;   /* run queue lock. */
+   char        pad[64 - sizeof(struct mtx)];


.. don't we have an existing compile time macro for the cache line
size, which can be used here?


Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to
keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
NetBurst-based processors having 128-byte cache sectors a while back.
I had planned to start a separate thread on arch@ about this today on
whether this was still appropriate.


See also the discussion on svn-src-all regarding global struct mtx
alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

 1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
 2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
the future possibly change to a different compiler dependent
align attribute
 3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
automatically gets aligned in all cases, even when dynamically
allocated.

Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
of #3 is that there possibly isn't any case where you'd actually
want the mutex to share a cache line with anything else, even a data
structure.
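
For #2, one plausible shape (hypothetical -- the exact semantics are
deliberately left open above):

    #ifdef SMP
    #define MTX_ALIGN       __aligned(CACHE_LINE_SIZE)
    #else
    #define MTX_ALIGN       /* UP: no other CPU to falsely share with */
    #endif

    /* Applied to a wrapper type so that sizeof() is rounded up too: */
    struct mtx_aligned {
            struct mtx      m;
    } MTX_ALIGN;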

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Alexander Motin

On 24.10.2012 22:16, Andre Oppermann wrote:

On 24.10.2012 20:56, Jim Harris wrote:

On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org
wrote:

On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:


   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.


Ok, but..



 struct mtx  tdq_lock;   /* run queue lock. */
+   char        pad[64 - sizeof(struct mtx)];


.. don't we have an existing compile time macro for the cache line
size, which can be used here?


Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to
keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
NetBurst-based processors having 128-byte cache sectors a while back.
I had planned to start a separate thread on arch@ about this today on
whether this was still appropriate.


See also the discussion on svn-src-all regarding global struct mtx
alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

  1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
  2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
 the future possibly change to a different compiler dependent
 align attribute
  3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
 automatically gets aligned in all cases, even when dynamically
 allocated.

Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
of #3 is that there possibly isn't any case where you'd actually
want the mutex to share a cache line with anything else, even a data
structure.


I'm sorry, could you point me at some theory? I think I can agree that 
cache line sharing can be a problem in the case of spin locks -- a waiting 
thread will constantly try to access a cache line modified by another CPU, 
which I guess will cause cache line write-backs to RAM. But why is it so 
bad to share a lock with its respective data in the case of non-spin locks? 
Won't the benefit of effectively prefetching the right data for free while 
grabbing the lock compensate for the penalties of relatively rare collisions?


--
Alexander Motin


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jim Harris
On Wed, Oct 24, 2012 at 12:16 PM, Andre Oppermann an...@freebsd.org wrote:

snip


 See also the discussion on svn-src-all regarding global struct mtx
 alignment.

 Thank you for proving my point. ;)

 Let's go back and see how we can do this the sanest way.  These are
 the options I see at the moment:

  1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
  2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
 the future possibly change to a different compiler dependent
 align attribute
  3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
 automatically gets aligned in all cases, even when dynamically
 allocated.

 Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
 of #3 is that there possibly isn't any case where you'd actually
 want the mutex to share a cache line with anything else, even a data
 structure.

I've run the same tests with #3 as you describe, and I did see further
noticeable improvement.  I had a difficult time, though, quantifying the
effect it would have on all of the different architectures.  Putting
it in ULE's tdq gained 60-70% of the overall benefit, and was well
contained.

I agree that sprinkling all over the place isn't pretty.  But focused
investigations into specific locks (spin mutexes, default mutexes,
whatever) may find a few key additional ones that would benefit.  I
started down this path with the sleepq and turnstile locks, but none
of those specifically showed noticeable improvement (at least in the
tests I was running).  There's still some additional ones I want to
look at, but haven't had the time yet.

Thanks,

-Jim


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Attilio Rao
On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann an...@freebsd.org wrote:
 On 24.10.2012 20:56, Jim Harris wrote:

 On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:

 On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.


 Ok, but..


  struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];


 .. don't we have an existing compile time macro for the cache line
 size, which can be used here?


 Yes, but I didn't use it for a couple of reasons:

 1) struct tdq itself is currently using __aligned(64), so I wanted to
 keep it consistent.
 2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
 NetBurst-based processors having 128-byte cache sectors a while back.
 I had planned to start a separate thread on arch@ about this today on
 whether this was still appropriate.


 See also the discussion on svn-src-all regarding global struct mtx
 alignment.

 Thank you for proving my point. ;)

 Let's go back and see how we can do this the sanest way.  These are
 the options I see at the moment:

  1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place

This is wrong because it doesn't give padding.

  2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
 the future possibly change to a different compiler dependent
 align attribute

What is this macro supposed to do? I don't understand that from your
description.

  3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
 automatically gets aligned in all cases, even when dynamically
 allocated.

This works but I think it is overkill for structures including sleep
mutexes, which are the vast majority.  So I certainly wouldn't be in
favor of such a patch.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Jeff Roberson

On Wed, 24 Oct 2012, Attilio Rao wrote:


On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann an...@freebsd.org wrote:

On 24.10.2012 20:56, Jim Harris wrote:


On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org wrote:


On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:


   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.



Ok, but..



 struct mtx  tdq_lock;   /* run queue lock. */
+   char        pad[64 - sizeof(struct mtx)];



.. don't we have an existing compile time macro for the cache line
size, which can be used here?



Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to
keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
NetBurst-based processors having 128-byte cache sectors a while back.
I had planned to start a separate thread on arch@ about this today on
whether this was still appropriate.



See also the discussion on svn-src-all regarding global struct mtx
alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

 1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place


This is wrong because it doesn't give padding.


 2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
the future possibly change to a different compiler dependent
align attribute


What is this macro supposed to do? I don't understand that from your
description.


 3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
automatically gets aligned in all cases, even when dynamically
allocated.


This works but I think it is overkill for structures including sleep
mutexes, which are the vast majority.  So I certainly wouldn't be in
favor of such a patch.


I agree.  For locks with little contention we probably want smaller 
structures.  For example, you wouldn't want to put a huge lock in every 
file descriptor.  It would be nice to have an automatic way to pad every 
global lock though.  I think it should be done as needed.
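
One hypothetical way to get that for globals only (a sketch; the macro
name is made up, and users would have to say name.mtx):

    #define MTX_GLOBAL(name)                                        \
            static struct {                                         \
                    struct mtx      mtx;                            \
                    char            pad[CACHE_LINE_SIZE -           \
                                        sizeof(struct mtx)];        \
            } __aligned(CACHE_LINE_SIZE) name

    MTX_GLOBAL(foo_lock);           /* padded, line-aligned global */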


Jeff



Attilio


--
Peace can only be achieved by understanding - A. Einstein




Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Attilio Rao
On Wed, Oct 24, 2012 at 8:30 PM, Alexander Motin m...@freebsd.org wrote:
 On 24.10.2012 22:16, Andre Oppermann wrote:

 On 24.10.2012 20:56, Jim Harris wrote:

 On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org
 wrote:

 On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:

Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.


 Ok, but..


  struct mtx  tdq_lock;   /* run queue lock. */
 +   char        pad[64 - sizeof(struct mtx)];


 .. don't we have an existing compile time macro for the cache line
 size, which can be used here?


 Yes, but I didn't use it for a couple of reasons:

 1) struct tdq itself is currently using __aligned(64), so I wanted to
 keep it consistent.
 2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
 NetBurst-based processors having 128-byte cache sectors a while back.
 I had planned to start a separate thread on arch@ about this today on
 whether this was still appropriate.


 See also the discussion on svn-src-all regarding global struct mtx
 alignment.

 Thank you for proving my point. ;)

 Let's go back and see how we can do this the sanest way.  These are
 the options I see at the moment:

   1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
   2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
  the future possibly change to a different compiler dependent
  align attribute
   3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
  automatically gets aligned in all cases, even when dynamically
  allocated.

 Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
 of #3 is that there possibly isn't any case where you'd actually
 want the mutex to share a cache line with anything else, even a data
 structure.


 I'm sorry, could you point me at some theory? I think I can agree that
 cache line sharing can be a problem in the case of spin locks -- a waiting
 thread will constantly try to access a cache line modified by another CPU,
 which I guess will cause cache line write-backs to RAM. But why is it so
 bad to share a lock with its respective data in the case of non-spin locks?
 Won't the benefit of effectively prefetching the right data for free while
 grabbing the lock compensate for the penalties of relatively rare collisions?

Yes, but be aware that adaptive spinning imposes the same type of
cache sharing issues as spinlocks have.

I just found this 4-year-old patch that implements back-off
algorithms for spin mutexes and adaptive spinning. I think it would be
interesting if someone could find time to benchmark and tune it. I can
clean it up and make it a commit candidate if there is interest:
http://lists.freebsd.org/pipermail/freebsd-smp/2008-June/001561.html

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 21:49, Jim Harris wrote:

On Wed, Oct 24, 2012 at 12:16 PM, Andre Oppermann an...@freebsd.org wrote:

snip



See also the discussion on svn-src-all regarding global struct mtx
alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

  1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
  2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
 the future possibly change to a different compiler dependent
 align attribute
  3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
 automatically gets aligned in all cases, even when dynamically
 allocated.

Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
of #3 is that there possibly isn't any case where you'd actually
want the mutex to share a cache line with anything else, even a data
structure.


I've run the same tests with #3 as you describe, and I did see further
noticeable improvement.  I had a difficult time, though, quantifying the
effect it would have on all of the different architectures.  Putting
it in ULE's tdq gained 60-70% of the overall benefit, and was well
contained.


I just experimented with different specifications of alignment
and couldn't get the globals aligned at all.  This seems to be
because the linker either does not understand the alignment
information or is not passed it when linking the kernel.


I agree that sprinkling all over the place isn't pretty.  But focused
investigations into specific locks (spin mutexes, default mutexes,
whatever) may find a few key additional ones that would benefit.  I
started down this path with the sleepq and turnstile locks, but none
of those specifically showed noticeable improvement (at least in the
tests I was running).  There's still some additional ones I want to
look at, but haven't had the time yet.


This runs the very great risk of optimizing for today's available
architectures and then needing rejiggling every five years, just as
you've noticed the issue with 128B alignment from the NetBurst days.
We never know how the next micro-architecture will behave.  Micro-
optimizing each individual invocation of common building blocks is
the wrong path to take.

I'd very much prefer the alignment *and* padding control to be done
in one place for all of them, either through a magic macro or compiler
__attribute__(whatever).

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 21:06, Attilio Rao wrote:

On Wed, Oct 24, 2012 at 8:00 PM, Jim Harris jim.har...@gmail.com wrote:

On Wed, Oct 24, 2012 at 11:43 AM, John Baldwin j...@freebsd.org wrote:

On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:

Author: jimharris
Date: Wed Oct 24 18:36:41 2012
New Revision: 242014
URL: http://svn.freebsd.org/changeset/base/242014

Log:
   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

   This enables CPU searches (which read tdq_load) to operate independently
   of any contention on the spinlock.  Some scheduler-intensive workloads
   running on an 8C single-socket SNB Xeon show considerable improvement with
   this change (2-3% perf improvement, 5-6% decrease in CPU util).

   Sponsored by:   Intel
   Reviewed by:jeff

Modified:
   head/sys/kern/sched_ule.c

Modified: head/sys/kern/sched_ule.c


==

--- head/sys/kern/sched_ule.c Wed Oct 24 18:33:44 2012(r242013)
+++ head/sys/kern/sched_ule.c Wed Oct 24 18:36:41 2012(r242014)
@@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
   * locking in sched_pickcpu();
   */
  struct tdq {
- /* Ordered to improve efficiency of cpu_search() and switch(). */
+ /*
+  * Ordered to improve efficiency of cpu_search() and switch().
+  * tdq_lock is padded to avoid false sharing with tdq_load and
+  * tdq_cpu_idle.
+  */
   struct mtx  tdq_lock;   /* run queue lock. */
+ char        pad[64 - sizeof(struct mtx)];


Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?



No - that doesn't pad it.  I believe that only works if it's global,
i.e. not part of a data structure.


As I've already said in another thread, __aligned() doesn't work on
an object declaration, so it won't pad the object either, whether it
is global or part of a struct.
It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html


Actually it seems gcc itself doesn't really care, and it is up to the
linker to honor that.

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Attilio Rao
On Wed, Oct 24, 2012 at 9:25 PM, Andre Oppermann an...@freebsd.org wrote:
 On 24.10.2012 21:06, Attilio Rao wrote:

 On Wed, Oct 24, 2012 at 8:00 PM, Jim Harris jim.har...@gmail.com wrote:

 On Wed, Oct 24, 2012 at 11:43 AM, John Baldwin j...@freebsd.org wrote:

 On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:

 Author: jimharris
 Date: Wed Oct 24 18:36:41 2012
 New Revision: 242014
 URL: http://svn.freebsd.org/changeset/base/242014

 Log:
Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

This enables CPU searches (which read tdq_load) to operate
 independently
of any contention on the spinlock.  Some scheduler-intensive
 workloads
running on an 8C single-socket SNB Xeon show considerable
 improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by:   Intel
Reviewed by:jeff

 Modified:
head/sys/kern/sched_ule.c

 Modified: head/sys/kern/sched_ule.c


 ==

 --- head/sys/kern/sched_ule.c Wed Oct 24 18:33:44 2012(r242013)
 +++ head/sys/kern/sched_ule.c Wed Oct 24 18:36:41 2012(r242014)
 @@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
* locking in sched_pickcpu();
*/
   struct tdq {
 - /* Ordered to improve efficiency of cpu_search() and switch(). */
 + /*
 +  * Ordered to improve efficiency of cpu_search() and switch().
 +  * tdq_lock is padded to avoid false sharing with tdq_load and
 +  * tdq_cpu_idle.
 +  */
struct mtx  tdq_lock;   /* run queue lock. */
 + char        pad[64 - sizeof(struct mtx)];


 Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?


 No - that doesn't pad it.  I believe that only works if it's global,
 i.e. not part of a data structure.


 As I've already said in another thread, __aligned() doesn't work on
 an object declaration, so it won't pad the object either, whether it
 is global or part of a struct.
 It is just implemented as __attribute__((aligned(X))):
 http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html


 Actually it seems gcc itself doesn't really care, and it is up to the
 linker to honor that.

Yes, but the concept is that if you use __aligned() properly (when
defining a struct) the object will be correctly sized, so you will get
padding automatically.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 21:30, Alexander Motin wrote:

On 24.10.2012 22:16, Andre Oppermann wrote:

On 24.10.2012 20:56, Jim Harris wrote:

On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd adr...@freebsd.org
wrote:

On 24 October 2012 11:36, Jim Harris jimhar...@freebsd.org wrote:


   Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.


Ok, but..



 struct mtx  tdq_lock;   /* run queue lock. */
+   char        pad[64 - sizeof(struct mtx)];


.. don't we have an existing compile time macro for the cache line
size, which can be used here?


Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to
keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to
NetBurst-based processors having 128-byte cache sectors a while back.
I had planned to start a separate thread on arch@ about this today on
whether this was still appropriate.


See also the discussion on svn-src-all regarding global struct mtx
alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

  1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
  2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
 the future possibly change to a different compiler dependent
 align attribute
  3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
 automatically gets aligned in all cases, even when dynamically
 allocated.

Personally I'm undecided between #2 and #3.  #1 is ugly.  In favor
of #3 is that there possibly isn't any case where you'd actually
want the mutex to share a cache line with anything else, even a data
structure.


I'm sorry, could you point me at some theory? I think I can agree that
cache line sharing can be a problem in the case of spin locks -- a waiting
thread will constantly try to access a cache line modified by another CPU,
which I guess will cause cache line write-backs to RAM. But why is it so
bad to share a lock with its respective data in the case of non-spin locks?
Won't the benefit of effectively prefetching the right data for free while
grabbing the lock compensate for the penalties of relatively rare collisions?


Cliff Click describes it in detail:
 http://www.azulsystems.com/blog/cliff/2009-04-14-odds-ends

For a classic mutex it likely doesn't make much difference since the
cache line is exclusive anyway while the lock is held.  On LL/SC systems
there may be cache line dirtying on a failed locking attempt.

For spin mutexes it hurts badly as you noted.

Especially on RW mutexes it hurts because a read lock dirties the cache
line for all other CPUs.  Here the RW mutex should be on its own cache
line in all cases.
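
(The write happens in the shared-acquire fast path itself; simplified
from the rwlock code:)

    /* Even a read lock is an atomic read-modify-write on rw_lock, so
     * each reader takes the line exclusive and invalidates the copies
     * held by all other CPUs. */
    if (atomic_cmpset_acq_ptr(&rw->rw_lock, v, v + RW_ONE_READER))
            break;  /* shared lock granted -- but the word was written */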

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 22:29, Attilio Rao wrote:

On Wed, Oct 24, 2012 at 9:25 PM, Andre Oppermann an...@freebsd.org wrote:

On 24.10.2012 21:06, Attilio Rao wrote:

As I've already said in another thread, __aligned() doesn't work on
an object declaration, so it won't pad the object either, whether it
is global or part of a struct.
It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html



Actually it seems gcc itself doesn't really care, and it is up to the
linker to honor that.


Yes, but the concept is that if you use __aligned() properly (when
defining a struct) the object will be correctly sized, so you will get
padding automatically.


Yes.  With __aligned() the start of the element/structure should
begin on an address evenly divisible by the align value *and* it
should pad out any remaining space up to the next evenly divisible
address.

The problem we have is that it apparently doesn't work correctly
within gcc when creating structs, nor within the linker when placing
such supposedly aligned structs in the .bss section (at least the
padding is missing).

It seems to come down to either a) fixing gcc+ld; or b) hacking
around it by magically padding the structs that require it.

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Andre Oppermann

On 24.10.2012 22:55, Andre Oppermann wrote:

On 24.10.2012 22:29, Attilio Rao wrote:

On Wed, Oct 24, 2012 at 9:25 PM, Andre Oppermann an...@freebsd.org wrote:

On 24.10.2012 21:06, Attilio Rao wrote:

As I've already said in another thread, __aligned() doesn't work on
an object declaration, so it won't pad the object either, whether it
is global or part of a struct.
It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html



Actually it seems gcc itself doesn't really care, and it is up to the
linker to honor that.


Yes, but the concept is that if you use __aligned() properly (when
defining a struct) the object will be correctly sized, so you will get
padding automatically.


Yes.  With __aligned() the start of the element/structure should
begin on an address evenly divisible by the align value *and* it
should pad out any remaining space up to the next evenly divisible
address.

The problem we have is that it apparently doesn't work correctly
within gcc when creating structs, nor within the linker when placing
such supposedly aligned structs in the .bss section (at least the
padding is missing).


I spoke too soon.  Attilio is completely right in his assessment.

It does work when done on the struct definition:

struct mtx {
...
} __aligned(CACHE_LINE_SIZE);   /* works including .bss alignment & padding */

When creating a struct (in globals at least) it doesn't work:

struct mtx __aligned(CACHE_LINE_SIZE) foo_mtx;  /* doesn't work */


It seems to come down to either a) fixing gcc+ld; or b) hacking
around it by magically padding the structs that require it.


The question now becomes whether we can (or should) make the latter
case above work or find another workaround.

--
Andre



Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Bruce Evans

On Wed, 24 Oct 2012, Attilio Rao wrote:


On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann an...@freebsd.org wrote:

...
Let's go back and see how we can do this the sanest way.  These are
the options I see at the moment:

 1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place


This is wrong because it doesn't give padding.


Unless it is sprinkled in struct declarations.


 2. use a macro like MTX_ALIGN that can be SMP/UP aware and in
the future possibly change to a different compiler dependent
align attribute


What is this macro supposed to do? I don't understand that from your
description.


 3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it
automatically gets aligned in all cases, even when dynamically
allocated.


This works but I think it is overkill for structures including sleep
mutexes, which are the vast majority.  So I certainly wouldn't be in
favor of such a patch.


This doesn't work either with fully dynamic (auto) allocations.  Stack
alignment is generally broken (limited, and pessimized for both space
and time) in gcc (it works better in clang).  On amd64, it is limited
by the default of -mpreferred-stack-boundary=4.  Since 2**4 is smaller
than the cache line size and stack alignments larger than it are broken
in gcc, __aligned(CACHE_LINE_SIZE) never works (except accidentally,
16/CACHE_LINE_SIZE of the time).  On i386, we reduce the space/time
pessimizations a little by overriding the default to
-mpreferred-stack-boundary=2.  2**2 is even smaller than the cache
line size.  (The pessimizations are for both space and time, since
time and code space is wasted for the code to keep the stack aligned,
and cache space and thus also time are wasted for padding.  Most
functions don't benefit from more than sizeof(register_t) alignment.)

Dynamic allocations via malloc() get whatever alignment malloc() gives.
This is only required to be 4 or 8 or 16 or so (the maximum for a C
object declared in conforming C, with no __aligned()), but malloc() usually
gives more.  If it gives CACHE_LINE_SIZE, that is wasteful for most
small allocations.

__builtin_alloca() is broken in gcc-3.3.3, but works in gcc-4.2.1, at
least on i386.  In gcc-3.3.3, it assumes that the stack is the default
16-byte aligned even if -mpreferred-stack-boundary=2 is in CFLAGS to
say otherwise, and just subtracts from the stack pointer.  In gcc-4.2.1,
it does the necessary andl of the stack pointer, but only 16-byte
alignment.

It is another bug that there are no alignment extensions of malloc() or alloca().
Since malloc() is in the library and may give CACHE_LINE_SIZE but
__builtin_alloca() is in the compiler and only gives 16, these functions
are not even as compatible as they should be.
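
The usual workaround is to over-allocate and round the pointer up by
hand, since neither interface takes an alignment argument.  A sketch
(illustrative only; the helper name is made up):

    #include <stdint.h>

    /* Round p up to the next multiple of align (a power of 2). */
    static void *
    ptr_align_up(void *p, uintptr_t align)
    {
            return ((void *)(((uintptr_t)p + align - 1) & ~(align - 1)));
    }

    /*
     * Usage: allocate size + align - 1 bytes, then align within that:
     *     char raw[sizeof(struct mtx) + 64 - 1];
     *     struct mtx *m = ptr_align_up(raw, 64);
     */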

I don't know of any mutexes allocated on the stack, but there are stack
frames with mcontexts in them that need special alignment so they cause
problems on i386.  They can't just be put on the stack due to the above
bugs. They are laboriously allocated using malloc().  Since they are
quite large (1 mcontext barely fits on the kernel stack), kib didn't
like my alloca() method for allocating them.

Bruce


Re: svn commit: r242014 - head/sys/kern

2012-10-24 Thread Bruce Evans

On Thu, 25 Oct 2012, Andre Oppermann wrote:


...
I spoke too soon.  Attilio is completely right in his assessment.

It does work when done on the struct definition:

struct mtx {
...
} __aligned(CACHE_LINE_SIZE);	/* works including .bss alignment & padding */


When creating a struct (in globals at least) it doesn't work:

struct mtx __aligned(CACHE_LINE_SIZE) foo_mtx;  /* doesn't work */


It seems to come down to either a) fixing gcc+ld; or b) hacking
around it by magically padding the structs that require it.


The question now becomes whether we can (or should) make the latter
case above work or find another workaround.


The latter case already works correctly.  The alignment directive only
specifies the alignment requirement for the start of the object.  The
size of the object is whatever it is, and there is no directive for
padding.

Consider an array of objects:

struct mtx __aligned(CACHE_LINE_SIZE) foo_mtx[128];

This aligns the first element in the array.  There is no way that it
can do special alignment for other elements.  The elements must follow
each other, with no padding in between.  The struct size is whatever it
is and is already padded for the correct alignment as known at the
time of the struct declaration, and you can't change its size later.

I once wrote the following mistake which gave excessive alignment,
but also sort of what you want:

   static double __aligned(64)
   P1 = ...,
   P2 = ...,
   P3 = ...;

This was intended to align the first double to 64 bytes and have the
others follow with no padding, much like what would happen if the
variables were in an array (they are named Pn mainly because this looks
nicer than P[n]).  What it actually does is align all the variables to
64 bytes and waste a lot of space and cache resources.  The linker
might fill in the gaps, but at least with static variables in a small
program, it won't be able to find enough variables to fill them all.
With static variables in a large program, dummy variables could
probably be provided to fill in the gaps.

For global mutexes, I would try putting them all in the same struct
starting at a page boundary, a bit like pcpu, so that I can control
the layout of them all.
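
A sketch of that layout (with made-up lock names -- nothing like this
exists in the tree):

    /* One page-aligned container for the global mutexes, so which
     * lock shares a line with which is explicit and decided in a
     * single place. */
    struct global_mtx_page {
            struct mtx      gp_foo_mtx;
            char            gp_pad0[CACHE_LINE_SIZE - sizeof(struct mtx)];
            struct mtx      gp_bar_mtx;
            char            gp_pad1[CACHE_LINE_SIZE - sizeof(struct mtx)];
            /* ... */
    } __aligned(PAGE_SIZE);

    static struct global_mtx_page gmtx;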

Bruce