Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-25 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Fri, 23 Dec 2011, Alexander Best wrote:

...

the gcc(1) man page states the following:


This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.


the comment in sys/conf/kern.mk however sorta suggests that the default
alignment of 4 bytes might improve performance.


The default stack alignment is 16 bytes, which unimproves performance.
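
As a rough illustration of where the extra stack goes: -mpreferred-stack-boundary=N asks gcc for 2^N-byte alignment, so N=2 means 4 bytes and N=4 means 16 bytes.  The sketch below rounds a frame size up to each boundary.  The helper names and the 20-byte frame are invented for the example, and real frame sizes also depend on what else the compiler pushes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: alignment in bytes requested by
 * -mpreferred-stack-boundary=n, i.e. 2^n. */
size_t
boundary_bytes(unsigned n)
{
	assert(n < 8 * sizeof(size_t));
	return ((size_t)1 << n);
}

/* Hypothetical helper: round a frame size up to a power-of-2 boundary. */
size_t
padded_frame(size_t frame, size_t boundary)
{
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	return ((frame + boundary - 1) & ~(boundary - 1));
}
```

With these, padded_frame(20, boundary_bytes(2)) stays at 20 bytes, while padded_frame(20, boundary_bytes(4)) grows to 32 bytes, i.e. 12 bytes of padding for that one frame.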


maybe the part of the comment in sys/conf/kern.mk, which mentions that a stack
alignment of 16 bytes might improve micro benchmark results should be removed.
this would prevent people (like me) from thinking, using a stack alignment of
4 bytes is a compromise between size and efficiency. it isn't! currently a
stack alignment of 16 bytes has no advantages over one with 4 bytes on i386.


I think the comment is clear enough.  It mentions all the tradeoffs.
It is only slightly cryptic in saying that these are tradeoffs and that
the configuration is our best guess at the best tradeoff -- it just says
"while" for both.  It goes without saying that we don't use our worst
guess.  Anyone wanting to change this should run benchmarks and beware
that micro-benchmarks are especially useless.  The changed comment is not
so good since it no longer mentions micro-benchmarks or says "while".


if micro benchmark results aren't of any use, why should the claim that the
default stack alignment of 16 bytes might produce better outcome stay?


Because:
- the actual claim is the opposite of that (it is that the default 16-byte
  alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit micro-benchmarks
  is true, even without the weaselish miswording of "might" in it.  There
  is always at least 1 micro-benchmark that will benefit from almost any
  change, and here we expect a benefit in many microbenchmarks that don't
  bust the caches.  Except, 16-byte alignment isn't supported (*) in the
  kernel, so we actually expect a loss from many microbenchmarks that
  don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the
  default is better because it is better in microbenchmarks.


it doesn't seem as if anybody has micro benchmarked 16 bytes vs. 4 bytes stack
alignment, until now. so the micro benchmark statement in the comment seems to
be pure speculation.


No, it is obviously true.


even worse...it indicates that by removing the
-mpreferred-stack-boundary=2 flag, one can gain a performance boost by
sacrificing a few more bytes of kernel (and module) size.


No, it is part of the sentence explaining why removing the
-mpreferred-stack-boundary=2 flag will probably regain the overall loss
that is avoided by using the flag.


this suggests that the behavior -mpreferred-stack-boundary=2 vs. not specifying
it, loosely equals the semantics of -Os vs. -O2.


No, -Os guarantees slower execution by forcing optimization to prefer
space savings over time savings in more ways.  Except, -Os is completely
broken in -current (in the kernel), and gives very large negative space
savings (about 50%).  It last worked with gcc-3.  Its brokenness with
gcc-4 is related to kern.pre.mk still specifying -finline-limit flags
that are more suitable for gcc-3 (gcc has _many_ flags for giving more
delicate control over inlining, and better defaults for them) and
excessive inlining in gcc-4 given by -funit-at-a-time
-finline-functions-called-once.  These apparently cause gcc's inliner
to go insane with -Os.  When I tried to fix this by reducing inlining,
I couldn't find any threshold that fixed -Os without breaking inlining
of functions that are declared inline.

(*) A primary part of the lack of support for 16-byte stack alignment in
the kernel is that there is no special stack alignment for the main kernel
entry point, namely syscall().  From i386/exception.s:

%   SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it (it was 16-byte aligned before
that).

%   pushl   $2  /* sizeof int 0x80 */
%   subl    $4,%esp /* skip over tf_trapno */
%   pushal
%   pushl   %ds
%   pushl   %es
%   pushl   %fs
%   SET_KERNEL_SREGS
%   cld
%   FAKE_MCOUNT(TF_EIP(%esp))
%   pushl   %esp

We push 14 more words.  This gives perfect misalignment to the worst odd
word boundary (perfect if only word boundaries are allowed).  gcc wants
the stack to be aligned to a 4*n word boundary before function calls,
but here we have a 4*n+3 word boundary.  (4*n+3 is worse than 4*n+1
since 2 more words instead of 4 will cross the next 16-byte boundary).

%   call    syscall
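
The word counting above can be checked mechanically.  This is only a sketch: it takes the 5 words already on the stack at the int 0x80 entry as given, counts the pushes shown in the excerpt (pushal stores 8 registers), and uses an invented function name.

```c
#include <assert.h>

/* Words pushed by the quoted int0x80_syscall prologue:
 * pushl $2 (1) + subl $4,%esp (1) + pushal (8) + %ds, %es, %fs (3)
 * + pushl %esp (1). */
enum { PROLOGUE_WORDS = 1 + 1 + 8 + 3 + 1 };

/* Hypothetical helper: stack depth in 4-byte words modulo a 16-byte
 * (4-word) boundary, assuming the stack was 16-byte aligned before the
 * first of the words_on_entry words was pushed. */
int
words_past_boundary(int words_on_entry)
{
	assert(words_on_entry >= 0);
	return ((words_on_entry + PROLOGUE_WORDS) % 4);
}
```

PROLOGUE_WORDS evaluates to 14, and words_past_boundary(5) returns 3, since 5 + 14 = 19 = 4*4 + 3 words.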

Using the default 

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Alexander Best
On Sat Dec 24 11, Bruce Evans wrote:
 On Fri, 23 Dec 2011, Alexander Best wrote:
 
 is -mpreferred-stack-boundary=2 really necessary for i386 builds any 
 longer?
 i built GENERIC (including modules) with and without that flag. the results
 are:
 
 The same as it has always been.  It avoids some bloat.
 
 1654496  bytes with the flag set
 vs.
 1654952  bytes with the flag unset
 
 I don't believe this.  GENERIC is enormously bloated, so it has size
 more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes

i'm sorry. i used du(1) to get those numbers, so i believe those numbers
represent the amount of 512-byte blocks. if i'm correct GENERIC is even
more bloated than you feared and almost reaches 1GB:

807,859375  megabytes with flag set
vs.
808,0820313 megabytes without the flag set

 is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
 Changing the default target arch from i386 to pentium-undocumented has
 reduced the text space savings a little, since the default for passing
 args is now to preallocate stack space for them and store to this,
 instead of to push them; this preallocation results in more functions
 needing to allocate some stack space explicitly, and when some is
 allocated explicitly, the text space cost for this doesn't depend on
 the size of the allocation.
 
 Anyway, the savings are mostly from avoiding cache misses from
 sparse allocation on stacks.
 
 Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
 - KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
   stack might push something over the edge
 - not much care is taken to align the initial stack or to keep the
   stack aligned in calls from asm code.  E.g., any alignment for
   mi_startup() (and thus proc0?) is accidental.  This may result
   in perfect alignment or perfect misalignment.  Hopefully, more
   care is taken with thread startup.  For gcc, the alignment is
   done bogusly in main() in userland, but there is no main() in
   the kernel.  The alignment doesn't matter much (provided the
   perfect misalignment is still to a multiple of 4), but when it
   matters, the random misalignment that results from not trying to
   do it at all is better than perfect misalignment from getting it
   wrong.  With 4-byte alignment, the only cases that it helps are
   with 64-bit variables.
 
 the gcc(1) man page states the following:
 
 
 This extra alignment does consume extra stack space, and generally
 increases code size.  Code that is sensitive to stack space usage,
 such as embedded systems and operating system kernels, may want to
 reduce the preferred alignment to -mpreferred-stack-boundary=2.
 
 
 the comment in sys/conf/kern.mk however sorta suggests that the default
 alignment of 4 bytes might improve performance.
 
 The default stack alignment is 16 bytes, which unimproves performance.
 
 clang handles stack alignment correctly (only does it when it is needed)
 so it doesn't need a -mpreferred-stack-boundary option and doesn't
 always break without alignment in main().  Well, at least it used to,
 IIRC.  Testing it now shows that it does the necessary andl of the
 stack pointer for __aligned(32), but for __aligned(16) it now assumes
 that the stack is aligned by the caller.  So it now needs
 -mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
 do the andl in main() like gcc does (unless you put a dummy __aligned(32)
 there), but requires crt to pass an aligned stack.
 
 Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Alexander Best
On Sat Dec 24 11, Bruce Evans wrote:
 On Fri, 23 Dec 2011, Alexander Best wrote:
 
 is -mpreferred-stack-boundary=2 really necessary for i386 builds any 
 longer?
 i built GENERIC (including modules) with and without that flag. the results
 are:
 
 The same as it has always been.  It avoids some bloat.
 
 1654496  bytes with the flag set
 vs.
 1654952  bytes with the flag unset
 
 I don't believe this.  GENERIC is enormously bloated, so it has size
 more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
 is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
 Changing the default target arch from i386 to pentium-undocumented has
 reduced the text space savings a little, since the default for passing
 args is now to preallocate stack space for them and store to this,
 instead of to push them; this preallocation results in more functions
 needing to allocate some stack space explicitly, and when some is
 allocated explicitly, the text space cost for this doesn't depend on
 the size of the allocation.
 
 Anyway, the savings are mostly from avoiding cache misses from
 sparse allocation on stacks.
 
 Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
 - KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
   stack might push something over the edge
 - not much care is taken to align the initial stack or to keep the
   stack aligned in calls from asm code.  E.g., any alignment for
   mi_startup() (and thus proc0?) is accidental.  This may result
   in perfect alignment or perfect misalignment.  Hopefully, more
   care is taken with thread startup.  For gcc, the alignment is
   done bogusly in main() in userland, but there is no main() in
   the kernel.  The alignment doesn't matter much (provided the
   perfect misalignment is still to a multiple of 4), but when it
   matters, the random misalignment that results from not trying to
   do it at all is better than perfect misalignment from getting it
   wrong.  With 4-byte alignment, the only cases that it helps are
   with 64-bit variables.
 
 the gcc(1) man page states the following:
 
 
 This extra alignment does consume extra stack space, and generally
 increases code size.  Code that is sensitive to stack space usage,
 such as embedded systems and operating system kernels, may want to
 reduce the preferred alignment to -mpreferred-stack-boundary=2.
 
 
 the comment in sys/conf/kern.mk however sorta suggests that the default
 alignment of 4 bytes might improve performance.
 
 The default stack alignment is 16 bytes, which unimproves performance.

maybe the part of the comment in sys/conf/kern.mk, which mentions that a stack
alignment of 16 bytes might improve micro benchmark results should be removed.
this would prevent people (like me) from thinking, using a stack alignment of
4 bytes is a compromise between size and efficiency. it isn't! currently a
stack alignment of 16 bytes has no advantages over one with 4 bytes on i386.
so specifying -mpreferred-stack-boundary=2 on i386 is absolutely mandatory.

please see the attached patch, which also introduces a line break in order to
describe the stack alignment issue in a paragraph of its own.

cheers.
alex

 
 clang handles stack alignment correctly (only does it when it is needed)
 so it doesn't need a -mpreferred-stack-boundary option and doesn't
 always break without alignment in main().  Well, at least it used to,
 IIRC.  Testing it now shows that it does the necessary andl of the
 stack pointer for __aligned(32), but for __aligned(16) it now assumes
 that the stack is aligned by the caller.  So it now needs
 -mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
 do the andl in main() like gcc does (unless you put a dummy __aligned(32)
 there), but requires crt to pass an aligned stack.
 
 Bruce
Index: /usr/src/sys/conf/kern.mk
===
--- /usr/src/sys/conf/kern.mk   (revision 228845)
+++ /usr/src/sys/conf/kern.mk   (working copy)
@@ -30,12 +30,12 @@
 # On i386, do not align the stack to 16-byte boundaries.  Otherwise GCC 2.95
 # and above adds code to the entry and exit point of every function to align the
 # stack to 16-byte boundaries -- thus wasting approximately 12 bytes of stack
-# per function call.  While the 16-byte alignment may benefit micro benchmarks,
-# it is probably an overall loss as it makes the code bigger (less efficient
-# use of code cache tag lines) and uses more stack (less efficient use of data
-# cache tag lines).  Explicitly prohibit the use of FPU, SSE and other SIMD
-# operations inside the kernel itself.  These operations are exclusively
-# reserved for user applications.
+# per function call.  This makes the code bigger (less efficient use of code
+# cache tag lines) and uses more stack (less efficient use of data cache tag
+# lines).
+# Explicitly prohibit the use of FPU, SSE and other SIMD operations inside the
+# kernel 

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Matthias Andree
On 24.12.2011 00:56, Alexander Best wrote:
 hi there,
 
 is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer?
 i built GENERIC (including modules) with and without that flag. the results
 are:
 
 1654496   bytes with the flag set
 vs.
 1654952   bytes with the flag unset
 
 the gcc(1) man page states the following:
 
 
 This extra alignment does consume extra stack space, and generally
 increases code size.  Code that is sensitive to stack space usage,
 such as embedded systems and operating system kernels, may want to
 reduce the preferred alignment to -mpreferred-stack-boundary=2.
 
 

What do the numbers above have to do with *stack* alignment or size
(which is a run-time figure, and cannot be statically determined if any
variable-depth recursion takes place)?

What are those 16... numbers, anyways? How did you obtain them?


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Stefan Bethke

On 24.12.2011 at 12:06, Bruce Evans wrote:

 On Fri, 23 Dec 2011, Adrian Chadd wrote:
 
 Well, the whole kernel is bloated at the moment, sorry.
 
 I've been trying to build the _bare minimum_ required to bootstrap
 -HEAD on these embedded boards and I can't get the kernel down below 5
 megabytes - ie, one with FFS (with options disabled), MIPS, INET (no
 INET6), net80211, ath (which admittedly is big, but I need it no
 matter what, right?) comes in at:
 
 -r-xr-xr-x  1 root  wheel   5307021 Nov 29 19:14 kernel.LSSR71
 
 And with INET6, on another board (and this includes MSDOS and the
 relevant geom modules):
 
 -r-xr-xr-x  1 root  wheel   5916759 Nov 28 12:00 kernel.RSPRO
 
 .. honestly, that's what should be addressed. That's honestly a bit 
 ridiculous.
 
 It's disgusting, but what problems does it cause apart from minor slowness
 from cache misses?

The flash chip on these devices only has 8MB; some of the really cheap ones 
only have 4MB (yes MB, not GB).  And many have only 32MB RAM.  It would be nice 
to have space for actual applications :-)


Stefan

-- 
Stefan Bethke s...@lassitu.de   Fon +49 151 14070811





Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Alexander Best
On Sat Dec 24 11, Bruce Evans wrote:
 On Sat, 24 Dec 2011, Alexander Best wrote:
 
 On Sat Dec 24 11, Bruce Evans wrote:
 On Fri, 23 Dec 2011, Alexander Best wrote:
 
 is -mpreferred-stack-boundary=2 really necessary for i386 builds any
 longer?
 i built GENERIC (including modules) with and without that flag. the 
 results
 are:
 
 The same as it has always been.  It avoids some bloat.
 
 1654496 bytes with the flag set
 vs.
 1654952 bytes with the flag unset
 
 I don't believe this.  GENERIC is enormously bloated, so it has size
 more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
 
 i'm sorry. i used du(1) to get those numbers, so i believe those numbers
 represent the amount of 512-byte blocks. if i'm correct GENERIC is even
 more bloated than you feared and almost reaches 1GB:
 
 807,859375  megabytes with flag set
 vs.
 808,0820313 megabytes without the flag set
 
 That's certainly bloated.  It counts all object files and modules, and
 probably everything is compiled with -g.  I only counted kernel text
 size.

yeah, but for demonstrating the difference in size between the build with
-mpreferred-stack-boundary=2 set and -mpreferred-stack-boundary=2 unset, it
doesn't really matter how big the directories are and if object files are
included. the difference in size is < 1 megabyte. so setting
-mpreferred-stack-boundary=2 doesn't aid in reducing the kernel (or modules)
size, but merely improves stack performance/efficiency.
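
For reference, the du(1) figures quoted above can be converted as in the sketch below.  It assumes 512-byte blocks, as stated earlier in the thread, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { DU_BLOCK_SIZE = 512 };	/* assumed du(1) block size in bytes */

/* Hypothetical helper: convert a du(1) block count to bytes. */
uint64_t
du_blocks_to_bytes(uint64_t blocks)
{
	return (blocks * DU_BLOCK_SIZE);
}

/* Hypothetical helper: size difference of two trees in KiB, rounded down. */
uint64_t
du_diff_kib(uint64_t blocks_a, uint64_t blocks_b)
{
	assert(blocks_b >= blocks_a);
	return (du_blocks_to_bytes(blocks_b - blocks_a) / 1024);
}
```

du_blocks_to_bytes(1654496) gives 847101952 bytes, which is the 807,859375 megabytes quoted above at 2^20 bytes per megabyte, and du_diff_kib(1654496, 1654952) gives 228, so the two build trees differ by about 228 KiB.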

cheers.
alex

 
 Bruce


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Alexander Best
On Sat Dec 24 11, Bruce Evans wrote:
 On Fri, 23 Dec 2011, Adrian Chadd wrote:
 
 Well, the whole kernel is bloated at the moment, sorry.
 
 I've been trying to build the _bare minimum_ required to bootstrap
 -HEAD on these embedded boards and I can't get the kernel down below 5
 megabytes - ie, one with FFS (with options disabled), MIPS, INET (no
 INET6), net80211, ath (which admittedly is big, but I need it no
 matter what, right?) comes in at:
 
 -r-xr-xr-x  1 root  wheel   5307021 Nov 29 19:14 kernel.LSSR71
 
 And with INET6, on another board (and this includes MSDOS and the
 relevant geom modules):
 
 -r-xr-xr-x  1 root  wheel   5916759 Nov 28 12:00 kernel.RSPRO
 
 .. honestly, that's what should be addressed. That's honestly a bit 
 ridiculous.
 
 It's disgusting, but what problems does it cause apart from minor slowness
 from cache misses?
 
 I used to monitor the size of a minimal i386 kernel:
 
 % machine i386
 % cpu I686_CPU
 % ident   MIN
 % options SCHED_4BSD
 
 In FreeBSD-5-CURRENT between 5.1R and 5.2R, this had size:
 
    text    data     bss     dec     hex filename
  931241   86524   62356 1080121  107b39 /sysc/i386/compile/min/kernel
 
 A minimal kernel is not useful, but maybe you can add some i/o to it
 without bloating it too much.
 
 This almost builds in -current too.  I had to add the following:
 - NO_MODULES to de-bloat the compile time
 - MK_CTF=no to build -current on FreeBSD.9.  The kernel .mk files are
   still broken (depend on nonstandard/new features in sys.mk).

strange. the build(7) man page claims that:


 WITH_CTF  If defined, the build process will run the DTrace CTF
   conversion tools on built objects.  Please note that
   this WITH_ option is handled differently than all other
   WITH_ options (there is no WITHOUT_CTF, or correspond-
   ing MK_CTF in the build system).


... so setting MK_CTF to anything shouldn't have any effect (according to the man page).

cheers.
alex

 - comment out a line in if.c that refers to Vloif.  if.c is standard
   but the loop device is optional.
 
 A few more changes to remove non-minimalities that are not defaults
 made little difference:
 
 % machine i386
 % cpu I686_CPU
 % ident   MIN
 % options SCHED_4BSD
 % 
 % # XXX kill default misconfigurations.
 % makeoptions NO_MODULES=yes
 % makeoptions COPTFLAGS=-O -pipe
 % 
 % # XXX from here on is to try to kill everything in DEFAULTS.
 % 
 % # nodevice  isa # needed for DELAY...
 % # nooptions ISAPNP  # needed ...
 % 
 % nodevice    npx
 % 
 % nodevice    mem
 % nodevice    io
 % 
 % nodevice    uart_ns8250
 % 
 % nooptions   GEOM_PART_BSD
 % nooptions   GEOM_PART_EBR
 % nooptions   GEOM_PART_EBR_COMPAT
 % nooptions   GEOM_PART_MBR
 % 
 % # nooptions NATIVE  # needed ...
 % # nodevice  atpic   # needed ...
 % 
 % nooptions   NEW_PCIB
 % 
 % nooptions   VFS_ALLOW_NONMPSAFE
 
    text    data     bss     dec     hex filename
 1663902  110632  136892 1911426  1d2a82 kernel
 
 (This was about 100K larger with -O2 and all DEFAULTS).  The bloat since
 FreeBSD-5 is only 70%.
 
 Here are some sizes for my standard kernel (on i386).  The newer
 versions have about the same number of features since they don't support
 so many old isa devices or so many NICs:
 
    text    data     bss     dec     hex filename
 1483269  106972  172524 1762765  1ae5cd FreeBSD-3/kernel
 1917408  157472  194228 2269108  229fb4 FreeBSD-4/kernel
 2604498  198948  237720 3041166  2e678e FreeBSD-5.1.5/kernel
 2833842  206856  242936 3283634  321ab2 FreeBSD-5.1.5/kernel-with-acpi
 2887573  192456  288696 3368725  336715 FreeBSD-5.1.5/kernel
                                         with my changes, -O2 and usb
                                         added relative to the above
 2582782  195756  298936 3077474  2ef562 previous, with some excessive
                                         inlining avoided, and without -O2,
                                         and with ipfilter
 1998276  159436  137748 2295460  2306a4 kernel.4
                                         a more up to date and less hacked on
                                         FreeBSD-4
 4365549  262656  209588 4837793  49d1a1 kernel.7
 4406155  266496  496532 5169183  4ee01f kernel.7.invariants
 3953248  242464  207252 4402964  432f14 kernel.7.noacpi
 4418063  268288  240084 4926435  4b2be3 kernel.7.smp
                                         various fairly stock FreeBSD-7R
                                         kernels
 3669544  262848  249712 4182104  3fd058 kernel.c
 4174317  258240  540144 4972701  4be09d kernel.c.invariants
 3964455  250656  249808 4464919  442117 kernel.c.noacpi
 3213928

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Alexander Best
On Sat Dec 24 11, Bruce Evans wrote:
 On Sat, 24 Dec 2011, Alexander Best wrote:
 
 On Sat Dec 24 11, Bruce Evans wrote:
 On Fri, 23 Dec 2011, Alexander Best wrote:
 ...
 the gcc(1) man page states the following:
 
 
 This extra alignment does consume extra stack space, and generally
 increases code size.  Code that is sensitive to stack space usage,
 such as embedded systems and operating system kernels, may want to
 reduce the preferred alignment to -mpreferred-stack-boundary=2.
 
 
 the comment in sys/conf/kern.mk however sorta suggests that the default
 alignment of 4 bytes might improve performance.
 
 The default stack alignment is 16 bytes, which unimproves performance.
 
 maybe the part of the comment in sys/conf/kern.mk, which mentions that a stack
 alignment of 16 bytes might improve micro benchmark results should be removed.
 this would prevent people (like me) from thinking, using a stack alignment of
 4 bytes is a compromise between size and efficiency. it isn't! currently a
 stack alignment of 16 bytes has no advantages over one with 4 bytes on i386.
 
 I think the comment is clear enough.  It mentions all the tradeoffs.
 It is only slightly cryptic in saying that these are tradeoffs and that
 the configuration is our best guess at the best tradeoff -- it just says
 "while" for both.  It goes without saying that we don't use our worst
 guess.  Anyone wanting to change this should run benchmarks and beware
 that micro-benchmarks are especially useless.  The changed comment is not
 so good since it no longer mentions micro-benchmarks or says "while".

if micro benchmark results aren't of any use, why should the claim that the
default stack alignment of 16 bytes might produce better outcome stay?

it doesn't seem as if anybody has micro benchmarked 16 bytes vs. 4 bytes stack
alignment, until now. so the micro benchmark statement in the comment seems to
be pure speculation. even worse...it indicates that by removing the
-mpreferred-stack-boundary=2 flag, one can gain a performance boost by
sacrificing a few more bytes of kernel (and module) size.

this suggests that the behavior -mpreferred-stack-boundary=2 vs. not specifying
it, loosely equals the semantics of -Os vs. -O2.

i don't see how a 4 byte stack alignment for the kernel has any tradeoffs
against the default 16 byte alignment. so if there are no tradeoffs, the
comment shouldn't imply that there are.

cheers.
alex

 
 so specifying -mpreferred-stack-boundary=2 on i386 is absolutely mandatory.
 
 Not mandatory; just an optimization.
 
 
 please see the attached patch, which also introduces a line break in order to
 describe the stack alignment issue in a paragraph of its own.
 
 There should also be an empty line for a paragraph break.
 
 % +# Explicitly prohibit the use of FPU, SSE and other SIMD operations inside the
 % +# kernel itself.  These operations are exclusively reserved for user
 % +# applications.
 
 This part was actually wronger:
 - these operations are not really reserved, but were just not supported
   in the kernel
 - they have been supported in the kernel for some time, although anything
   wanting to use the compiler to generate them would have to do something
   to kill the options added here.  Kernel code using them must inform the
   kernel that it is doing so, using fpu_kern*(9undoc), and this is
   only valid in some contexts (more or less for kernel-only threads)
   so we still prevent compilers from using them routinely.  The makefile
   is not the right place to describe any of this.
 
 Bruce


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Fri, 23 Dec 2011, Alexander Best wrote:


is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer?
i built GENERIC (including modules) with and without that flag. the results
are:


The same as it has always been.  It avoids some bloat.


1654496 bytes with the flag set
vs.
1654952 bytes with the flag unset


I don't believe this.  GENERIC is enormously bloated, so it has size
more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes
is hard to believe.  I get a savings of 9K (text) in a 5MB kernel.
Changing the default target arch from i386 to pentium-undocumented has
reduced the text space savings a little, since the default for passing
args is now to preallocate stack space for them and store to this,
instead of to push them; this preallocation results in more functions
needing to allocate some stack space explicitly, and when some is
allocated explicitly, the text space cost for this doesn't depend on
the size of the allocation.

Anyway, the savings are mostly from avoiding cache misses from
sparse allocation on stacks.

Also, FreeBSD-i386 hasn't been programmed to support aligned stacks:
- KSTACK_PAGES on i386 is 2, while on amd64 it is 4.  Using more
  stack might push something over the edge
- not much care is taken to align the initial stack or to keep the
  stack aligned in calls from asm code.  E.g., any alignment for
  mi_startup() (and thus proc0?) is accidental.  This may result
  in perfect alignment or perfect misalignment.  Hopefully, more
  care is taken with thread startup.  For gcc, the alignment is
  done bogusly in main() in userland, but there is no main() in
  the kernel.  The alignment doesn't matter much (provided the
  perfect misalignment is still to a multiple of 4), but when it
  matters, the random misalignment that results from not trying to
  do it at all is better than perfect misalignment from getting it
  wrong.  With 4-byte alignment, the only cases that it helps are
  with 64-bit variables.
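
To put the KSTACK_PAGES point in numbers, here is a small sketch.  The 4096-byte page size is the usual one on i386 and amd64; the call depth and the per-call padding passed to the second helper are purely hypothetical inputs, not measurements.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_BYTES = 4096 };	/* page size on i386 and amd64 */

/* Kernel stack size in bytes for a given KSTACK_PAGES value. */
size_t
kstack_bytes(unsigned kstack_pages)
{
	assert(kstack_pages > 0);
	return ((size_t)kstack_pages * PAGE_BYTES);
}

/* Hypothetical estimate: stack lost to alignment padding at a given call
 * depth, assuming a fixed number of padding bytes per call. */
size_t
padding_at_depth(unsigned depth, size_t pad_per_call)
{
	return ((size_t)depth * pad_per_call);
}
```

kstack_bytes(2) is 8192 bytes on i386 against kstack_bytes(4) = 16384 on amd64.  If one assumes 12 bytes of padding per call and a call chain 50 deep, padding_at_depth(50, 12) comes to 600 bytes of that 8192-byte budget.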


the gcc(1) man page states the following:


This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.


the comment in sys/conf/kern.mk however sorta suggests that the default
alignment of 4 bytes might improve performance.


The default stack alignment is 16 bytes, which unimproves performance.

clang handles stack alignment correctly (only does it when it is needed)
so it doesn't need a -mpreferred-stack-boundary option and doesn't
always break without alignment in main().  Well, at least it used to,
IIRC.  Testing it now shows that it does the necessary andl of the
stack pointer for __aligned(32), but for __aligned(16) it now assumes
that the stack is aligned by the caller.  So it now needs
-mpreferred-stack-boundary=2, but doesn't have it.  OTOH, clang doesn't
do the andl in main() like gcc does (unless you put a dummy __aligned(32)
there), but requires crt to pass an aligned stack.
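
A small test case for this kind of experiment might look like the sketch below.  FreeBSD's __aligned(x) macro from sys/cdefs.h expands to the GNU aligned attribute, which is used directly here so that the file builds outside the FreeBSD tree.  The function names are invented, and what the compiler emits for the prologue depends on its version and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the address of a local buffer modulo 32.  The buffer asks for
 * 32-byte alignment, so the compiler must realign the stack in this
 * function unless it can assume the caller already provides that. */
uintptr_t
local_misalignment32(void)
{
	volatile char buf[32] __attribute__((aligned(32)));

	buf[0] = 0;	/* keep the buffer from being optimized away */
	return ((uintptr_t)&buf[0] % 32);
}

/* Aborts if the requested alignment was not honoured. */
void
check_local_alignment(void)
{
	assert(local_misalignment32() == 0);
}
```

Compiling this with -S (and -m32 on an amd64 host, if 32-bit support is installed) and reading the prologue of local_misalignment32() should show whether the stack pointer is masked there.  Changing 32 to 16 is the case where the behaviour described above differs.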

Bruce


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Fri, 23 Dec 2011, Adrian Chadd wrote:


Well, the whole kernel is bloated at the moment, sorry.

I've been trying to build the _bare minimum_ required to bootstrap
-HEAD on these embedded boards and I can't get the kernel down below 5
megabytes - ie, one with FFS (with options disabled), MIPS, INET (no
INET6), net80211, ath (which admittedly is big, but I need it no
matter what, right?) comes in at:

-r-xr-xr-x  1 root  wheel   5307021 Nov 29 19:14 kernel.LSSR71

And with INET6, on another board (and this includes MSDOS and the
relevant geom modules):

-r-xr-xr-x  1 root  wheel   5916759 Nov 28 12:00 kernel.RSPRO

.. honestly, that's what should be addressed. That's honestly a bit ridiculous.


It's disgusting, but what problems does it cause apart from minor slowness
from cache misses?

I used to monitor the size of a minimal i386 kernel:

% machine   i386
% cpu   I686_CPU
% ident MIN
% options   SCHED_4BSD

In FreeBSD-5-CURRENT between 5.1R and 5.2R, this had size:

   text    data     bss     dec     hex filename
 931241   86524   62356 1080121  107b39 /sysc/i386/compile/min/kernel

A minimal kernel is not useful, but maybe you can add some i/o to it
without bloating it too much.

This almost builds in -current too.  I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD.9.  The kernel .mk files are
  still broken (depend on nonstandard/new features in sys.mk).
- comment out a line in if.c that refers to Vloif.  if.c is standard
  but the loop device is optional.

A few more changes to remove non-minimalities that are not defaults
made little difference:

% machine     i386
% cpu         I686_CPU
% ident       MIN
% options     SCHED_4BSD
% 
% # XXX kill default misconfigurations.
% makeoptions NO_MODULES=yes
% makeoptions COPTFLAGS=-O -pipe
% 
% # XXX from here on is to try to kill everything in DEFAULTS.
% 
% # nodevice  isa     # needed for DELAY...
% # nooptions ISAPNP  # needed ...
% 
% nodevice    npx
% 
% nodevice    mem
% nodevice    io
% 
% nodevice    uart_ns8250
% 
% nooptions   GEOM_PART_BSD
% nooptions   GEOM_PART_EBR
% nooptions   GEOM_PART_EBR_COMPAT
% nooptions   GEOM_PART_MBR
% 
% # nooptions NATIVE  # needed ...
% # nodevice  atpic   # needed ...
% 
% nooptions   NEW_PCIB
% 
% nooptions   VFS_ALLOW_NONMPSAFE


   text    data     bss     dec     hex filename
1663902  110632  136892 1911426  1d2a82 kernel

(This was about 100K larger with -O2 and all DEFAULTS).  The bloat since
FreeBSD-5 is only 70%.
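
For anyone who wants to redo the arithmetic, here is a minimal sketch in Python that computes the growth per column from the two size(1) listings of the minimal kernel above. The names OLD, NEW and growth_percent are made up for this illustration. By this calculation text and dec grew by roughly 77 to 79 percent, so the 70% above reads as a rounded figure.

```python
# Sizes copied from the two size(1) listings of the minimal i386 kernel above.
# OLD is the FreeBSD-5-CURRENT build, NEW is the -current build.
OLD = {"text": 931241, "data": 86524, "bss": 62356, "dec": 1080121}
NEW = {"text": 1663902, "data": 110632, "bss": 136892, "dec": 1911426}


def growth_percent(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for column in ("text", "data", "bss", "dec"):
        pct = growth_percent(OLD[column], NEW[column])
        print(f"{column:>4}: {pct:6.1f}%")
```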

Here are some sizes for my standard kernel (on i386).  The newer
versions have about the same number of features since they don't support
so many old isa devices or so many NICs:

   text    data     bss     dec     hex filename
1483269  106972  172524 1762765  1ae5cd FreeBSD-3/kernel
1917408  157472  194228 2269108  229fb4 FreeBSD-4/kernel
2604498  198948  237720 3041166  2e678e FreeBSD-5.1.5/kernel
2833842  206856  242936 3283634  321ab2 FreeBSD-5.1.5/kernel-with-acpi
2887573  192456  288696 3368725  336715 FreeBSD-5.1.5/kernel
                                        with my changes, -O2 and usb
                                        added relative to the above
2582782  195756  298936 3077474  2ef562 previous, with some excessive
                                        inlining avoided, and without -O2,
                                        and with ipfilter
1998276  159436  137748 2295460  2306a4 kernel.4
                                        a more up to date and less hacked on
                                        FreeBSD-4
4365549  262656  209588 4837793  49d1a1 kernel.7
4406155  266496  496532 5169183  4ee01f kernel.7.invariants
3953248  242464  207252 4402964  432f14 kernel.7.noacpi
4418063  268288  240084 4926435  4b2be3 kernel.7.smp
                                        various fairly stock FreeBSD-7R
                                        kernels
3669544  262848  249712 4182104  3fd058 kernel.c
4174317  258240  540144 4972701  4be09d kernel.c.invariants
3964455  250656  249808 4464919  442117 kernel.c.noacpi
3213928  240160  240596 3694684  38605c kernel.c.noacpi-ule
4285040  268288  286160 4839488  49d840 kernel.c.smp
                                        current before FreeBSD-8R
                                        not all built at the same time or
                                        with the same options.  The 20%
                                        bloat between kernel.c.noacpi.ule
                                        and kernel.c.noacpi is mainly
                                        from not killing the default of
                                        -O2.
4742714  315008  401692 5459414  534dd6 kernel.8
4816900  319200 1813916 6950016  6a0c80 kernel.8.invariants
4490209  304832  395260 5190301  4f329d kernel.8.noacpi
4795475  323680  475420 5594575  555dcf kernel.8.smp
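
A listing like the one above can be sanity-checked mechanically, since size(1) prints dec as text + data + bss and hex as the same total in base 16. The following is only a sketch: SAMPLE holds three rows copied from the table, and parse_size_line and is_consistent are hypothetical helper names, not part of any FreeBSD tool.

```python
# Three rows copied verbatim from the size(1) table above.
SAMPLE = """\
1483269  106972  172524 1762765  1ae5cd FreeBSD-3/kernel
1917408  157472  194228 2269108  229fb4 FreeBSD-4/kernel
4742714  315008  401692 5459414  534dd6 kernel.8
"""


def parse_size_line(line: str) -> dict:
    """Split one size(1) data row into its six columns."""
    text, data, bss, dec, hex_total, name = line.split(None, 5)
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),
        "hex": int(hex_total, 16),
        "filename": name.strip(),
    }


def is_consistent(row: dict) -> bool:
    """Check that dec and hex both equal text + data + bss."""
    total = row["text"] + row["data"] + row["bss"]
    return total == row["dec"] == row["hex"]


if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        row = parse_size_line(line)
        status = "ok" if is_consistent(row) else "MISMATCH"
        print(f"{row['filename']:<20} {status}")
```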
  

Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

On Fri, 23 Dec 2011, Alexander Best wrote:


is -mpreferred-stack-boundary=2 really necessary for i386 builds any
longer?
i built GENERIC (including modules) with and without that flag. the results
are:


The same as it has always been.  It avoids some bloat.


1654496 bytes with the flag set
vs.
1654952 bytes with the flag unset


I don't believe this.  GENERIC is enormously bloated, so it has size
more like 16MB than 1.6MB.  Even a savings of 4K instead of 456 bytes


i'm sorry. i used du(1) to get those numbers, so i believe those numbers
represent the amount of 512-byte blocks. if i'm correct GENERIC is even
more bloated than you feared and almost reaches 1GB:

807,859375  megabytes with flag set
vs.
808,0820313 megabytes without the flag set


That's certainly bloated.  It counts all object files and modules, and
probably everything is compiled with -g.  I only counted kernel text
size.
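
The conversion quoted above can be reproduced with a few lines of Python. This is a sketch that assumes, as the quoted message does, that du(1) reported 512-byte blocks; BLOCK_SIZE and blocks_to_mib are names invented for the illustration.

```python
BLOCK_SIZE = 512  # bytes per du(1) block, as assumed in the quoted message


def blocks_to_mib(blocks: int) -> float:
    """Convert a count of 512-byte blocks to megabytes (2**20 bytes)."""
    return blocks * BLOCK_SIZE / (1024 * 1024)


if __name__ == "__main__":
    with_flag = 1654496      # blocks, flag set
    without_flag = 1654952   # blocks, flag unset
    print(blocks_to_mib(with_flag), "MB with the flag")
    print(blocks_to_mib(without_flag), "MB without the flag")
    # The difference between the two builds, in kilobytes.
    print((without_flag - with_flag) * BLOCK_SIZE / 1024, "KB difference")
```

Under that assumption the two build trees differ by 456 blocks, which is 228K across all objects and modules, not 456 bytes.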

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-24 Thread Bruce Evans

On Sat, 24 Dec 2011, Alexander Best wrote:


On Sat Dec 24 11, Bruce Evans wrote:

This almost builds in -current too.  I had to add the following:
- NO_MODULES to de-bloat the compile time
- MK_CTF=no to build -current on FreeBSD-9.  The kernel .mk files are
  still broken (depend on nonstandard/new features in sys.mk).


strange. the build(7) man page claims that:


WITH_CTF  If defined, the build process will run the DTrace CTF
  conversion tools on built objects.  Please note that
  this WITH_ option is handled differently than all other
  WITH_ options (there is no WITHOUT_CTF, or correspond-
  ing MK_CTF in the build system).


... so setting MK_CTF to anything shouldn't have any effect (according to the
man page).


MK_CTF is an implementation detail.  It is normally set in bsd.own.mk
(not in sys.mk like I said -- this gives another, much larger bug (*)).
But when /usr/share/mk is old, it doesn't know anything about MK_CTF.
(For example, in FreeBSD-9, sys.mk sets NO_CTF to 1 if WITH_CTF is not
defined.  This corresponds to bsd.own.mk in -current setting MK_CTF
to no if WITH_CTF is not defined.  Go back to an older version of
FreeBSD and /usr/share/mk/* won't know anything about any CTF variable.)
So when you try to build a current kernel under an old version of
FreeBSD, MK_CTF is used uninitialized and the build fails.  (Of course,
you build kernels normally and don't use the bloated buildkernel
method.)  The bug is in the following files:

kern.post.mk:.if ${MK_CTF} != no
kern.pre.mk:.if ${MK_CTF} != no
kmod.mk:.if defined(MK_CTF) && ${MK_CTF} != no

except for the last one where it has been fixed.

(*) Well, not completely broken, but just annoyingly unportable.
Consider the following makefile:

%%%
foo: foo.c
%%%

Invoking this under FreeBSD-9 gives:

%%%
cc -O2 -pipe   foo.c  -o foo
[ -z ctfconvert -o -n 1 ] ||  (echo ctfconvert -L VERSION foo  && ctfconvert -L VERSION foo)
%%%

This is the old ctf method.  It is ugly but is fairly portable.

Invoking this under FreeBSD-9 but with -mpath-to-current-mk-directory gives

%%%
cc -O2 -pipe   foo.c  -o foo
${CTFCONVERT_CMD} expands to empty string
%%%

This is because:
- the rule in sys.mk says ${CTFCONVERT_CMD}
- CTFCONVERT_CMD is normally defined in bsd.own.mk.  But bsd.own.mk is only
  included by BSD makefiles.  It is never included by portable makefiles.
  So ${CTFCONVERT_CMD} is used uninitialized.
- for some reason, using variables uninitialized is not fatal in this
  context, although it is for the comparisons of ${MK_CTF} above.
- ${CTFCONVERT_CMD} is replaced by the empty string.  Old versions of
  make warn about the use of an empty string as a shell command.
- the code that is supposed to prevent the previous warning is in
  bsd.own.mk, where it is not reached for portable makefiles.  It is:

% .if ${MK_CTF} != no
% CTFCONVERT_CMD=   ${CTFCONVERT} ${CTFFLAGS} ${.TARGET}

This uses the full ctfconvert if WITH_CTF.

% .elif ${MAKE_VERSION} >= 520300
% CTFCONVERT_CMD=

make(1) has been modified to not complain about the empty string.  The
version test detects which versions of make don't complain.

% .else
% CTFCONVERT_CMD=   @:

The default is to generate this non-empty string and an extra shell command
to execute it, for old versions of make.

% .endif

But none of this works for portable makefiles, since it is not reached.
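
To make the failure mode easier to follow, here is a small Python model of the logic described above. It is only a sketch and not make(1) itself: variables are a plain dict, and expand, ctf_enabled_strict, ctf_enabled_guarded and ctfconvert_cmd are hypothetical names. The make version test is reduced to a boolean argument, since the exact version number is not the point.

```python
def expand(name: str, variables: dict) -> str:
    """Model lenient expansion in a rule: an undefined variable becomes ''."""
    return variables.get(name, "")


def ctf_enabled_strict(variables: dict) -> bool:
    """Model `.if ${MK_CTF} != no`, which is fatal when MK_CTF is undefined."""
    if "MK_CTF" not in variables:
        raise ValueError("MK_CTF used uninitialized in a conditional")
    return variables["MK_CTF"] != "no"


def ctf_enabled_guarded(variables: dict) -> bool:
    """Model `.if defined(MK_CTF) && ${MK_CTF} != no`, the fixed form."""
    return "MK_CTF" in variables and variables["MK_CTF"] != "no"


def ctfconvert_cmd(mk_ctf: str, make_tolerates_empty: bool) -> str:
    """Model the three-way choice of CTFCONVERT_CMD in bsd.own.mk."""
    if mk_ctf != "no":
        return "${CTFCONVERT} ${CTFFLAGS} ${.TARGET}"  # full ctfconvert
    if make_tolerates_empty:
        return ""      # newer make: an empty command is silently skipped
    return "@:"        # older make: a no-op shell command instead


if __name__ == "__main__":
    old_mk = {}  # old /usr/share/mk: nothing ever sets MK_CTF
    print("guarded test:", ctf_enabled_guarded(old_mk))
    try:
        ctf_enabled_strict(old_mk)
    except ValueError as err:
        print("strict test fails:", err)
    # A portable makefile never includes bsd.own.mk, so the rule in
    # sys.mk sees CTFCONVERT_CMD undefined and expands it to nothing.
    print("rule command:", repr(expand("CTFCONVERT_CMD", old_mk)))
```

The guarded form matches what kmod.mk already does, and the last line shows why the fallback in bsd.own.mk never helps a portable makefile.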

Bruce


[rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-23 Thread Alexander Best
hi there,

is -mpreferred-stack-boundary=2 really necessary for i386 builds any longer?
i built GENERIC (including modules) with and without that flag. the results
are:

1654496 bytes with the flag set
vs.
1654952 bytes with the flag unset

the gcc(1) man page states the following:


This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.


the comment in sys/conf/kern.mk however sorta suggests that the default
alignment of 4 bytes might improve performance.

cheers.
alex


Re: [rfc] removing -mpreferred-stack-boundary=2 flag for i386?

2011-12-23 Thread Adrian Chadd
Well, the whole kernel is bloated at the moment, sorry.

I've been trying to build the _bare minimum_ required to bootstrap
-HEAD on these embedded boards and I can't get the kernel down below 5
megabytes - ie, one with FFS (with options disabled), MIPS, INET (no
INET6), net80211, ath (which admittedly is big, but I need it no
matter what, right?) comes in at:

-r-xr-xr-x  1 root  wheel   5307021 Nov 29 19:14 kernel.LSSR71

And with INET6, on another board (and this includes MSDOS and the
relevant geom modules):

-r-xr-xr-x  1 root  wheel   5916759 Nov 28 12:00 kernel.RSPRO

.. honestly, that's what should be addressed. That's honestly a bit ridiculous.

2c,



Adrian