On Sat, 9 Sep 2017, David Chisnall wrote:
On 8 Sep 2017, at 21:09, Mateusz Guzik <m...@freebsd.org> wrote:
Author: mjg
Date: Fri Sep 8 20:09:14 2017
New Revision: 323329
URL: https://svnweb.freebsd.org/changeset/base/323329
Log:
Allow __builtin_memset instead of bzero for small buffers of known size
This change seems redundant, because modern compilers already do this
optimisation. For example:
#include <strings.h>
char buf[42];
void bz(void)
{
bzero(buf, 42);
}
With clang 4.0 on x86 compiles to:
pushq %rbp
movq %rsp, %rbp
xorps %xmm0, %xmm0
movups %xmm0, buf+26(%rip)
movaps %xmm0, buf+16(%rip)
movaps %xmm0, buf(%rip)
popq %rbp
retq
This is only a valid optimization for POSIX code. Compiling with -std=c99
turns it off, even though the non-C99 <strings.h> is included.
Note that -std is broken in CFLAGS for the kernel. It is c99 (actually
iso9899:1999). Userland uses the correct standard gnu99. -std=c99 breaks
other gnu features like unnamed struct/unions which are then unbroken
bogusly using -fms-extensions.
Neither contains a call, both have inlined the zeroing. This change is
strictly worse, because the compiler has some carefully tuned heuristics that
are set per target for when to inline the memset / bzero and when to call the
function. These are based on both the size and the alignment, including
whether the target supports misaligned accesses and whether misaligned accesses
are cheap. None of this is captured by this change.
It is strictly better than plain bzero() in the kernel, and makes no
difference in userland.
In the kernel, this optimisation is disabled by -ffreestanding, however
__builtin_memset will be turned into a memset call if the size is not constant
or if the memset call would be more efficient (as determined by the
aforementioned heuristics). Simply using __builtin_memset in all cases should
give better code, and is more likely to be forward compatible with future ISAs
where the arbitrary constant picked in this patch may or may not be optimal.
Compilers don't really understand this. In the kernel, extern memset()
is strictly worse than extern bzero(), since memset() is just a wrapper
around bzero(). Thus simply using __builtin_memset() gives worse
strictly code in all cases that are not inlined. Not a large amount
worse since the wrapper only does an extra branch and function call
when it reduces to bzero(). memset() is just slow in other cases.
The constant is not arbitrary, but is hard-coded to the x86 value which is
more arbitrary. x86 compilers know that they don't really understand memory
so they hard-code a conservative value of 64. It is only possible do better
by compiling for a specific arch and knowing how good or bad the extern
function is. 64 is about the smallest value above which the extern function
could possibly do significantly better by knowing or doing more. Usually it
does slightly worse near 64 and also higher since it knows less, especially
if the compilation was for a specific arch (clang seems to have even more
methods than my benchmarks for this -- hundreds instead of tens --, while
x86 kernels have only one method optimized for 25-year old arches and now
OK again on new arches with fast string instructions (above a few hundred
bytes).
Bruce
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"