On 18.09.2013 22:08, Edd Barrett wrote:
A few weeks back (at a PyPy sprint) someone asked me why amd64/OpenBSD
has no assembler implementation of memset(3). After asking on icb, there
were a couple of theories:
a) Perhaps the available assembler implementations of memset are slower
than our C one.
b) Perhaps due to a), no-one got round to it.
It turns out that (on the systems I benchmarked on), FreeBSD's memset.S
is faster than our memset.c in libc. Those interested can see the results
(including graphs) of some benchmarks comparing FreeBSD memset.S and our
memset.c here: https://github.com/vext01/openbsd-libc-benchmarks
In short, each experiment warms up by setting and checking a number of buffers
before setting as many buffers as possible within a one-minute timeframe. The
experiments were run with varying buffer sizes under both memset.S and
memset.c. During experimentation, the machines were otherwise idle. Although
the results vary from system to system, it seems that memset.S is between 6
and 30 times faster. The results also show that there was no case (that we
tested) where memset.c was faster than memset.S.
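For readers who want a feel for the method without opening the repository, here is a rough sketch of that kind of measurement. It is not the harness from the benchmarks repository: the function name count_memsets, the fill pattern, and the batch size of 256 calls per clock check are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: warm up by setting and checking one buffer of
 * `size` bytes, then count how many memset() calls on it complete
 * within roughly `seconds` of wall-clock time.
 */
static unsigned long
count_memsets(size_t size, double seconds)
{
	/* Call through a volatile pointer so the compiler cannot drop the stores. */
	void *(*volatile fill)(void *, int, size_t) = memset;
	struct timespec start, now;
	unsigned long n = 0;
	unsigned char *buf;
	double elapsed;
	size_t i;

	assert(size > 0 && seconds > 0);
	if ((buf = malloc(size)) == NULL)
		return 0;

	/* Warm-up: set the buffer and check that every byte was written. */
	fill(buf, 0xa5, size);
	for (i = 0; i < size; i++) {
		if (buf[i] != 0xa5) {
			free(buf);
			return 0;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		/* Batch the calls so that reading the clock does not dominate. */
		for (i = 0; i < 256; i++)
			fill(buf, (int)((n + i) & 0xff), size);
		n += 256;
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed = (double)(now.tv_sec - start.tv_sec) +
		    (double)(now.tv_nsec - start.tv_nsec) / 1e9;
	} while (elapsed < seconds);

	free(buf);
	return n;
}
```

Calling count_memsets(size, 60.0) for each buffer size, once per libc build, gives per-size counts that can be compared between memset.c and memset.S.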
The following diff enables memset.S in libc on amd64.
* Is what I have done with the vendor keywords acceptable? (I moved them
to the top, preserving their order, and removed __FBSDID.)
* I removed the non-executable stack hint as I don't see anything
similar in other .S files in-tree.
* I don't think any library bump is needed. Can someone confirm this?
I have run with this diff for the last week or so with no issues. I have
been running some heavy compilation tasks during this time (building
If people think this kind of work is worthwhile, then there are some
other routines we could borrow from the other BSDs too.
RCS file: /cvs/src/lib/libc/arch/amd64/string/Makefile.inc,v
retrieving revision 1.4
diff -u -p -r1.4 Makefile.inc
--- lib/libc/arch/amd64/string/Makefile.inc 4 Sep 2012 03:10:42 -
+++ lib/libc/arch/amd64/string/Makefile.inc 18 Sep 2013 17:05:10 -
@@ -3,4 +3,4 @@
SRCS+= bcmp.c ffs.S index.c memchr.c memcmp.c bcopy.c bzero.c \
rindex.c strcat.c strcmp.c strcpy.c strcspn.c strlen.c \
strncat.c strncmp.c strncpy.c strpbrk.c strsep.c \
-strspn.c strstr.c swab.c memset.c strlcpy.c strlcat.c
+strspn.c strstr.c swab.c memset.S strlcpy.c strlcat.c
RCS file: lib/libc/arch/amd64/string/memset.S
diff -N lib/libc/arch/amd64/string/memset.S
--- /dev/null 1 Jan 1970 00:00:00 -
+++ lib/libc/arch/amd64/string/memset.S 18 Sep 2013 17:05:10 -
@@ -0,0 +1,58 @@
+/* $OpenBSD$ */
+/* FreeBSD revision: 217106 */
+/* $NetBSD: memset.S,v 1.3 2004/02/26 20:50:06 drochner Exp $ */
+ * Written by J.T. Conklin j...@netbsd.org.
+ * Public domain.
+ * Adapted for NetBSD/x86_64 by Frank van der Linden f...@wasabisystems.com
+ cld /* set fill direction forward */
+* if the string is too short, it's really not worth the overhead
+* of aligning to word boundries, etc. So we jump to a plain
+* unaligned set.
+ jle L1
+ movb %al,%ah /* copy char to all bytes in word */
+ orl %edx,%eax
+ orq %rdx,%rax
+ movq %rdi,%rdx /* compute misalignment */
+ movq %rdx,%rcx /* set until word aligned */
+ shrq $3,%rcx /* set by words */
+ movq %r8,%rcx /* set remainder by bytes */
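The comments in the quoted assembler outline a common fill strategy: short buffers are set byte by byte, while longer ones have the fill byte copied into every byte of a word, are set bytewise up to a word boundary, then by words, then bytewise for the remainder. A rough C sketch of that idea follows. It is not a translation of memset.S: the name sketch_memset and the SHORT_FILL cutoff are assumptions, and memcpy() stands in for the word store to keep the C portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHORT_FILL 16	/* assumed cutoff, not taken from memset.S */

/* Hypothetical illustration of the strategy in the comments above. */
static void *
sketch_memset(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	uint64_t word = (unsigned char)c;

	assert(dst != NULL || len == 0);

	if (len >= SHORT_FILL) {
		/* copy char to all bytes in word */
		word |= word << 8;
		word |= word << 16;
		word |= word << 32;

		/* set until word aligned (at most 7 bytes) */
		while (((uintptr_t)p & 7) != 0) {
			*p++ = (unsigned char)c;
			len--;
		}

		/* set by words */
		for (; len >= 8; len -= 8, p += 8)
			memcpy(p, &word, sizeof(word));
	}

	/* set remainder by bytes (or everything, for short buffers) */
	while (len-- > 0)
		*p++ = (unsigned char)c;

	return dst;
}
```

The alignment loop cannot underflow len, because it writes at most 7 bytes and only runs when len is at least SHORT_FILL.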
If you end up committing this, it would be nice to fix the spelling of