Re: [PATCH 1/1]: MINOR __builtin_memcpy_inline usage introduction

2022-06-20 Thread David CARLIER
On Mon, 20 Jun 2022 at 07:27, Willy Tarreau  wrote:
>
> Hi David,
>
> On Sat, Jun 18, 2022 at 12:52:23PM +0100, David CARLIER wrote:
> > From 9d7b6448a2407451c3115b701c51f97ab2bf6a59 Mon Sep 17 00:00:00 2001
> > From: David Carlier 
> > Date: Sat, 18 Jun 2022 12:41:11 +0100
> > Subject: [PATCH] MINOR: compiler __builtin_memcpy_inline usage introduction.
> >
> > Optimised version of the existing __builtin_memcpy builtin, differences
> > reside by the fact it works only with constant time sizes and does
> >  generate extra calls.
> > At the moment, is clang exclusive even tough GCC does not seem to
> >  want to implement it, the comments try not to reflect this current
> > state.
> > Usage can be expanded, used purposely only in few places for starter.
> > ---
> >  include/haproxy/compiler.h | 12 
> >  src/haproxy.c  |  4 ++--
> >  src/log.c  |  2 +-
> >  src/proxy.c|  2 +-
> >  4 files changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/haproxy/compiler.h b/include/haproxy/compiler.h
> > index 66b5f5835..dca9f6aef 100644
> > --- a/include/haproxy/compiler.h
> > +++ b/include/haproxy/compiler.h
> > @@ -192,6 +192,18 @@
> >  #endif
> >  #endif
> >
> > +/*
> > + * This builtin works only with compile time lengths unlike 
> > __builtin_memcpy.
> > + * Also guarantees no external call is generated.
> > + * We could `replicate` the aforementioned constraint with a
> > + * _Static_assert/__builtin_constant_p theoretically, but that might be
> > + * too much trouble to make it happens (C11 min)
> > + * so here we trust the proper usage with other compilers (and the CI 
> > infrastructure).
> > + */
> > +#if !defined(__clang__) || __clang_major__ < 11
> > +#define __builtin_memcpy_inline(x, y, s) memcpy(x, y, s)
> > +#endif
> > +
>
> Hmmm please no, this is not the right way to do it for two reasons:
>   - it will encourage use of __builtin_memcpy_inline() making one believe
> it's the expected one when it isn't ;
>
>   - developing on non-clang systems will not report the problem with the
> non-constant size that __builtin_memcpy_inline() expects, resulting
> in breakage from time to time.
>
> I'd suggest that instead you create a macro named ha_memcpy_inline() that
> maps to __builtin_memcpy_inline() when present and s is a builtin constant,
> or maps to __builtin_memcpy() for the rest. Please note, by the way, that
> gcc since at least 3.4 has been emitting the same instructions (rep movsb)
> as clang's __builtin_memcpy_inline(), but only when optimizing for size.
> When optimizing for anything else, it can emit ton of inlined garbage^Wvector
> instructions.
>
> Thus I suspect we could achieve the same result by doing a clever
> arrangement of #pragma GCC push_options / #pragma GCC optimize("Os,inline")
> around an inline function definition that does the memcpy, and that is
> called by the macro. I have not tried, but probably something like this
> would do it:
>
> #pragma GCC push_options
> #pragma GCC optimize("Os,inline")
> static inline void _ha_memcpy_inline(void *restrict dest, const void 
> *restrict src, size_t n)
> {
> __builtin_memcpy(dest, src, n);
> }
> #pragma GCC pop_options
>
> #if defined(clang...)
> #define ha_memcpy_inline(d,s,n) do { if (__builtin_constant_p(n)) 
> __builtin_memcpy_inline(d,s,n); else _ha_memcpy_inline(d,s,n); } while (0)
> #else
> #define ha_memcpy_inline(d,s,n) _ha_memcpy_inline(d,s,n)
> #endif
>
> That's not even compile-tested and I haven't looked at the result, but
> you get the idea.

Alright thanks here a newer version.
>
> Now regarding where to use it... I'd say it depends on a lot of factors,
> size, speed, undesired dependency on external libs. I think that each and
> every call place does warrant an individual commit with a justification
> and a check that the benefit matches expectations. There could be some
> value in using this in various low-level protocol layers (QUIC, HTX, SPOE).

Exactly.

> Maybe more, I don't know.

Over time its value will display more.

>
> Thanks,
> Willy
From 90c951d29c32a345cf38d6c0b43ee904f12b8ac7 Mon Sep 17 00:00:00 2001
From: David Carlier 
Date: Sat, 18 Jun 2022 12:41:11 +0100
Subject: [PATCH] MINOR: compiler __builtin_memcpy_inline usage introduction.

Optimised version of the existing __builtin_memcpy builtin, differences
reside by the fact it works only with constant time sizes and does
 not generate extra calls.
At the moment, is clang exclusive even tough GCC does not seem to
 want to implement it, the comments try not to reflect this current
state.
Usage can be expanded, used purposely only in few places for starter.
---
 include/haproxy/compiler.h | 32 
 src/haproxy.c  |  4 ++--
 src/log.c  |  2 +-
 src/proxy.c|  2 +-
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/haproxy/compiler.h b/include/haproxy/compiler.h
index 

Re: [PATCH 1/1]: MINOR __builtin_memcpy_inline usage introduction

2022-06-20 Thread Willy Tarreau
Hi David,

On Sat, Jun 18, 2022 at 12:52:23PM +0100, David CARLIER wrote:
> From 9d7b6448a2407451c3115b701c51f97ab2bf6a59 Mon Sep 17 00:00:00 2001
> From: David Carlier 
> Date: Sat, 18 Jun 2022 12:41:11 +0100
> Subject: [PATCH] MINOR: compiler __builtin_memcpy_inline usage introduction.
> 
> Optimised version of the existing __builtin_memcpy builtin, differences
> reside by the fact it works only with constant time sizes and does
>  generate extra calls.
> At the moment, is clang exclusive even tough GCC does not seem to
>  want to implement it, the comments try not to reflect this current
> state.
> Usage can be expanded, used purposely only in few places for starter.
> ---
>  include/haproxy/compiler.h | 12 
>  src/haproxy.c  |  4 ++--
>  src/log.c  |  2 +-
>  src/proxy.c|  2 +-
>  4 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/haproxy/compiler.h b/include/haproxy/compiler.h
> index 66b5f5835..dca9f6aef 100644
> --- a/include/haproxy/compiler.h
> +++ b/include/haproxy/compiler.h
> @@ -192,6 +192,18 @@
>  #endif
>  #endif
>  
> +/*
> + * This builtin works only with compile time lengths unlike __builtin_memcpy.
> + * Also guarantees no external call is generated.
> + * We could `replicate` the aforementioned constraint with a
> + * _Static_assert/__builtin_constant_p theoretically, but that might be
> + * too much trouble to make it happens (C11 min)
> + * so here we trust the proper usage with other compilers (and the CI 
> infrastructure).
> + */
> +#if !defined(__clang__) || __clang_major__ < 11
> +#define __builtin_memcpy_inline(x, y, s) memcpy(x, y, s)
> +#endif
> +

Hmmm please no, this is not the right way to do it for two reasons:
  - it will encourage use of __builtin_memcpy_inline() making one believe
it's the expected one when it isn't ;

  - developing on non-clang systems will not report the problem with the
non-constant size that __builtin_memcpy_inline() expects, resulting
in breakage from time to time.

I'd suggest that instead you create a macro named ha_memcpy_inline() that
maps to __builtin_memcpy_inline() when present and s is a builtin constant,
or maps to __builtin_memcpy() for the rest. Please note, by the way, that
gcc since at least 3.4 has been emitting the same instructions (rep movsb)
as clang's __builtin_memcpy_inline(), but only when optimizing for size.
When optimizing for anything else, it can emit ton of inlined garbage^Wvector
instructions.

Thus I suspect we could achieve the same result by doing a clever
arrangement of #pragma GCC push_options / #pragma GCC optimize("Os,inline")
around an inline function definition that does the memcpy, and that is
called by the macro. I have not tried, but probably something like this
would do it:

#pragma GCC push_options
#pragma GCC optimize("Os,inline")
static inline void _ha_memcpy_inline(void *restrict dest, const void *restrict 
src, size_t n)
{
__builtin_memcpy(dest, src, n);
}
#pragma GCC pop_options

#if defined(clang...)
#define ha_memcpy_inline(d,s,n) do { if (__builtin_constant_p(n)) 
__builtin_memcpy_inline(d,s,n); else _ha_memcpy_inline(d,s,n); } while (0)
#else
#define ha_memcpy_inline(d,s,n) _ha_memcpy_inline(d,s,n)
#endif

That's not even compile-tested and I haven't looked at the result, but
you get the idea.

Now regarding where to use it... I'd say it depends on a lot of factors,
size, speed, undesired dependency on external libs. I think that each and
every call place does warrant an individual commit with a justification
and a check that the benefit matches expectations. There could be some
value in using this in various low-level protocol layers (QUIC, HTX, SPOE).
Maybe more, I don't know.

Thanks,
Willy



[PATCH 1/1]: MINOR __builtin_memcpy_inline usage introduction

2022-06-18 Thread David CARLIER
Hi

here a little proposal to use this specific clang builtin, only placed
in few places but over time can be expanded cautiously.

Kindest regards.

Thanks.
From 9d7b6448a2407451c3115b701c51f97ab2bf6a59 Mon Sep 17 00:00:00 2001
From: David Carlier 
Date: Sat, 18 Jun 2022 12:41:11 +0100
Subject: [PATCH] MINOR: compiler __builtin_memcpy_inline usage introduction.

Optimised version of the existing __builtin_memcpy builtin, differences
reside by the fact it works only with constant time sizes and does
 generate extra calls.
At the moment, is clang exclusive even tough GCC does not seem to
 want to implement it, the comments try not to reflect this current
state.
Usage can be expanded, used purposely only in few places for starter.
---
 include/haproxy/compiler.h | 12 
 src/haproxy.c  |  4 ++--
 src/log.c  |  2 +-
 src/proxy.c|  2 +-
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/haproxy/compiler.h b/include/haproxy/compiler.h
index 66b5f5835..dca9f6aef 100644
--- a/include/haproxy/compiler.h
+++ b/include/haproxy/compiler.h
@@ -192,6 +192,18 @@
 #endif
 #endif
 
+/*
+ * This builtin works only with compile time lengths unlike __builtin_memcpy.
+ * Also guarantees no external call is generated.
+ * We could `replicate` the aforementioned constraint with a
+ * _Static_assert/__builtin_constant_p theoretically, but that might be
+ * too much trouble to make it happens (C11 min)
+ * so here we trust the proper usage with other compilers (and the CI infrastructure).
+ */
+#if !defined(__clang__) || __clang_major__ < 11
+#define __builtin_memcpy_inline(x, y, s) memcpy(x, y, s)
+#endif
+
 /* Linux-like "container_of". It returns a pointer to the structure of type
  *  which has its member  stored at address .
  */
diff --git a/src/haproxy.c b/src/haproxy.c
index 32eb4bdc6..1464d29ee 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -1265,11 +1265,11 @@ static void ha_random_boot(char *const *argv)
 
 	/* stack address (benefit form operating system's ASLR) */
 	l = (unsigned long)
-	memcpy(m, , sizeof(l)); m += sizeof(l);
+	__builtin_memcpy_inline(m, , sizeof(l)); m += sizeof(l);
 
 	/* argv address (benefit form operating system's ASLR) */
 	l = (unsigned long)
-	memcpy(m, , sizeof(l)); m += sizeof(l);
+	__builtin_memcpy_inline(m, , sizeof(l)); m += sizeof(l);
 
 	/* use tv_usec again after all the operations above */
 	gettimeofday(, NULL);
diff --git a/src/log.c b/src/log.c
index 304e3cb68..a631b1e46 100644
--- a/src/log.c
+++ b/src/log.c
@@ -808,7 +808,7 @@ int parse_logsrv(char **args, struct list *logsrvs, int do_del, const char *file
 			}
 
 			node = malloc(sizeof(*node));
-			memcpy(node, logsrv, sizeof(struct logsrv));
+			__builtin_memcpy_inline(node, logsrv, sizeof(struct logsrv));
 			node->ref = logsrv;
 			LIST_INIT(>list);
 			LIST_APPEND(logsrvs, >list);
diff --git a/src/proxy.c b/src/proxy.c
index c10a8cf83..7d02f8c19 100644
--- a/src/proxy.c
+++ b/src/proxy.c
@@ -1818,7 +1818,7 @@ static int proxy_defproxy_cpy(struct proxy *curproxy, const struct proxy *defpro
 			memprintf(errmsg, "proxy '%s': out of memory", curproxy->id);
 			return 1;
 		}
-		memcpy(node, tmplogsrv, sizeof(struct logsrv));
+		__builtin_memcpy_inline(node, tmplogsrv, sizeof(struct logsrv));
 		node->ref = tmplogsrv->ref;
 		LIST_INIT(>list);
 		LIST_APPEND(>logsrvs, >list);
-- 
2.34.1