Re: Popcount optimization using AVX512

2024-04-23 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote:
> Makes sense, thanks.  I'm planning to commit this fix sometime early next
> week.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 10:11:08PM +, Devulapalli, Raghuveer wrote:
>> On that note, is it necessary to also check for avx512f?  At the moment,
>> we are assuming that's supported if the other AVX-512 instructions are
>> available.
> 
> No, it's not needed. There are no CPUs with avx512bw/avx512popcnt
> without avx512f.  Unfortunately, though, avx512popcnt does not imply
> avx512bw (I think the deprecated Xeon Phi processors fall into this
> category), which is why we need both.

Makes sense, thanks.  I'm planning to commit this fix sometime early next
week.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> On that note, is it necessary to also check for avx512f?  At the moment, we 
> are assuming that's supported if the other AVX-512 instructions are available.

No, it's not needed. There are no CPUs with avx512bw/avx512popcnt without 
avx512f.  Unfortunately, though, avx512popcnt does not imply avx512bw (I think 
the deprecated Xeon Phi processors fall into this category), which is why we 
need both. 




Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 09:29:55PM +, Devulapalli, Raghuveer wrote:
> (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise
> zmm_regs_available() will return false.

Yes, that's a mistake.  I fixed that in v3.

> (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the
> same cpuid leaf. You could combine them into one to avoid running cpuid
> twice. My apologies, I should have mentioned this before.

Good call.  The byte-and-word instructions were a late addition to the
patch, so I missed this originally.

On that note, is it necessary to also check for avx512f?  At the moment, we
are assuming that's supported if the other AVX-512 instructions are
available.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From e04c348eb389c6aa1597ac35d57b5e7ae7075381 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 18 Apr 2024 15:57:56 -0500
Subject: [PATCH v3 1/1] osxsave

---
 src/port/pg_popcount_avx512_choose.c | 80 
 1 file changed, 47 insertions(+), 33 deletions(-)

diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..b37107803a 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -34,39 +34,13 @@
 #ifdef TRY_POPCNT_FAST
 
 /*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
+ * Does CPUID say there's support for XSAVE instructions?
  */
-bool
-pg_popcount_avx512_available(void)
+static inline bool
+xsave_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-	/* Does CPUID say there's support for AVX-512 popcount instructions? */
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	if ((exx[2] & (1 << 14)) == 0)	/* avx512-vpopcntdq */
-		return false;
-
-	/* Does CPUID say there's support for AVX-512 byte and word instructions? */
-	memset(exx, 0, sizeof(exx));
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	if ((exx[1] & (1 << 30)) == 0)	/* avx512-bw */
-		return false;
-
-	/* Does CPUID say there's support for XSAVE instructions? */
-	memset(exx, 0, sizeof(exx));
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
@@ -74,15 +48,55 @@ pg_popcount_avx512_available(void)
 #else
 #error cpuid instruction not available
 #endif
-	if ((exx[2] & (1 << 26)) == 0)	/* xsave */
-		return false;
+	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
+}
 
-	/* Does XGETBV say the ZMM registers are enabled? */
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
 #ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe0) != 0;
+	return (_xgetbv(0) & 0xe6) == 0xe6;
 #else
 	return false;
 #endif
 }
 
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+	return xsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
 #endif			/* TRY_POPCNT_FAST */
-- 
2.25.1



RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> Thanks for the feedback.  I've attached an updated patch.

(1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise 
zmm_regs_available() will return false. 
(2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same 
cpuid leaf. You could combine them into one to avoid running cpuid twice. My 
apologies, I should have mentioned this before. 




Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 08:24:03PM +, Devulapalli, Raghuveer wrote:
>> This seems to contradict the note about doing step 3 at any point, and
>> given step 1 is the OSXSAVE check, I'm not following what this means,
>> anyway.
> 
> It is recommended that you run the xgetbv code before you check for cpu
> features avx512-popcnt and avx512-bw. The way it is written now is the
> opposite order. I would also recommend splitting the cpuid feature check
> for avx512popcnt/avx512bw and xgetbv section into separate functions to
> make them modular. Something like:
> 
> static inline
> int check_os_avx512_support(void)
> {
> // (1) run cpuid leaf 1 to check for xgetbv instruction support:
> unsigned int exx[4] = {0, 0, 0, 0};
> __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
> if ((exx[2] & (1 << 27)) == 0)  /* xsave */
> return false;
> 
> /* Does XGETBV say the ZMM/YMM/XMM registers are enabled? */
> return (_xgetbv(0) & 0xe0) == 0xe0;
> }
> 
>> I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6
>> instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower
>> half of some of the ZMM registers is stored in the SSE and AVX state
>> [0].  I don't know how likely it is that 0xe0 would succeed but 0xe6
>> wouldn't, but we might as well make it correct.
> 
> This is correct. It needs to check all three bits (XMM, YMM, and ZMM). The
> way it is written now is incorrect.

Thanks for the feedback.  I've attached an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From d20b19804a17d9f6eab1d40de7e9fb10488ac6b0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 18 Apr 2024 15:57:56 -0500
Subject: [PATCH v2 1/1] osxsave

---
 src/port/pg_popcount_avx512_choose.c | 89 +++-
 1 file changed, 60 insertions(+), 29 deletions(-)

diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..009f94909a 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -34,27 +34,47 @@
 #ifdef TRY_POPCNT_FAST
 
 /*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
+ * Does CPUID say there's support for XSAVE instructions?
  */
-bool
-pg_popcount_avx512_available(void)
+static inline bool
+xsave_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-	/* Does CPUID say there's support for AVX-512 popcount instructions? */
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 1);
 #else
 #error cpuid instruction not available
 #endif
-	if ((exx[2] & (1 << 14)) == 0)	/* avx512-vpopcntdq */
-		return false;
+	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) != 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
 
-	/* Does CPUID say there's support for AVX-512 byte and word instructions? */
-	memset(exx, 0, sizeof(exx));
 #if defined(HAVE__GET_CPUID_COUNT)
 	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUIDEX)
@@ -62,27 +82,38 @@ pg_popcount_avx512_available(void)
 #else
 #error cpuid instruction not available
 #endif
-	if ((exx[1] & (1 << 30)) == 0)	/* avx512-bw */
-		return false;
+	return (exx[2] & (1 << 14)) != 0;	/* avx512-vpopcntdq */
+}
 
-	/* Does CPUID say there's support for XSAVE instructions? */
-	memset(exx, 0, sizeof(exx));
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
+/*
+ * Does CPUID say there's support for AVX-512 byte and word instructions?
+ */
+static inline bool
+avx512_bw_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, 7, 0);
 #else
 #error cpuid instruction not available
 #endif
-	if ((exx[2] & (1 << 26)) == 0)	/* xsave */
-		return false;
+	return (exx[1] & (1 << 30)) != 0;	/* avx512-bw */
+}
 
-	/* Does XGETBV say the ZMM registers are enabled? */
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe0) != 0;
-#else
-	return false;
-#endif
+/*
+ * 

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 06:12:22PM +, Shankaran, Akash wrote:
> Good find. I confirmed this after speaking with an Intel expert and by
> checking the Intel AVX-512 manual [0], section 14.3, which recommends
> checking bit 27. From the manual:
> 
> "Prior to using Intel AVX, the application must identify that the operating 
> system supports the XGETBV instruction,
> the YMM register state, in addition to processor's support for YMM state 
> management using XSAVE/XRSTOR and
> AVX instructions. The following simplified sequence accomplishes both and is 
> strongly recommended.
> 1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application 
> use).
> 2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state 
> are enabled by OS).
> 3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported).
> (Step 3 can be done in any order relative to 1 and 2.)"

Thanks for confirming.  IIUC my patch should be sufficient, then.

> It also seems that step 1 and step 2 need to be done prior to the CPUID 
> OSXSAVE check in the popcount code.

This seems to contradict the note about doing step 3 at any point, and
given step 1 is the OSXSAVE check, I'm not following what this means,
anyway.

I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6
instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower half
of some of the ZMM registers is stored in the SSE and AVX state [0].  I
don't know how likely it is that 0xe0 would succeed but 0xe6 wouldn't, but
we might as well make it correct.

[0] https://en.wikipedia.org/wiki/Control_register#cite_ref-23

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-04-18 Thread Shankaran, Akash
> It was brought to my attention [0] that we probably should be checking for 
> the OSXSAVE bit instead of the XSAVE bit when determining whether there's 
> support for the XGETBV instruction.  IIUC that should indicate that both the 
> OS and the processor have XGETBV support (not just the processor).
> I've attached a one-line patch to fix this.

> [0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463

Good find. I confirmed this after speaking with an Intel expert and by checking 
the Intel AVX-512 manual [0], section 14.3, which recommends checking bit 27. 
From the manual:

"Prior to using Intel AVX, the application must identify that the operating 
system supports the XGETBV instruction,
the YMM register state, in addition to processor's support for YMM state 
management using XSAVE/XRSTOR and
AVX instructions. The following simplified sequence accomplishes both and is 
strongly recommended.
1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use).
2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state are 
enabled by OS).
3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported).
(Step 3 can be done in any order relative to 1 and 2.)"

It also seems that step 1 and step 2 need to be done prior to the CPUID OSXSAVE 
check in the popcount code.

[0]: https://cdrdv2.intel.com/v1/dl/getContent/671200

- Akash Shankaran





Re: Popcount optimization using AVX512

2024-04-17 Thread Nathan Bossart
It was brought to my attention [0] that we probably should be checking for
the OSXSAVE bit instead of the XSAVE bit when determining whether there's
support for the XGETBV instruction.  IIUC that should indicate that both
the OS and the processor have XGETBV support (not just the processor).
I've attached a one-line patch to fix this.

[0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..cc3e89e096 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -74,7 +74,7 @@ pg_popcount_avx512_available(void)
 #else
 #error cpuid instruction not available
 #endif
-	if ((exx[2] & (1 << 26)) == 0)	/* xsave */
+	if ((exx[2] & (1 << 27)) == 0)	/* osxsave */
 		return false;
 
 	/* Does XGETBV say the ZMM registers are enabled? */


Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart  writes:
> On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
>> The Intel documentation for _mm256_undefined_si256() [0]
>> indicates that it is intended to return "undefined elements," so it seems
>> like the use of an uninitialized variable might be intentional.

> See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.

Ah, interesting.  That hasn't propagated to stable distros yet,
evidently (and even when it does, I wonder how soon Coverity
will understand it).  Anyway, that does establish that it's
gcc's problem not ours.  Thanks for digging!

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
> The Intel documentation for _mm256_undefined_si256() [0]
> indicates that it is intended to return "undefined elements," so it seems
> like the use of an uninitialized variable might be intentional.

See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote:
> Today's Coverity run produced this warning, which seemingly was
> triggered by one of these commits, but I can't make much sense
> of it:
> 
> *** CID 1596255:  Uninitialized variables  (UNINIT)
> /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in 
> _mm256_undefined_si256()
> 1214 extern __inline __m256i __attribute__((__gnu_inline__, 
> __always_inline__, __artificial__))
> 1215 _mm256_undefined_si256 (void)
> 1216 {
> 1217   __m256i __Y = __Y;
> >>> CID 1596255:  Uninitialized variables  (UNINIT)
> >>> Using uninitialized value "__Y".
> 1218   return __Y;
> 1219 }
> 
> I see the same code in my local copy of avxintrin.h,
> and I quite agree that it looks like either an undefined
> value or something that properly ought to be an error.
> If we are calling this, why (and from where)?

Nothing in these commits uses this, or even uses the 256-bit registers.
avxintrin.h is included by immintrin.h, which is probably why this is
showing up.  I believe you're supposed to use immintrin.h for the
intrinsics used in these commits, so I don't immediately see a great way to
avoid this.  The Intel documentation for _mm256_undefined_si256() [0]
indicates that it is intended to return "undefined elements," so it seems
like the use of an uninitialized variable might be intentional.

> Anyway, we can certainly just dismiss this warning if it
> doesn't correspond to any real problem in our code.
> But I thought I'd raise the question.

That's probably the right thing to do, unless there's some action we can
take to suppress this warning.

[0] 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_undefined_si256_expand=6943

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart  writes:
> Here is what I have staged for commit, which I intend to do shortly.

Today's Coverity run produced this warning, which seemingly was
triggered by one of these commits, but I can't make much sense
of it:

*** CID 1596255:  Uninitialized variables  (UNINIT)
/usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in 
_mm256_undefined_si256()
1214 extern __inline __m256i __attribute__((__gnu_inline__, 
__always_inline__, __artificial__))
1215 _mm256_undefined_si256 (void)
1216 {
1217   __m256i __Y = __Y;
>>> CID 1596255:  Uninitialized variables  (UNINIT)
>>> Using uninitialized value "__Y".
1218   return __Y;
1219 }

I see the same code in my local copy of avxintrin.h,
and I quite agree that it looks like either an undefined
value or something that properly ought to be an error.
If we are calling this, why (and from where)?

Anyway, we can certainly just dismiss this warning if it
doesn't correspond to any real problem in our code.
But I thought I'd raise the question.

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote:
> Here is what I have staged for commit, which I intend to do shortly.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+  CFLAGS_XSAVE=""
+  pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_xsave_intrinsics__mxsave=yes
+else
+  pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+  CFLAGS_XSAVE="-mxsave"
+  pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX-512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+  pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+  CFLAGS_POPCNT=""
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+else
+  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save

Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart  wrote:
>
> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
> > Won't Valgrind complain about this?
> >
> > +pg_popcount_avx512(const char *buf, int bytes)
> >
> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
> >
> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
>
> I haven't been able to generate any complaints, at least with some simple
> tests.  But I see your point.  If this did cause such complaints, ISTM we'd
> just want to add it to the suppression file.  Otherwise, I think we'd have
> to go back to the non-maskz approach (which I really wanted to avoid
> because of the weird function overhead juggling) or find another way to do
> a partial load into an __m512i.

[1] seems to think it's ok.  If this is true then the following
shouldn't segfault:

The following seems to run without any issue and if I change the mask
to 1 it crashes, as you'd expect.

#include <immintrin.h>
#include <stdio.h>
int main(void)
{
__m512i val;
val = _mm512_maskz_loadu_epi8((__mmask64) 0, NULL);
printf("%llu\n", _mm512_reduce_add_epi64(val));
return 0;
}

gcc avx512.c -o avx512 -O0 -mavx512f -march=native

David

[1] 
https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i




Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
> Won't Valgrind complain about this?
> 
> +pg_popcount_avx512(const char *buf, int bytes)
> 
> + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
> 
> + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);

I haven't been able to generate any complaints, at least with some simple
tests.  But I see your point.  If this did cause such complaints, ISTM we'd
just want to add it to the suppression file.  Otherwise, I think we'd have
to go back to the non-maskz approach (which I really wanted to avoid
because of the weird function overhead juggling) or find another way to do
a partial load into an __m512i.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart  wrote:
> This seems to provide a small performance boost, so I've incorporated it
> into v27.

Won't Valgrind complain about this?

+pg_popcount_avx512(const char *buf, int bytes)

+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);

+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);

David




Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote:
> On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
>> The main issue I saw was that clang was able to peel off the first
>> iteration of the loop and then eliminate the mask assignment and
>> replace masked load with a memory operand for vpopcnt. I was not able
>> to convince gcc to do that regardless of optimization options.
>> Generated code for the inner loop:
>> 
>> clang:
>> :
>>   50:  add rdx, 64
>>   54:  cmp rdx, rdi
>>   57:  jae 
>>   59:  vpopcntq zmm1, zmmword ptr [rdx]
>>   5f:  vpaddq zmm0, zmm1, zmm0
>>   65:  jmp 
>> 
>> gcc:
>> :
>>   38:  kmovq k1, rdx
>>   3d:  vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
>>   43:  add rax, 64
>>   47:  mov rdx, -1
>>   4e:  vpopcntq zmm0, zmm0
>>   54:  vpaddq zmm0, zmm0, zmm1
>>   5a:  vmovdqa64 zmm1, zmm0
>>   60:  cmp rax, rsi
>>   63:  jb 
>> 
>> I'm not sure how much that matters in practice. Attached is a patch to
>> do this manually giving essentially the same result in gcc. As most
>> distro packages are built using gcc I think it would make sense to
>> have the extra code if it gives a noticeable benefit for large cases.
> 
> Yeah, I did see this, but I also wasn't sure if it was worth further
> complicating the code.  I can test with and without your fix and see if it
> makes any difference in the benchmarks.

This seems to provide a small performance boost, so I've incorporated it
into v27.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 9fc4b7556b72d51fce676db84b446099767efff3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v27 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  11 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   5 +
 src/port/pg_popcount_avx512.c|  82 +
 src/port/pg_popcount_avx512_choose.c |  81 +
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 690 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..892b3c9580 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
+# -mavx512bw).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i a

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
> The main issue I saw was that clang was able to peel off the first
> iteration of the loop and then eliminate the mask assignment and
> replace masked load with a memory operand for vpopcnt. I was not able
> to convince gcc to do that regardless of optimization options.
> Generated code for the inner loop:
> 
> clang:
> :
>   50:  add rdx, 64
>   54:  cmp rdx, rdi
>   57:  jae 
>   59:  vpopcntq zmm1, zmmword ptr [rdx]
>   5f:  vpaddq zmm0, zmm1, zmm0
>   65:  jmp 
> 
> gcc:
> :
>   38:  kmovq k1, rdx
>   3d:  vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
>   43:  add rax, 64
>   47:  mov rdx, -1
>   4e:  vpopcntq zmm0, zmm0
>   54:  vpaddq zmm0, zmm0, zmm1
>   5a:  vmovdqa64 zmm1, zmm0
>   60:  cmp rax, rsi
>   63:  jb 
> 
> I'm not sure how much that matters in practice. Attached is a patch to
> do this manually giving essentially the same result in gcc. As most
> distro packages are built using gcc I think it would make sense to
> have the extra code if it gives a noticeable benefit for large cases.

Yeah, I did see this, but I also wasn't sure if it was worth further
complicating the code.  I can test with and without your fix and see if it
makes any difference in the benchmarks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-05 Thread Ants Aasma
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart  wrote:
> Here is an updated patch set.  IMHO this is in decent shape and is
> approaching committable.

I checked the code generation on various gcc and clang versions. It
looks mostly fine starting from versions where avx512 is supported,
gcc-7.1 and clang-5.

The main issue I saw was that clang was able to peel off the first
iteration of the loop and then eliminate the mask assignment and
replace masked load with a memory operand for vpopcnt. I was not able
to convince gcc to do that regardless of optimization options.
Generated code for the inner loop:

clang:
:
  50:  add rdx, 64
  54:  cmp rdx, rdi
  57:  jae 
  59:  vpopcntq zmm1, zmmword ptr [rdx]
  5f:  vpaddq zmm0, zmm1, zmm0
  65:  jmp 

gcc:
:
  38:  kmovq k1, rdx
  3d:  vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
  43:  add rax, 64
  47:  mov rdx, -1
  4e:  vpopcntq zmm0, zmm0
  54:  vpaddq zmm0, zmm0, zmm1
  5a:  vmovdqa64 zmm1, zmm0
  60:  cmp rax, rsi
  63:  jb 

I'm not sure how much that matters in practice. Attached is a patch to
do this manually giving essentially the same result in gcc. As most
distro packages are built using gcc I think it would make sense to
have the extra code if it gives a noticeable benefit for large cases.

The visibility map patch has the same issue, otherwise looks good.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dacc7553d29..f6e718b86e9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes)
 	 * Iterate through all but the final iteration.  Starting from second
 	 * iteration, the start index mask is ignored.
 	 */
-	for (; buf < final; buf += sizeof(__m512i))
+	if (buf < final)
 	{
 		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
 		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
 
+		buf += sizeof(__m512i);
 		mask = ~UINT64CONST(0);
+
+		for (; buf < final; buf += sizeof(__m512i))
+		{
+			val = _mm512_load_si512((const __m512i *) buf);
+			cnt = _mm512_popcnt_epi64(val);
+			accum = _mm512_add_epi64(accum, cnt);
+		}
 	}
 
 	/* Final iteration needs to ignore bytes that are not within the length */


Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
Here is an updated patch set.  IMHO this is in decent shape and is
approaching committable.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v26 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  11 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   5 +
 src/port/pg_popcount_avx512.c|  74 
 src/port/pg_popcount_avx512_choose.c |  81 +
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 682 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..892b3c9580 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
+# -mavx512bw).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..72d20d3945 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objex

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote:
> Speaking of which, what does bumping up the inlined version threshold
> to 16 do with and without AVX-512 available? Linearly extrapolating
> the 2 and 4 byte numbers it might just come ahead in both cases,
> making the choice easy.

IIRC the inlined version starts losing pretty quickly after 8 bytes.  As I
noted in my previous message, I think we have enough data to switch to your
approach already, so I think it's a moot point.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart  wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
> 
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point.  When compared with REL_16_STABLE, Ants's idea still wins:

  bytes       v25  v25+ants  REL_16_STABLE
      2  1108.205  1033.132       2039.342
      4  1311.227  1289.373       3207.217
      8  1927.954  2360.113       3200.238
     16  2281.091  2365.408       4457.769
     32  3856.992  2390.688       6206.689
     64   3648.72  3242.498       9619.403
    128  4108.549  3607.148      17912.081
    256  4910.076  4496.852      33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call.  For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer.  At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation for
both.

I think this is sufficient to justify switching approaches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-04 Thread Ants Aasma
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart  wrote:
> If we can verify this approach won't cause segfaults and can stomach the
> regression between 8 and 16 bytes, I'd happily pivot to this approach so
> that we can avoid the function call dance that I have in v25.

The approach I posted does not rely on masking performing page fault
suppression. All loads are 64 byte aligned and always contain at least
one byte of the buffer and therefore are guaranteed to be within a
valid page.

I personally don't mind it being slower for the very small cases,
because when performance on those sizes really matters it makes much
more sense to shoot for an inlined version instead.

Speaking of which, what does bumping up the inlined version threshold
to 16 do with and without AVX-512 available? Linearly extrapolating
the 2 and 4 byte numbers it might just come ahead in both cases,
making the choice easy.

Regards,
Ants Aasma




Re: Popcount optimization using AVX512

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart  wrote:
> If we can verify this approach won't cause segfaults and can stomach the
> regression between 8 and 16 bytes, I'd happily pivot to this approach so
> that we can avoid the function call dance that I have in v25.
>
> Thoughts?

If we're worried about regressions with some narrow range of byte
values, wouldn't it make more sense to compare that to cc4826dd5~1 at
the latest rather than to some version that's already probably faster
than PG16?

David




Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising.  IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions.  I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
> 
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
> 
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.

I tested your patch in comparison to v25 and saw the following:

  bytes       v25  v25+ants
      2  1108.205  1033.132
      4  1311.227  1289.373
      8  1927.954  2360.113
     16  2281.091  2365.408
     32  3856.992  2390.688
     64   3648.72  3242.498
    128  4108.549  3607.148
    256  4910.076  4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.  At 8 bytes, we are calling the function
pointer, and there is a small regression with the masking approach.
However, by 16 bytes, the masking approach is on par with v25, and it wins
for all larger buffers, although the gains seem to taper off a bit.

If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote:
> I committed v23-0001.  Here is a rebased version of the remaining patches.
> I intend to test the masking idea from Ants next.

0002 was missing a cast that is needed for the 32-bit builds.  I've fixed
that in v25.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v25 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 673 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO";

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
I committed v23-0001.  Here is a rebased version of the remaining patches.
I intend to test the masking idea from Ants next.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v24 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 673 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote:
> Sorry for the noise.  I noticed a couple of silly mistakes immediately
> after sending v21.

Sigh...  I missed a line while rebasing these patches, which seems to have
grossly offended cfbot.  Apologies again for the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bfe2b3158378fd822c17fb251178df7557065cfd Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v23 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 12 ++--
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
+		pg_popcount_optimized = pg_popcount_slow;
 	}
 }
 
@@ -183,7 +183,7 @@ static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
 	choose_popcount_functions();
-	return pg_popcount(buf, bytes);
+	return pg_popcount_optimized(buf, bytes);
 }
 
 /*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount
+ * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
  */
 uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
 {
 	return pg_popcount_slow(buf, bytes);
 }
-- 
2.25.1

>From da744d0614021cf002e4d9e292e5c874bd81a84e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v23 2/3] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote:
> In v21, 0001 is just the above inlining idea, which seems worth doing
> independent of $SUBJECT.  0002 and 0003 are the AVX-512 patches, which I've
> modified similarly to 0001, i.e., I've inlined the "fast" version in the
> function pointer to avoid the function call overhead when there are fewer
> than 64 bytes.  All of this overhead juggling should result in choosing the
> optimal popcount implementation depending on how many bytes there are to
> process, roughly speaking.

Sorry for the noise.  I noticed a couple of silly mistakes immediately
after sending v21.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From cfc5e9fe77f96225ec67a044377b10113c98ce0d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v22 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 12 ++--
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
+		pg_popcount_optimized = pg_popcount_slow;
 	}
 }
 
@@ -183,7 +183,7 @@ static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
 	choose_popcount_functions();
-	return pg_popcount(buf, bytes);
+	return pg_popcount_optimized(buf, bytes);
 }
 
 /*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount
+ * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
  */
 uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
 {
 	return pg_popcount_slow(buf, bytes);
 }
-- 
2.25.1

>From a8024ebcc54b4ac0d3d145ade5d7cd85eb192afc Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v22 2/3] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   | 

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote:
> On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
>> I don't like the double evaluation of the macro argument.  Seems like
>> you could get the same results more safely with
>> 
>>  static inline uint64
>>  pg_popcount(const char *buf, int bytes)
>>  {
>>  if (bytes < 64)
>>  {
>>  uint64  popcnt = 0;
>> 
>>  while (bytes--)
>>  popcnt += pg_number_of_ones[(unsigned char) 
>> *buf++];
>> 
>>  return popcnt;
>>  }
>>  return pg_popcount_optimized(buf, bytes);
>>  }
> 
> Yeah, I like that better.  I'll do some testing to see what the threshold
> really should be before posting an actual patch.

My testing shows that inlining wins with fewer than 8 bytes for the current
"fast" implementation.  The "fast" implementation wins with fewer than 64
bytes compared to the AVX-512 implementation.  These results are pretty
intuitive because those are the points at which the optimizations kick in.

In v21, 0001 is just the above inlining idea, which seems worth doing
independent of $SUBJECT.  0002 and 0003 are the AVX-512 patches, which I've
modified similarly to 0001, i.e., I've inlined the "fast" version in the
function pointer to avoid the function call overhead when there are fewer
than 64 bytes.  All of this overhead juggling should result in choosing the
optimal popcount implementation depending on how many bytes there are to
process, roughly speaking.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ce1180d557cbdf8cff33842ea2f1a22ba6676725 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v21 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 10 +-
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..4720f8e419 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;

Re: Popcount optimization using AVX512

2024-04-02 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising.  IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions.  I
> searched around a little bit but haven't found anything that seemed
> definitive.

After sleeping on the problem, I think we can avoid this question
altogether while making the code faster by using aligned accesses.
Loads that straddle cache line boundaries run internally as 2 load
operations. Gut feel says that there are enough out-of-order resources
available to make it not matter in most cases. But even so, not doing
the extra work is surely better. Attached is another approach that
does aligned accesses, and thereby avoids going outside bounds.

Would be interesting to see how well that fares in the small use case.
Anything that fits into one aligned cache line should be constant
speed, and there is only one branch, but the mask setup and folding
the separate popcounts together should add up to about 20-ish cycles
of overhead.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
 	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int 		tail_idx;
+	__mmask64	mask = -1;
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
-	{
-		const		__m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const		__m512i cnt = _mm512_popcnt_epi64(val);
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration
+	 * is not empty.
+	 */
+	mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
 
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	for (; buf < final; buf += sizeof(__m512i))
+	{
+		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
-		buf += sizeof(__m512i);
+
+		mask = -1;
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	/* Final iteration needs to ignore bytes that are not within the length */
+	mask &= ((~0ULL) >> (64 - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif			/* TRY_POPCNT_FAST */


Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2024-Apr-02, Nathan Bossart wrote:
>>> Another idea I had is to turn pg_popcount() into a macro that just uses the
>>> pg_number_of_ones array when called for few bytes:
>>> 
>>> static inline uint64
>>> pg_popcount_inline(const char *buf, int bytes)
>>> {
>>> uint64  popcnt = 0;
>>> 
>>> while (bytes--)
>>> popcnt += pg_number_of_ones[(unsigned char) *buf++];
>>> 
>>> return popcnt;
>>> }
>>> 
>>> #define pg_popcount(buf, bytes) \
>>> ((bytes < 64) ? \
>>>  pg_popcount_inline(buf, bytes) : \
>>>  pg_popcount_optimized(buf, bytes))
>>> 
>>> But again, I'm not sure this is really worth it for the current use-cases.
> 
>> Eh, that seems simple enough, and then you can forget about that case.
> 
> I don't like the double evaluation of the macro argument.  Seems like
> you could get the same results more safely with
> 
>   static inline uint64
>   pg_popcount(const char *buf, int bytes)
>   {
>   if (bytes < 64)
>   {
>   uint64  popcnt = 0;
> 
>   while (bytes--)
>   popcnt += pg_number_of_ones[(unsigned char) 
> *buf++];
> 
>   return popcnt;
>   }
>   return pg_popcount_optimized(buf, bytes);
>   }

Yeah, I like that better.  I'll do some testing to see what the threshold
really should be before posting an actual patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Apr-02, Nathan Bossart wrote:
>> Another idea I had is to turn pg_popcount() into a macro that just uses the
>> pg_number_of_ones array when called for few bytes:
>> 
>>  static inline uint64
>>  pg_popcount_inline(const char *buf, int bytes)
>>  {
>>  uint64  popcnt = 0;
>> 
>>  while (bytes--)
>>  popcnt += pg_number_of_ones[(unsigned char) *buf++];
>> 
>>  return popcnt;
>>  }
>> 
>>  #define pg_popcount(buf, bytes) \
>>  ((bytes < 64) ? \
>>   pg_popcount_inline(buf, bytes) : \
>>   pg_popcount_optimized(buf, bytes))
>> 
>> But again, I'm not sure this is really worth it for the current use-cases.

> Eh, that seems simple enough, and then you can forget about that case.

I don't like the double evaluation of the macro argument.  Seems like
you could get the same results more safely with

static inline uint64
pg_popcount(const char *buf, int bytes)
{
if (bytes < 64)
{
uint64  popcnt = 0;

while (bytes--)
popcnt += pg_number_of_ones[(unsigned char) 
*buf++];

return popcnt;
}
return pg_popcount_optimized(buf, bytes);
}

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-02 Thread Alvaro Herrera
On 2024-Apr-02, Nathan Bossart wrote:

> Another idea I had is to turn pg_popcount() into a macro that just uses the
> pg_number_of_ones array when called for few bytes:
> 
>   static inline uint64
>   pg_popcount_inline(const char *buf, int bytes)
>   {
>   uint64  popcnt = 0;
> 
>   while (bytes--)
>   popcnt += pg_number_of_ones[(unsigned char) *buf++];
> 
>   return popcnt;
>   }
> 
>   #define pg_popcount(buf, bytes) \
>   ((bytes < 64) ? \
>pg_popcount_inline(buf, bytes) : \
>pg_popcount_optimized(buf, bytes))
> 
> But again, I'm not sure this is really worth it for the current use-cases.

Eh, that seems simple enough, and then you can forget about that case.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"There is no man who does not aspire to plenitude, that is,
the sum of experiences of which a man is capable"




Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote:
> Here is a v19 of the patch set.  I moved out the refactoring of the
> function pointer selection code to 0001.  I think this is a good change
> independent of $SUBJECT, and I plan to commit this soon.  In 0002, I
> changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones
> instead.  This is standard practice elsewhere where the popcount functions
> are unlikely to win.  I'll probably commit this one soon, too, as it's even
> more trivial than 0001.
>
> 0003 is the AVX512 POPCNT patch.  Besides refactoring out 0001, there are
> no changes from v18.  0004 is an early proof-of-concept for using AVX512
> for the visibility map code.  The code is missing comments, and I haven't
> performed any benchmarking yet, but I figured I'd post it because it
> demonstrates how it's possible to build upon 0003 in other areas.

I've committed the first two patches, and I've attached a rebased version
of the latter two.

> AFAICT the main open question is the function call overhead in 0003 that
> Alvaro brought up earlier.  After 0002 is committed, I believe the only
> in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm
> not sure it's worth expending too much energy to make sure there are
> absolutely no regressions there.  However, I'm happy to do so if folks feel
> that it is necessary, and I'd be grateful for thoughts on how to proceed on
> this one.

Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:

static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
uint64  popcnt = 0;

while (bytes--)
popcnt += pg_number_of_ones[(unsigned char) *buf++];

return popcnt;
}

#define pg_popcount(buf, bytes) \
((bytes < 64) ? \
 pg_popcount_inline(buf, bytes) : \
 pg_popcount_optimized(buf, bytes))

But again, I'm not sure this is really worth it for the current use-cases.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3c5c3fdaffd623b513bcc476ee7c15f6379af1e7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v20 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   7 +-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 651 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising.  IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions.  I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
> 
> Interestingly the Intel software developer manual is not exactly
> crystal clear on how memory faults with masks work, but volume 2A
> chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb
> that supports memory fault suppression on page fault.

Perhaps Paul or Akash could chime in here...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
Here is a v19 of the patch set.  I moved out the refactoring of the
function pointer selection code to 0001.  I think this is a good change
independent of $SUBJECT, and I plan to commit this soon.  In 0002, I
changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones
instead.  This is standard practice elsewhere where the popcount functions
are unlikely to win.  I'll probably commit this one soon, too, as it's even
more trivial than 0001.

0003 is the AVX512 POPCNT patch.  Besides refactoring out 0001, there are
no changes from v18.  0004 is an early proof-of-concept for using AVX512
for the visibility map code.  The code is missing comments, and I haven't
performed any benchmarking yet, but I figured I'd post it because it
demonstrates how it's possible to build upon 0003 in other areas.

AFAICT the main open question is the function call overhead in 0003 that
Alvaro brought up earlier.  After 0002 is committed, I believe the only
in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm
not sure it's worth expending too much energy to make sure there are
absolutely no regressions there.  However, I'm happy to do so if folks feel
that it is necessary, and I'd be grateful for thoughts on how to proceed on
this one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 1 Apr 2024 16:37:53 -0500
Subject: [PATCH v19 1/4] refactor popcount function choosing

---
 src/port/pg_bitutils.c | 37 +
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..28312f3dd9 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -148,8 +148,8 @@ pg_popcount_available(void)
  * the function pointers so that subsequent calls are routed directly to
  * the chosen implementation.
  */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
 {
 	if (pg_popcount_available())
 	{
@@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word)
 		pg_popcount64 = pg_popcount64_slow;
 		pg_popcount = pg_popcount_slow;
 	}
+}
 
+static int
+pg_popcount32_choose(uint32 word)
+{
+	choose_popcount_functions();
 	return pg_popcount32(word);
 }
 
 static int
 pg_popcount64_choose(uint64 word)
 {
-	if (pg_popcount_available())
-	{
-		pg_popcount32 = pg_popcount32_fast;
-		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
-	}
-	else
-	{
-		pg_popcount32 = pg_popcount32_slow;
-		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
-	}
-
+	choose_popcount_functions();
 	return pg_popcount64(word);
 }
 
 static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
-	if (pg_popcount_available())
-	{
-		pg_popcount32 = pg_popcount32_fast;
-		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
-	}
-	else
-	{
-		pg_popcount32 = pg_popcount32_slow;
-		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
-	}
-
+	choose_popcount_functions();
 	return pg_popcount(buf, bytes);
 }
 
-- 
2.25.1

>From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 1 Apr 2024 16:47:22 -0500
Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for
 single byte

---
 src/backend/postmaster/syslogger.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 08efe74cc9..437947dbb9 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
 		if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
 			p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
 			p.pid != 0 &&
-			pg_popcount((char *) &dest_flags, 1) == 1)
+			pg_number_of_ones[dest_flags] == 1)
 		{
 			List	   *buffer_list;
 			ListCell   *cell;
-- 
2.25.1

>From 73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v19 3/4] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   7 +-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/tes

Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising.  IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions.  I
> searched around a little bit but haven't found anything that seemed
> definitive.

Interestingly the Intel software developer manual is not exactly
crystal clear on how memory faults with masks work, but volume 2A
chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb
that supports memory fault suppression on page fault.

Regards,
Ants Aasma

[1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf




Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> What about using the masking capabilities of AVX-512 to handle the
> tail in the same code path? Masked out portions of a load instruction
> will not generate an exception. To allow byte level granularity
> masking, -mavx512bw is needed. Based on wikipedia this will only
> disable this fast path on Knights Mill (Xeon Phi), in all other cases
> VPOPCNTQ implies availability of BW.

Sounds promising.  IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions.  I
searched around a little bit but haven't found anything that seemed
definitive.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart  wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +popcnt = _mm512_reduce_add_epi64(accum);
> >> +return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here?  Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does.  I did another benchmark on very small arrays and can see the
> overhead.  This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
> size (bytes)    HEAD      AVX512-POPCNT
> 1               1707.685  3480.424
> 2               1926.694  4606.182
> 4               3210.412  5284.506
> 8               1920.703  3640.968
> 16              2936.91   4045.586
> 32              3627.956  5538.418
> 64              5347.213  3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.

What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.

Attached is an example of what I mean. I did not have a machine to
test it with, but the code generated looks sane. I added the clang
pragma because it insisted on unrolling otherwise and based on how the
instruction dependencies look that is probably not too helpful even
for large cases (needs to be tested). The configure check and compile
flags of course need to be amended for BW.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const		__m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const		__m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif			/* TRY_POPCNT_FAST */


Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> On 2024-Mar-31, Nathan Bossart wrote:
>> +popcnt = _mm512_reduce_add_epi64(accum);
>> +return popcnt + pg_popcount_fast(buf, bytes);
> 
> Hmm, doesn't this arrangement cause an extra function call to
> pg_popcount_fast to be used here?  Given the level of micro-optimization
> being used by this code, I would have thought that you'd have tried to
> avoid that.  (At least, maybe avoid the call if bytes is 0, no?)

Yes, it does.  I did another benchmark on very small arrays and can see the
overhead.  This is the time in milliseconds to run pg_popcount() on an
array 1 billion times:

size (bytes)    HEAD      AVX512-POPCNT
1               1707.685  3480.424
2               1926.694  4606.182
4               3210.412  5284.506
8               1920.703  3640.968
16              2936.91   4045.586
32              3627.956  5538.418
64              5347.213  3748.212

I suspect that anything below 64 bytes will see this regression, as that is
the earliest point where there are enough bytes for ZMM registers.

We could avoid the call if there are no remaining bytes, but the numbers
for the smallest arrays probably wouldn't improve much, and that might
actually add some overhead due to branching.  The other option to avoid
this overhead is to put most of pg_bitutils.c into its header file so that
we can inline the call.

Reviewing the current callers of pg_popcount(), IIUC the only ones that are
passing very small arrays are the bit_count() implementations and a call in
the syslogger for a single byte.  I don't know how much to worry about the
overhead for bit_count() since there's presumably a bunch of other
overhead, and the syslogger one could probably be fixed via an inline
function that pulled the value from pg_number_of_ones (which would probably
be an improvement over the status quo, anyway).  But this is all to save a
couple of nanoseconds...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-04-01 Thread Alvaro Herrera
On 2024-Mar-31, Nathan Bossart wrote:

> +uint64
> +pg_popcount_avx512(const char *buf, int bytes)
> +{
> + uint64  popcnt;
> + __m512i accum = _mm512_setzero_si512();
> +
> + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
> + {
> + const   __m512i val = _mm512_loadu_si512((const __m512i 
> *) buf);
> + const   __m512i cnt = _mm512_popcnt_epi64(val);
> +
> + accum = _mm512_add_epi64(accum, cnt);
> + buf += sizeof(__m512i);
> + }
> +
> + popcnt = _mm512_reduce_add_epi64(accum);
> + return popcnt + pg_popcount_fast(buf, bytes);
> +}

Hmm, doesn't this arrangement cause an extra function call to
pg_popcount_fast to be used here?  Given the level of micro-optimization
being used by this code, I would have thought that you'd have tried to
avoid that.  (At least, maybe avoid the call if bytes is 0, no?)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Machinism was proscribed under penalty of tickling to death"
(Ijon Tichy in Viajes, Stanislaw Lem)




Re: Popcount optimization using AVX512

2024-03-31 Thread Nathan Bossart
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote:
> My current plan is to add some new tests for
> pg_popcount() with many bytes, and then I'll give it a few more days for
> any additional feedback before committing.

Here is a v18 with a couple of new tests.  Otherwise, it is the same as
v17.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 86a571721ed3ed4ca7e04134b9541fc3ac43b9f1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v18 1/1] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  56 +++---
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 666 insertions(+), 39 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn

Re: Popcount optimization using AVX512

2024-03-30 Thread Nathan Bossart
I used John Naylor's test_popcount module [0] to put together the attached
graphs (note that the "small arrays" one is semi-logarithmic).  For both
graphs, the X-axis is the number of 64-bit words in the array, and Y-axis
is the amount of time in milliseconds to run pg_popcount() on it 100,000
times (along with a bit of overhead).  This test didn't show any
regressions with a relatively small number of bytes, and it showed the
expected improvements with many bytes.

There isn't a ton of use of pg_popcount() in Postgres, but I do see a few
places that call it with enough bytes for the AVX512 optimization to take
effect.  There may be more callers in the future, though, and it seems
generally useful to have some of the foundational work for using AVX512
instructions in place.  My current plan is to add some new tests for
pg_popcount() with many bytes, and then I'll give it a few more days for
any additional feedback before committing.

[0] 
https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=g...@mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Here's a v17 of the patch.  This one has configure checks for everything
(i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant
runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT
availability, and we call XGETBV to ensure the ZMM registers are enabled).
I restricted the AVX512 configure checks to x86_64 since we know we won't
have TRY_POPCNT_FAST on 32-bit, and we rely on pg_popcount_fast() as our
fallback implementation in the AVX512 version.  Finally, I removed the
inline assembly in favor of using the _xgetbv() intrinsic on all systems.
It looks like that's available on gcc, clang, and msvc, although it
sometimes requires -mxsave, so that's applied to
pg_popcount_avx512_choose.o as needed.  I doubt this will lead to SIGILLs,
but it's admittedly a little shaky.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a26b209927cc6b266b33f74fd734772eff87bff9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v17 1/1] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  56 +++---
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 13 files changed, 638 insertions(+), 39 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n 

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote:
>> +#if defined(HAVE__GET_CPUID)
>> +__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
>> +#elif defined(HAVE__CPUID)
>> +__cpuidex(exx, 7, 0);
> 
> Is there any reason we can't use __get_cpuid() and __cpuid() here, given
> the sub-leaf is 0?

The answer to this seems to be "no."  After additional research,
__get_cpuid_count/__cpuidex seem new enough that we probably want configure
checks for them, so I'll add those back in the next version of the patch.

Apologies for the stream of consciousness today...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote:
> * If the compiler understands AVX512 intrinsics, we assume that it also
>   knows about the required CPUID and XGETBV intrinsics, and we assume that
>   the conditions for TRY_POPCNT_FAST are true.

Bleh, cfbot's 32-bit build is unhappy with this [0].  It looks like it's
trying to build the AVX512 stuff, but TRY_POPCNT_FAST isn't set.

[19:39:11.306] ../src/port/pg_popcount_avx512.c:39:18: warning: implicit 
declaration of function ‘pg_popcount_fast’; did you mean ‘pg_popcount’? 
[-Wimplicit-function-declaration]
[19:39:11.306]39 |  return popcnt + pg_popcount_fast(buf, bytes);
[19:39:11.306]   |  ^~~~
[19:39:11.306]   |  pg_popcount

There's also a complaint about the inline assembly:

[19:39:11.443] ../src/port/pg_popcount_avx512_choose.c:55:1: error: 
inconsistent operand constraints in an ‘asm’
[19:39:11.443]55 | __asm__ __volatile__(" xgetbv\n":"=a"(low), 
"=d"(high):"c"(xcr));
[19:39:11.443]   | ^~~

I'm looking into this...

> +#if defined(HAVE__GET_CPUID)
> + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
> +#elif defined(HAVE__CPUID)
> + __cpuidex(exx, 7, 0);

Is there any reason we can't use __get_cpuid() and __cpuid() here, given
the sub-leaf is 0?

[0] https://cirrus-ci.com/task/5475113447981056

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Okay, here is a slightly different approach that I've dubbed the "maximum
assumption" approach.  In short, I wanted to see how much we could simplify
the patch by making all possibly-reasonable assumptions about the compiler
and CPU.  These include:

* If the compiler understands AVX512 intrinsics, we assume that it also
  knows about the required CPUID and XGETBV intrinsics, and we assume that
  the conditions for TRY_POPCNT_FAST are true.
* If this is x86_64, CPUID will be supported by the CPU.
* If CPUID indicates AVX512 POPCNT support, the CPU also supports XGETBV.

Do any of these assumptions seem unreasonable or unlikely to be true for
all practical purposes?  I don't mind adding back some or all of the
configure/runtime checks if they seem necessary.  I guess the real test
will be the buildfarm...

Another big change in this version is that I've moved
pg_popcount_avx512_available() to its own file so that we only compile
pg_popcount_avx512() with the special compiler flags.  This is just an
oversight in previous versions.

Finally, I've modified the build scripts so that the AVX512 popcount stuff
is conditionally built based on the configure checks for both
autoconf/meson.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d7864391c455ea77b8e555e40a358c59de1bd702 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v16 1/1] AVX512 popcount support

---
 config/c-compiler.m4 |  34 +
 configure| 100 +++
 configure.ac |  14 
 meson.build  |  35 ++
 src/Makefile.global.in   |   4 ++
 src/include/pg_config.h.in   |   3 +
 src/include/port/pg_bitutils.h   |  17 +
 src/makefiles/meson.build|   3 +-
 src/port/Makefile|   6 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  56 ++-
 src/port/pg_popcount_avx512.c|  40 +++
 src/port/pg_popcount_avx512_choose.c |  61 
 13 files changed, 340 insertions(+), 39 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..7d13368b23 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..86c471f4ec 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,8 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_POPCNT
+PG_POPCNT_OBJS
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17438,6 +17440,104 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
 
 fi
 
+# Check for AVX512 popcount intrinsics
+#
+PG_POPCNT_OBJS=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> A counterexample is the CRC32C code.  AFAICT we assume the presence of
> CPUID in that code (and #error otherwise).  I imagine its probably safe to
> assume the compiler understands CPUID if it understands AVX512 intrinsics,
> but that is still mostly a guess.

If AVX-512 intrinsics are available, then yes you will have CPUID. CPUID is 
much older in the hardware/software timeline than AVX-512.

Thanks,
Paul





RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
> > We don't do MSVC via autoconf/Make.  We used to have a special build
> > framework for MSVC which parsed Makefiles to produce "solution" files,
> > but it was removed as soon as Meson was mature enough to build.  See
> > commit 1301c80b2167.  If it builds with Meson, you're good.
> 
> The latest cfbot build for this seems to indicate that at least newer MSVC
> knows AVX512 intrinsics without any special compiler flags [0], so maybe
> what I had in v14 is good enough.  A previous version of the patch set [1] had
> the following lines:
> 
> +  if host_system == 'windows'
> +test_flags = ['/arch:AVX512']
> +  endif
> 
> I'm not sure if this is needed for older MSVC or something else.  IIRC I 
> couldn't
> find any other examples of this sort of thing in the meson scripts, either.  
> Paul,
> do you recall why you added this?

I asked internal folks here in-the-know and they suggested I add it. I 
personally am not a Windows guy. If it works without it and you are comfortable 
not including the lines, I am fine with it.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>>> I see google web references to the xgetbv instruction as far back as 2009
>>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
>>> _xgetbv() MSVC built-in. How far back do you need to go?
> 
>> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
>> and not XGETBV then.  I guess the other question is whether CPUID
>> indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
>> If that's not safe, we might need to add another CPUID test.
> 
> Some quick googling says that (1) XGETBV predates AVX and (2) if you
> are worried about old CPUs, you should check CPUID to verify whether
> XGETBV exists before trying to use it.  I did not look for the
> bit-level details on how to do that.

That extra CPUID check should translate to exactly one additional line of
code, so I think I'm inclined to just add it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-29 Thread Shankaran, Akash
> From: Nathan Bossart  
> Sent: Friday, March 29, 2024 9:17 AM
> To: Amonson, Paul D 

> On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the 
>> XGETBV instruction is.  Unless I can assume that all x86_64 systems 
>> and compilers support that instruction, we might need an additional 
>> configure check and/or CPUID check.  It looks like MSVC has had 
>> support for the _xgetbv intrinsic for quite a while, but I'm still 
>> researching the other cases.
> 
> I see google web references to the xgetbv instruction as far back as 
> 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could 
> test for
> _xgetbv() MSVC built-in. How far back do you need to go?

> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics and 
> not XGETBV then.  I guess the other question is whether CPUID indicating 
> AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

> It would probably be easy enough to add a couple of tests for this, but if we 
> don't have reason to believe there's any practical case to do so, I don't 
> know why we would.  I'm curious what others think about this.

This seems unlikely. Machines supporting AVX512 intrinsics would support 
XGETBV. The XGETBV instruction seems to be part of the XSAVE feature set, per 
the Intel developer manual [2]. XGETBV/XSAVE came first and seem to have been 
available in all x86 systems shipped since 2011, starting with the Intel 
Sandy Bridge architecture and the AMD Opteron Gen4 [0].
AVX512 first came into a product in 2016 [1]
[0]: https://kb.vmware.com/s/article/1005764
[1]: https://en.wikipedia.org/wiki/AVX-512
[2]: https://cdrdv2-public.intel.com/774475/252046-sdm-change-document.pdf

- Akash Shankaran





Re: Popcount optimization using AVX512

2024-03-29 Thread Tom Lane
Nathan Bossart  writes:
>> I see google web references to the xgetbv instruction as far back as 2009
>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
>> _xgetbv() MSVC built-in. How far back do you need to go?

> Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
> and not XGETBV then.  I guess the other question is whether CPUID
> indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

Some quick googling says that (1) XGETBV predates AVX and (2) if you
are worried about old CPUs, you should check CPUID to verify whether
XGETBV exists before trying to use it.  I did not look for the
bit-level details on how to do that.

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote:
> It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
> builds, too, but AFAICT we still need to wrap most of that code with
> macros, so I'm not sure it's worth the trouble.  I'll take another look at
> this...

If we assumed that TRY_POPCNT_FAST would be set and either
HAVE__GET_CPUID_COUNT or HAVE__CPUIDEX would be set whenever
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is set, we could probably remove the
surrounding macros and just compile pg_popcount_avx512.c conditionally
based on USE_AVX512_POPCNT_WITH_RUNTIME_CHECK.  However, the surrounding
code seems to be pretty cautious about these assumptions (e.g., the CPUID
macros are checked before setting TRY_POPCNT_FAST), so this would stray
from the nearby precedent a bit.

A counterexample is the CRC32C code.  AFAICT we assume the presence of
CPUID in that code (and #error otherwise).  I imagine it's probably safe to
assume the compiler understands CPUID if it understands AVX512 intrinsics,
but that is still mostly a guess.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the XGETBV
>> instruction is.  Unless I can assume that all x86_64 systems and compilers
>> support that instruction, we might need an additional configure check and/or
>> CPUID check.  It looks like MSVC has had support for the _xgetbv intrinsic 
>> for
>> quite a while, but I'm still researching the other cases.
> 
> I see google web references to the xgetbv instruction as far back as 2009
> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
> _xgetbv() MSVC built-in. How far back do you need to go?

Hm.  It seems unlikely that a compiler would understand AVX512 intrinsics
and not XGETBV then.  I guess the other question is whether CPUID
indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.

It would probably be easy enough to add a couple of tests for this, but if
we don't have reason to believe there's any practical case to do so, I
don't know why we would.  I'm curious what others think about this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> -Original Message-
>
> Cool.  I think we should run the benchmarks again to be safe, though.

Ok, sure go ahead. :)

> >> I forgot to mention that I also want to understand whether we can
> >> actually assume availability of XGETBV when CPUID says we support
> >> AVX512:
> >
> > You cannot assume as there are edge cases where AVX-512 was found on
> > system one during compile but it's not actually available in a kernel
> > on a second system at runtime despite the CPU actually having the
> > hardware feature.
> 
> Yeah, I understand that much, but I want to know how portable the XGETBV
> instruction is.  Unless I can assume that all x86_64 systems and compilers
> support that instruction, we might need an additional configure check and/or
> CPUID check.  It looks like MSVC has had support for the _xgetbv intrinsic for
> quite a while, but I'm still researching the other cases.

I see google web references to the xgetbv instruction as far back as 2009 for 
Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for _xgetbv() 
MSVC built-in. How far back do you need to go?

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote:
> I see in the meson.build you added the new file twice?
> 
> @@ -7,6 +7,7 @@ pgport_sources = [
>'noblock.c',
>'path.c',
>'pg_bitutils.c',
> +  'pg_popcount_avx512.c',
>'pg_strong_random.c',
>'pgcheckdir.c',
>'pgmkdirp.c',
> @@ -84,6 +85,7 @@ replace_funcs_pos = [
>['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
>['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
>['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
> +  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 
> 'avx512_popcnt'],
> 
> I was putting the file with special flags ONLY in the second section and all 
> seemed to work. :)

Ah, yes, I think that's a mistake, and without looking closely, might
explain the MSVC warnings [0]:

[22:05:47.444] pg_popcount_avx512.c.obj : warning LNK4006: 
pg_popcount_avx512_available already defined in pg_popcount_a...

It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
builds, too, but AFAICT we still need to wrap most of that code with
macros, so I'm not sure it's worth the trouble.  I'll take another look at
this...

[0] http://commitfest.cputube.org/highlights/all.html#4883

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c924b57f8479e51aa30c8e3cfe194a2ab85497ff Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v15 1/1] AVX512 popcount support

---
 config/c-compiler.m4   |  34 +++
 configure  | 165 +
 configure.ac   |  34 +++
 meson.build|  59 
 src/Makefile.global.in |   1 +
 src/include/pg_config.h.in |   9 ++
 src/include/port/pg_bitutils.h |  20 
 src/makefiles/meson.build  |   1 +
 src/port/Makefile  |   6 ++
 src/port/meson.build   |   5 +-
 src/port/pg_bitutils.c |  56 ---
 src/port/pg_popcount_avx512.c  |  98 
 12 files changed, 450 insertions(+), 38 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX512_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
> We don't do MSVC via autoconf/Make.  We used to have a special build
> framework for MSVC which parsed Makefiles to produce "solution" files,
> but it was removed as soon as Meson was mature enough to build.  See
> commit 1301c80b2167.  If it builds with Meson, you're good.

The latest cfbot build for this seems to indicate that at least newer MSVC
knows AVX512 intrinsics without any special compiler flags [0], so maybe
what I had in v14 is good enough.  A previous version of the patch set [1]
had the following lines:

+  if host_system == 'windows'
+test_flags = ['/arch:AVX512']
+  endif

I'm not sure if this is needed for older MSVC or something else.  IIRC I
couldn't find any other examples of this sort of thing in the meson
scripts, either.  Paul, do you recall why you added this?

[0] https://cirrus-ci.com/task/5787206636273664?logs=configure#L159
[1] 
https://postgr.es/m/attachment/158206/v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote:
>> * I think we need to verify there isn't a huge performance regression for
>>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>>   two as well as a function call, which might add some noticeable overhead.
> 
> Not considering your changes, I had already tested small buffers. At less
> than 512 bytes there was no measurable regression (there was one extra
> condition check) and for 512+ bytes it moved from no regression to some
> gains between 512 and 4096 bytes. Assuming you introduced no extra
> function calls, it should be the same.

Cool.  I think we should run the benchmarks again to be safe, though.

>> I forgot to mention that I also want to understand whether we can
>> actually assume availability of XGETBV when CPUID says we support
>> AVX512:
> 
> You cannot assume as there are edge cases where AVX-512 was found on
> system one during compile but it's not actually available in a kernel on
> a second system at runtime despite the CPU actually having the hardware
> feature.

Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is.  Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check
and/or CPUID check.  It looks like MSVC has had support for the _xgetbv
intrinsic for quite a while, but I'm still researching the other cases.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Thursday, March 28, 2024 3:03 PM
> To: Nathan Bossart 
> ...
> I will review the new patch to see if there are anything that jumps out at me.

I see in the meson.build you added the new file twice?

@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 
'avx512_popcnt'],

I was putting the file with special flags ONLY in the second section and all 
seemed to work. :)

Everything else seems good to me.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amonson, Paul D wrote:

> > -Original Message-
> > From: Nathan Bossart 
> > Sent: Thursday, March 28, 2024 2:39 PM
> > To: Amonson, Paul D 
> > 
> > * The latest patch set from Paul Amonson appeared to support MSVC in the
> >   meson build, but not the autoconf one.  I don't have much expertise here,
> >   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
> >   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
> >   can always compile the x86_64 popcount code, but I don't know whether
> >   that's safe for AVX512.
> 
> I also do not know how to integrate MSVC+Autoconf, the CI uses
> MSVC+Meson+Ninja so I stuck with that.

We don't do MSVC via autoconf/Make.  We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build.  See
commit 1301c80b2167.  If it builds with Meson, you're good.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Thursday, March 28, 2024 2:39 PM
> To: Amonson, Paul D 
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.

I also do not know how to integrate MSVC+Autoconf, the CI uses MSVC+Meson+Ninja 
so I stuck with that.
 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

Not considering your changes, I had already tested small buffers. At less than 
512 bytes there was no measurable regression (there was one extra condition 
check) and for 512+ bytes it moved from no regression to some gains between 512 
and 4096 bytes. Assuming you introduced no extra function calls, it should be 
the same.

> I forgot to mention that I also want to understand whether we can actually 
> assume availability of XGETBV when CPUID says we support AVX512:

You cannot assume as there are edge cases where AVX-512 was found on system one 
during compile but it's not actually available in a kernel on a second system 
at runtime despite the CPU actually having the hardware feature.

I will review the new patch to see if there are anything that jumps out at me.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote:
> Here is a v14 of the patch that I think is beginning to approach something
> committable.  Besides general review and testing, there are two things that
> I'd like to bring up:
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.
> 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

I forgot to mention that I also want to understand whether we can actually
assume availability of XGETBV when CPUID says we support AVX512:

> + /*
> +  * We also need to check that the OS has enabled support for 
> the ZMM
> +  * registers.
> +  */
> +#ifdef _MSC_VER
> + return (_xgetbv(0) & 0xe0) != 0;
> +#else
> + uint64  xcr = 0;
> + uint32  high;
> + uint32  low;
> +
> +__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
> + return (low & 0xe0) != 0;
> +#endif

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
Here is a v14 of the patch that I think is beginning to approach something
committable.  Besides general review and testing, there are two things that
I'd like to bring up:

* The latest patch set from Paul Amonson appeared to support MSVC in the
  meson build, but not the autoconf one.  I don't have much expertise here,
  so the v14 patch doesn't have any autoconf/meson support for MSVC, which
  I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
  can always compile the x86_64 popcount code, but I don't know whether
  that's safe for AVX512.

* I think we need to verify there isn't a huge performance regression for
  smaller arrays.  IIUC those will still require an AVX512 instruction or
  two as well as a function call, which might add some noticeable overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v14 1/1] AVX512 popcount support

---
 config/c-compiler.m4   |  34 +++
 configure  | 165 +
 configure.ac   |  34 +++
 meson.build|  59 
 src/Makefile.global.in |   1 +
 src/include/pg_config.h.in |   9 ++
 src/include/port/pg_bitutils.h |  20 
 src/makefiles/meson.build  |   1 +
 src/port/Makefile  |   6 ++
 src/port/meson.build   |   6 +-
 src/port/pg_bitutils.c |  56 ---
 src/port/pg_popcount_avx512.c  |  98 
 12 files changed, 451 insertions(+), 38 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX512_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&

RE: Popcount optimization using AVX512

2024-03-27 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Wednesday, March 27, 2024 3:00 PM
> To: Amonson, Paul D 
> 
> ...  (I realize that I'm essentially
> recanting much of my previous feedback, which I apologize for.)

It happens. LOL As long as the algorithm for AVX-512 is not altered I am 
confident that your new refactor will be fine. :)

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-27 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote:
> On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote:
>> Ok, CI turned green after my re-post of the patches.  Can this please get
>> merged?
> 
> Thanks for the new patches.  I intend to take another look soon.

Thanks for your patience.  I spent most of my afternoon looking into the
latest patch set, but I needed to do a CHECKPOINT and take a break.  I am
in the middle of doing some rather heavy editorialization, but the core of
your changes will remain the same (and so I still intend to give you
authorship credit).  I've attached what I have so far, which is still
missing the configuration checks and the changes to make sure the extra
compiler flags make it to the right places.

Unless something pops up while I work on the remainder of this patch, I
think we'll end up going with a simpler approach.  I originally set out to
make this look like the CRC32C stuff (e.g., a file per implementation), but
that seemed primarily useful if we can choose which files need to be
compiled at configure-time.  However, the TRY_POPCNT_FAST macro is defined
at compile-time (AFAICT for good reason [0]), so we end up having to
compile all the files in many cases anyway, and we continue to need to
surround lots of code with "#ifdef TRY_POPCNT_FAST" or similar.  So, my
current thinking is that we should only move the AVX512 stuff to its own
file for the purposes of compiling it with special flags when possible.  (I
realize that I'm essentially recanting much of my previous feedback, which
I apologize for.)

[0] 
https://postgr.es/m/CAApHDvrONNcYxGV6C0O3ZmaL0BvXBWY%2BrBOCBuYcQVUOURwhkA%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 031eb4a365665edd304f0281ad7e412341504749 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v13 1/1] AVX512 popcount support

---
 src/include/port/pg_bitutils.h | 16 +++
 src/port/Makefile  |  1 +
 src/port/meson.build   |  1 +
 src/port/pg_bitutils.c | 53 
 src/port/pg_popcount_avx512.c  | 88 ++
 5 files changed, 125 insertions(+), 34 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..4b1e4d92b4 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,22 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#if defined(TRY_POPCNT_FAST) && \
+	defined(HAVE__IMMINTRIN) && \
+	defined(HAVE__AVX512_POPCNT)
+#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
+#define TRY_POPCNT_AVX512 1
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..eb1e56fe41 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..c77bbd3168 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..2f9a6690e0 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -142,20 +142,18 @@ pg_popcount_available(void)
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
 }
 
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
 {
 	if (pg_popcount_available())
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
 		pg_popcount = pg_popcount_fast;
+#ifdef TRY_POPCNT_AVX512
+		if (pg_popcount_avx512_available())
+			pg_popcount = pg_popcount_avx512;
+#endif
 	}
 	else
 	{
@@ -163,45 +161,32 @@ pg_popcount32_choose(uint32 word)
 		pg_popcount64 = pg_popcount64_slow;
 		pg_popcount = pg_popcount_slow;
 	}
+}
 
+/*
+ * These functions get ca

Re: Popcount optimization using AVX512

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote:
> Ok, CI turned green after my re-post of the patches.  Can this please get
> merged?

Thanks for the new patches.  I intend to take another look soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Monday, March 25, 2024 8:20 AM
> To: Tom Lane 
> Cc: David Rowley ; Nathan Bossart
> ; Andres Freund ; Alvaro
> Herrera ; Shankaran, Akash
> ; Noah Misch ; Matthias
> van de Meent ; pgsql-
> hack...@lists.postgresql.org
> Subject: RE: Popcount optimization using AVX512
>

Ok, CI turned green after my re-post of the patches.  Can this please get 
merged?

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway

On 3/25/24 11:12, Tom Lane wrote:

"Amonson, Paul D"  writes:

I am re-posting the patches as CI for Mac failed (CI error not code/test 
error). The patches are the same as last time.


Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump.  The current cycle period seems to
be about two days.



Just an FYI -- there seems to be an issue with all three of the macos 
cfbot runners (mine included). I spent time over the weekend working 
with Thomas Munro (added to CC list) trying different fixes to no avail. 
Help from macos CI wizards would be gratefully accepted...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message-
> From: Tom Lane 
> Sent: Monday, March 25, 2024 8:12 AM
> To: Amonson, Paul D 
> Cc: David Rowley ; Nathan Bossart
> Subject: Re: Popcount optimization using AVX512
>...
> Just for a note --- the cfbot will re-test existing patches every so often 
> without
> needing a bump.  The current cycle period seems to be about two days.
> 
>   regards, tom lane

Good to know! Maybe this is why I thought it originally passed CI and then 
suddenly failed this morning. I noticed at least 2 other patch runs also 
failed in the same way.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-25 Thread Tom Lane
"Amonson, Paul D"  writes:
> I am re-posting the patches as CI for Mac failed (CI error not code/test 
> error). The patches are the same as last time.

Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump.  The current cycle period seems to
be about two days.

regards, tom lane




RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Thursday, March 21, 2024 12:18 PM
> To: David Rowley 
> Cc: Nathan Bossart ; Andres Freund

I am re-posting the patches as CI for Mac failed (CI error not code/test 
error). The patches are the same as last time.

Thanks,
Paul



v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch
Description: v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch


v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch
Description: v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch


RE: Popcount optimization using AVX512

2024-03-21 Thread Amonson, Paul D
> -Original Message-
> From: David Rowley 
> Sent: Wednesday, March 20, 2024 5:28 PM
> To: Amonson, Paul D 
> Cc: Nathan Bossart ; Andres Freund
>
> I'm not sure about this "extern negates inline" comment.  It seems to me the
> compiler is perfectly free to inline a static function into an external 
> function
> and it's free to inline the static function elsewhere within the same .c file.
> 
> The final sentence of the following comment that the 0001 patch removes
> explains this:
> 
> /*
>  * When the POPCNT instruction is not available, there's no point in using
>  * function pointers to vary the implementation between the fast and slow
>  * method.  We instead just make these actual external functions when
>  * TRY_POPCNT_FAST is not defined.  The compiler should be able to inline
>  * the slow versions here.
>  */
> 
> Also, have a look at [1].  You'll see f_slow() wasn't even compiled and the 
> code
> was just inlined into f().  I just added the
> __attribute__((noinline)) so that usage() wouldn't just perform constant
> folding and just return 6.
> 
> I think, unless you have evidence that some common compiler isn't inlining the
> static into the extern then we shouldn't add the macros.
> It adds quite a bit of churn to the patch and will break out of core code as 
> you
> no longer have functions named pg_popcount32(),
> pg_popcount64() and pg_popcount().

This may be a simple misunderstanding: extern != static. If I use the "extern" 
keyword then a symbol *will* be generated and inline will be ignored. This is 
NOT true of "static inline", where the compiler will try to inline the method. 
:)

In this patch set:
* I removed the macro implementation.
* Made everything that could possibly be inlined marked with the "static 
inline" keyword.
* Conditionally made the *_slow() functions "static inline" when 
TRY_POPCNT_FAST is not set.
* Found and fixed some whitespace errors in the AVX code implementation.

Thanks,
Paul


v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch
Description: v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch


v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch
Description: v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch


Re: Popcount optimization using AVX512

2024-03-20 Thread David Rowley
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D  wrote:
> Changed in this patch set.
>
> * Rebased.
> * Direct *slow* calls via macros as shown in example patch.
> * Changed the choose filename to be platform specific as suggested.
> * Falls back to intermediate "Fast" methods if AVX512 is not available at 
> runtime.
> * inline used where it makes sense; remember, using "extern" negates "inline".

I'm not sure about this "extern negates inline" comment.  It seems to
me the compiler is perfectly free to inline a static function into an
external function and it's free to inline the static function
elsewhere within the same .c file.

The final sentence of the following comment that the 0001 patch
removes explains this:

/*
 * When the POPCNT instruction is not available, there's no point in using
 * function pointers to vary the implementation between the fast and slow
 * method.  We instead just make these actual external functions when
 * TRY_POPCNT_FAST is not defined.  The compiler should be able to inline
 * the slow versions here.
 */

Also, have a look at [1].  You'll see f_slow() wasn't even compiled
and the code was just inlined into f().  I just added the
__attribute__((noinline)) so that usage() wouldn't just perform
constant folding and just return 6.

I think, unless you have evidence that some common compiler isn't
inlining the static into the extern then we shouldn't add the macros.
It adds quite a bit of churn to the patch and will break out of core
code as you no longer have functions named pg_popcount32(),
pg_popcount64() and pg_popcount().

David

[1] https://godbolt.org/z/6joExb79d




RE: Popcount optimization using AVX512

2024-03-20 Thread Amonson, Paul D
> -Original Message-
> From: David Rowley 
> Sent: Tuesday, March 19, 2024 9:26 PM
> To: Amonson, Paul D 
> 
> AMD's Zen4 also has AVX512, so it's misleading to indicate it's an Intel only
> instruction.  Also, writing the date isn't necessary as we have "git blame"

Fixed.

Thanks,
Paul


v11-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch
Description: v11-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch


v11-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch
Description: v11-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch


v11-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch
Description: v11-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch


Re: Popcount optimization using AVX512

2024-03-19 Thread David Rowley
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D  wrote:
> Changed in this patch set.

Thanks for rebasing.

I don't think there's any need to mention Intel in each of the
following comments:

+# Check for Intel AVX512 intrinsics to do POPCNT calculations.

+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)

AMD's Zen4 also has AVX512, so it's misleading to indicate it's an
Intel only instruction.  Also, writing the date isn't necessary as we
have "git blame"

David




RE: Popcount optimization using AVX512

2024-03-19 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
>
> Committed.  Thanks for the suggestion and for reviewing!
> 
> Paul, I suspect your patches will need to be rebased after commit cc4826d.
> Would you mind doing so?

Changed in this patch set.

* Rebased.
* Direct *slow* calls via macros as shown in example patch.
* Changed the choose filename to be platform specific as suggested.
* Falls back to intermediate "Fast" methods if AVX512 is not available at 
runtime.
* inline used where it makes sense; remember, using "extern" negates "inline".
* Fixed comment issues pointed out in review.

I tested building with and without TRY_POPCNT_FAST, for both configure and 
meson build systems, and ran in CI.

Thanks,
Paul



v10-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch
Description: v10-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch


v10-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch
Description: v10-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch


v10-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch
Description: v10-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch


Re: Popcount optimization using AVX512

2024-03-19 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 12:30:50PM +1300, David Rowley wrote:
> Looks good.

Committed.  Thanks for the suggestion and for reviewing!

Paul, I suspect your patches will need to be rebased after commit cc4826d.
Would you mind doing so?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 11:08, Nathan Bossart  wrote:
>
> On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
> > Agreed.  Will send an updated patch shortly.
>
> As promised...

Looks good.

David




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
> Agreed.  Will send an updated patch shortly.

As promised...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b673663b1d1344549cbd0912220f96ba1712afc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH v4 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 155 +
 2 files changed, 121 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..1197696e97 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
-static int	pg_popcount32_slow(uint32 word);
-static int	pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
-static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static inline int pg_popcount32_fast(uint32 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,21 +174,42 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount32_fast(uint32 word)
 {
 #ifdef _MSC_VER
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,52 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 #endif
 }
 
+/*
+ * pg_popcount_fast
+ *		Returns the number of 1-bits in buf
+ */
+static uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+	uint64		popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+	/* Process in 64-bit chunks if the buffer is aligned. */
+	if (buf == (const char *) TYPEALIGN(8, buf))
+	{
+		const uint64 *words = (const uint64 *) buf;
+
+		while (bytes >= 8)
+		{
+			popcnt += pg_popcount64_fast(*words++);
+			bytes -= 8;
+		}
+
+		buf = (const char *) words;
+	}
+#else
+	/* Process in 32-bit chunks if the buffer is aligned. */
+	if (buf == (const char *) TYPEALIGN(4, buf))
+	{
+		const uint32 *words = (const uint32 *) buf;
+
+		while 

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:27:58AM +1300, David Rowley wrote:
> On Tue, 19 Mar 2024 at 10:08, Nathan Bossart  wrote:
>> On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
>> > The only thing I'd question in the patch is in pg_popcount_fast(). It
>> > looks like you've opted to not do the 32-bit processing on 32-bit
>> > machines. I think that's likely still worth coding in a similar way to
>> > how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
>> > Probably one day we'll remove that code, but it seems strange to have
>> > pg_popcount_slow() do it and not pg_popcount_fast().
>>
>> The only reason I left it out was because I couldn't convince myself that
>> it wasn't dead code, given we assume that popcntq is available in
>> pg_popcount64_fast() today.  But I don't see any harm in adding that just
>> in case.
> 
> It's probably more of a case of using native instructions rather than
> ones that might be implemented only via microcode.  For the record, I
> don't know if that would be the case for popcntq on x86 32-bit and I
> don't have the hardware to test it. It just seems less risky just to
> do it.

Agreed.  Will send an updated patch shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 10:08, Nathan Bossart  wrote:
>
> On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
> > The only thing I'd question in the patch is in pg_popcount_fast(). It
> > looks like you've opted to not do the 32-bit processing on 32-bit
> > machines. I think that's likely still worth coding in a similar way to
> > how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
> > Probably one day we'll remove that code, but it seems strange to have
> > pg_popcount_slow() do it and not pg_popcount_fast().
>
> The only reason I left it out was because I couldn't convince myself that
> it wasn't dead code, given we assume that popcntq is available in
> pg_popcount64_fast() today.  But I don't see any harm in adding that just
> in case.

It's probably more of a case of using native instructions rather than
ones that might be implemented only via microcode.  For the record, I
don't know if that would be the case for popcntq on x86 32-bit and I
don't have the hardware to test it. It just seems less risky just to
do it.

David




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:22:43PM +, Amonson, Paul D wrote:
>> The only reason I left it out was because I couldn't convince myself that it
>> wasn't dead code, given we assume that popcntq is available in
>> pg_popcount64_fast() today.  But I don't see any harm in adding that just in
>> case.
> 
> I am not sure how to read this. Does this mean that for popcount32_fast
> and popcount64_fast I can assume that the x86(_64) instructions exists
> and stop doing the runtime checks for instruction availability?

I think my question boils down to "if pg_popcount_available() returns true,
can I safely assume I'm on a 64-bit machine?"

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 2:08 PM
> To: David Rowley 
> Cc: Amonson, Paul D ; Andres Freund
>...
> 
> The only reason I left it out was because I couldn't convince myself that it
> wasn't dead code, given we assume that popcntq is available in
> pg_popcount64_fast() today.  But I don't see any harm in adding that just in
> case.

I am not sure how to read this. Does this mean that for popcount32_fast and 
popcount64_fast I can assume that the x86(_64) instructions exists and stop 
doing the runtime checks for instruction availability?

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
> I looked at your latest patch and tried out the performance on a Zen4
> running windows and a Zen2 running on Linux. As follows:

Thanks for taking a look.

> The only thing I'd question in the patch is in pg_popcount_fast(). It
> looks like you've opted to not do the 32-bit processing on 32-bit
> machines. I think that's likely still worth coding in a similar way to
> how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
> Probably one day we'll remove that code, but it seems strange to have
> pg_popcount_slow() do it and not pg_popcount_fast().

The only reason I left it out was because I couldn't convince myself that
it wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today.  But I don't see any harm in adding that just
in case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 06:30, Nathan Bossart  wrote:
> Here is a more fleshed-out version of what I believe David is proposing.
> On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
> test_popcount benchmark).  I assume this is because this patch turns
> pg_popcount() into a function pointer, which is what the AVX512 patches do,
> too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
> that I'm not yet 100% sure that we can assume we're on a 64-bit system
> there.

I looked at your latest patch and tried out the performance on a Zen4
running windows and a Zen2 running on Linux. As follows:

AMD 3990x:

master:
postgres=# select drive_popcount(1000, 1024);
Time: 11904.078 ms (00:11.904)
Time: 11907.176 ms (00:11.907)
Time: 11927.983 ms (00:11.928)

patched:
postgres=# select drive_popcount(1000, 1024);
Time: 3641.271 ms (00:03.641)
Time: 3610.934 ms (00:03.611)
Time: 3663.423 ms (00:03.663)


AMD 7945HX Windows

master:
postgres=# select drive_popcount(1000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

patched:
postgres=# select drive_popcount(1000, 1024);
Time: 3427.942 ms (00:03.428)
Time: 3364.262 ms (00:03.364)
Time: 3413.407 ms (00:03.413)

The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().

> IMHO this work is arguably a prerequisite for the AVX512 work, as turning
> pg_popcount() into a function pointer will likely regress performance for
> folks on systems without AVX512 otherwise.

I think so too.

David




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 12:30:04PM -0500, Nathan Bossart wrote:
> Here is a more fleshed-out version of what I believe David is proposing.
> On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
> test_popcount benchmark).  I assume this is because this patch turns
> pg_popcount() into a function pointer, which is what the AVX512 patches do,
> too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
> that I'm not yet 100% sure that we can assume we're on a 64-bit system
> there.
> 
> IMHO this work is arguably a prerequisite for the AVX512 work, as turning
> pg_popcount() into a function pointer will likely regress performance for
> folks on systems without AVX512 otherwise.

Apologies for the noise.  I noticed that we could (and probably should)
inline the pg_popcount32/64 calls in the "slow" version, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3047674f0950435b7fa30746be7f8e5cc7249e6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 135 -
 2 files changed, 103 insertions(+), 37 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..d0c93dafcb 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
-static int	pg_popcount32_slow(uint32 word);
-static int	pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
 static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 #endif
 }
 
+/*
+ * pg_popcount_fast
+ *		Returns the number of 

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 05:28:32PM +, Amonson, Paul D wrote:
> Question: I applied the patch for the drive_popcount* functions and
> rebuilt.  The resultant server complains that the function is missing.
> What is the trick to make this work?

You probably need to install the test_popcount extension and run "CREATE
EXTENSION test_popcount;".

> Another Question: Is there a reason "time psql" is used over the Postgres
> "\timing" command?

I don't think there's any strong reason.  I've used both.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 11:20:18AM -0500, Nathan Bossart wrote:
> I don't think David was suggesting that we need to remove the runtime
> checks for AVX512.  IIUC he was pointing out that most of the performance
> gain is from removing the function call overhead, which your v8-0002 patch
> already does for the proposed AVX512 code.  We can apply a similar
> optimization for systems without AVX512 by inlining the code for
> pg_popcount64() and pg_popcount32().

Here is a more fleshed-out version of what I believe David is proposing.
On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
test_popcount benchmark).  I assume this is because this patch turns
pg_popcount() into a function pointer, which is what the AVX512 patches do,
too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
that I'm not yet 100% sure that we can assume we're on a 64-bit system
there.

IMHO this work is arguably a prerequisite for the AVX512 work, as turning
pg_popcount() into a function pointer will likely regress performance for
folks on systems without AVX512 otherwise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 1d33c803feb7428f798b13fd643a16c73628f8a9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 123 +
 2 files changed, 97 insertions(+), 31 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..e374e753d7 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -105,16 +105,20 @@ const uint8 pg_number_of_ones[256] = {
 
 static int	pg_popcount32_slow(uint32 word);
 static int	pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
 static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq 

RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 9:20 AM
> ...
> I don't think David was suggesting that we need to remove the runtime checks
> for AVX512.  IIUC he was pointing out that most of the performance gain is
> from removing the function call overhead, which your v8-0002 patch already
> does for the proposed AVX512 code.  We can apply a similar optimization for
> systems without AVX512 by inlining the code for
> pg_popcount64() and pg_popcount32().

Ok, got you.

Question: I applied the patch for the drive_popcount* functions and rebuilt.  
The resultant server complains that the function is missing. What is the trick 
to make this work?

Another Question: Is there a reason "time psql" is used over the Postgres 
"\timing" command?

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:07:40PM +, Amonson, Paul D wrote:
> Won't I still need the runtime checks? If I compile with a compiler
> supporting the HW "feature" but run on HW without that feature,  I will
> want to avoid faults due to illegal operations. Won't that also affect
> performance?

I don't think David was suggesting that we need to remove the runtime
checks for AVX512.  IIUC he was pointing out that most of the performance
gain is from removing the function call overhead, which your v8-0002 patch
already does for the proposed AVX512 code.  We can apply a similar
optimization for systems without AVX512 by inlining the code for
pg_popcount64() and pg_popcount32().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
Won't I still need the runtime checks? If I compile with a compiler supporting 
the HW "feature" but run on HW without that feature,  I will want to avoid 
faults due to illegal operations. Won't that also affect performance?

Paul

> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 8:29 AM
> To: David Rowley 
> Cc: Amonson, Paul D ; Andres Freund
> ; Alvaro Herrera ; Shankaran,
> Akash ; Noah Misch ;
> Tom Lane ; Matthias van de Meent
> ; pgsql-hackers@lists.postgresql.org
> Subject: Re: Popcount optimization using AVX512
> 
> On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
> > Maybe it's worth exploring something along the lines of the attached
> > before doing the AVX512 stuff.  It seems like a pretty good speed-up
> > and will apply for CPUs without AVX512 support.
> 
> +1
> 
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
> Maybe it's worth exploring something along the lines of the attached
> before doing the AVX512 stuff.  It seems like a pretty good speed-up
> and will apply for CPUs without AVX512 support.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-17 Thread David Rowley
On Sat, 16 Mar 2024 at 04:06, Nathan Bossart  wrote:
> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
>
> time psql postgres -c 'select drive_popcount(1000, 1024)'
>
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the
> tests a couple of times because I had a difficult time believing the amount
> of improvement.)
>
> [0] 
> https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com

I think most of that will come from getting rid of the indirect
function that currently exists in pg_popcount().

Using the attached quick hack, the performance using John's test
module goes from:

-- master
postgres=# select drive_popcount(1000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

-- with attached hacky and untested patch
postgres=# select drive_popcount(1000, 1024);
Time: 2539.029 ms (00:02.539)
Time: 2598.223 ms (00:02.598)
Time: 2611.435 ms (00:02.611)

--- and with the avx512 patch on an AMD 7945HX CPU:
postgres=# select drive_popcount(1000, 1024);
Time: 564.982 ms
Time: 556.540 ms
Time: 554.032 ms

The following comment seems like it could do with some improvements.

 * Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
 * loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
 * methods are used this will fall back to __builtin_* or pure software.

There's nothing much specific to Intel here.  AMD Zen4 has AVX512.
Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the"

How about just:

 * Use AVX-512 Intrinsics on supported CPUs. Fall back to the software loop in
 * pg_popcount_slow() when AVX-512 is unavailable.

Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff.  It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.

David
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..85e45cee9b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -305,7 +305,18 @@ pg_popcount(const char *buf, int bytes)
 
while (bytes >= 8)
{
-   popcnt += pg_popcount64(*words++);
+#ifdef _MSC_VER
+   popcnt += __popcnt64(*words++);
+#else
+   uint64 res;
+
+   __asm__ __volatile__(" popcntq %1,%0\n"
+: "=q"(res)
+: "rm"(*words)
+: "cc");
+   popcnt += (int) res;
+   words++;
+#endif
bytes -= 8;
}
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index f1d18a1b29..ae880db64c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -26,6 +26,7 @@ subdir('test_misc')
 subdir('test_oat_hooks')
 subdir('test_parser')
 subdir('test_pg_dump')
+subdir('test_popcount')
 subdir('test_predtest')
 subdir('test_radixtree')
 subdir('test_rbtree')


RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Friday, March 15, 2024 8:31 AM
> To: Nathan Bossart 
...
> When I tested the code outside postgres in a micro benchmark I got 200-
> 300% improvements. Your results are interesting, as it implies more than
> 300% improvement. Let me do some research on the benchmark you
> referenced. However, in all cases it seems that there is no regression so 
> should
> we move forward on merging while I run some more local tests?

When running a quick test with small buffers (1 to 32K) I see up to about a 740% 
improvement. This was using my stand-alone micro benchmark outside of PG. My 
original 200-300% numbers were averaged including sizes up to 512MB, which seems 
not to run as well on large buffers.  I will try the referenced micro benchmark 
on Monday. None of my benchmark testing used the command line "time" command. 
For Postgres I set "\timing" before the run, and for the stand-alone benchmark 
I took timestamps in the code. In all cases I used -O2 for optimization.

Thanks,
Paul





RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Friday, March 15, 2024 8:06 AM
> To: Amonson, Paul D 
> Cc: Andres Freund ; Alvaro Herrera ; Shankaran, Akash ; Noah Misch
> ; Tom Lane ; Matthias van de
> Meent ; pgsql-hack...@lists.postgresql.org
> Subject: Re: Popcount optimization using AVX512
> 
> Which test suite did you run?  Those numbers seem potentially
> indistinguishable from noise, which probably isn't great for such a large 
> patch
> set.

I ran...
psql -c "select bitcount(column) from table;"
...in a loop with "column" widths of 84, 4096, 8192, and 16384 containing 
random data. There DB has 1 million rows.  In the loop before calling the 
select I have code to clear all system caches. If I omit the code to clear 
system caches the margin of error remains the same but the improvement percent 
changes from 1.2% to 14.6% (much less I/O when cached data is available).

> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
> 
>   time psql postgres -c 'select drive_popcount(1000, 1024)'
> 
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the tests 
> a
> couple of times because I had a difficult time believing the amount of
> improvement.)

When I tested the code outside postgres in a micro benchmark I got 200-300% 
improvements. Your results are interesting, as it implies more than 300% 
improvement. Let me do some research on the benchmark you referenced. However, 
in all cases it seems that there is no regression so should we move forward on 
merging while I run some more local tests?

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-15 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 07:50:46PM +, Amonson, Paul D wrote:
> As for new performance numbers: I just ran a full suite like I did
> earlier in the process. My latest results an equivalent to a pgbench
> scale factor 10 DB with the target column having varying column widths
> and appropriate random data are 1.2% improvement with a 2.2% Margin of
> Error at a 98% confidence level. Still seeing improvement and no
> regressions.

Which test suite did you run?  Those numbers seem potentially
indistinguishable from noise, which probably isn't great for such a large
patch set.

I ran John Naylor's test_popcount module [0] with the following command on
an i7-1195G7:

time psql postgres -c 'select drive_popcount(1000, 1024)'

Without your patches, this seems to take somewhere around 8.8 seconds.
With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the
tests a couple of times because I had a difficult time believing the amount
of improvement.)

[0] 
https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-14 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 11, 2024 6:35 PM
> To: Amonson, Paul D 

> Thanks.  There's no need to wait to post the AVX portion.  I recommend using
> "git format-patch" to construct the patch set for the lists.

After exploring the git format-patch command, I think I understand what you need. 
Attached.
 
> > What exactly do you suggest here? I am happy to always call either
> > pg_popcount32() or pg_popcount64() with the understanding that it may
> > not be optimal, but I do need to know which to use.
> 
> I'm recommending that we don't change any of the code in the pg_popcount()
> function (which is renamed to pg_popcount_slow() in your v6 patch).  If
> pointers are 8 or more bytes, we'll try to process the buffer in 64-bit 
> chunks.
> Else, we'll try to process it in 32-bit chunks.  Any remaining bytes will be
> processed one-by-one.

Ok, we are on the same page now. :)  It is already fixed that way in the 
refactor patch #1.

As for new performance numbers: I just ran a full suite like I did earlier in 
the process. My latest results, on the equivalent of a pgbench scale factor 10 DB 
with the target column having varying widths and appropriate random data, 
show a 1.2% improvement with a 2.2% margin of error at a 98% confidence level. 
Still seeing improvement and no regressions.

As stated in the previous separate chain I updated the code removing the extra 
"extern" keywords.

Thanks,
Paul



v8-0001-Refactor-POPCNT-code-refactored-for-future-accelerat.patch
Description: v8-0001-Refactor-POPCNT-code-refactored-for-future-accelerat.patch


v8-0002-Feat-Add-AVX-512-POPCNT-support-initial-checkin.patch
Description: v8-0002-Feat-Add-AVX-512-POPCNT-support-initial-checkin.patch

