Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Tue, Aug 30, 2022 at 12:17 AM Nathan Bossart wrote:
> Thanks!  I've attached a follow-up patch with a couple of small
> suggestions.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread Nathan Bossart
On Mon, Aug 29, 2022 at 05:49:46PM +0700, John Naylor wrote:
> Bowerbird just reported the same error, so I went ahead and pushed a
> fix with this.

Thanks!  I've attached a follow-up patch with a couple of small
suggestions.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 0ff1549083..745890f77f 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,9 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
 static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
 static inline bool vector8_is_highbit_set(const Vector8 v);
+#ifndef USE_NO_SIMD
+static inline bool vector32_is_highbit_set(const Vector32 v);
+#endif
 
 /* arithmetic operations */
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
@@ -88,7 +91,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 /*
  * comparisons between vectors
  *
- * Note: These return a vector rather than booloan, which is why we don't
+ * Note: These return a vector rather than boolean, which is why we don't
  * have non-SIMD implementations.
  */
 #ifndef USE_NO_SIMD


Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Mon, Aug 29, 2022 at 4:28 PM John Naylor wrote:
>
> Here's the simplest fix I can think of:
>
> /*
>  * Exactly like vector8_is_highbit_set except for the input type, so
> it still looks
>  * at each _byte_ separately.
>  *
>  * XXX x86 uses the same underlying type for vectors with 8-bit,
> 16-bit, and 32-bit
>  * integer elements, but Arm does not, hence the need for a separate function.
>  * We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e. check each
>  * 32-bit element, but that would require an additional mask operation on x86.
>  */
> static inline bool
> vector32_is_highbit_set(const Vector32 v)
> {
> #if defined(USE_NEON)
> return vector8_is_highbit_set((Vector8) v);
> #else
> return vector8_is_highbit_set(v);
> #endif
> }

Bowerbird just reported the same error, so I went ahead and pushed a
fix with this.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Mon, Aug 29, 2022 at 3:19 PM John Naylor wrote:
>
> It turns out MSVC animal drongo doesn't like this cast -- on x86 they
> are the same underlying type. Will look into that as more results come
> in.

Here's the simplest fix I can think of:

/*
 * Exactly like vector8_is_highbit_set except for the input type, so
it still looks
 * at each _byte_ separately.
 *
 * XXX x86 uses the same underlying type for vectors with 8-bit,
16-bit, and 32-bit
 * integer elements, but Arm does not, hence the need for a separate function.
 * We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e. check each
 * 32-bit element, but that would require an additional mask operation on x86.
 */
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
return vector8_is_highbit_set((Vector8) v);
#else
return vector8_is_highbit_set(v);
#endif
}
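
For context, the cast above is only needed because the underlying typedefs
differ by architecture. A sketch of the type layout (the Neon typedefs are
assumed here; the SSE2 ones appear in the simd.h patch below):

#ifdef USE_SSE2
#include <emmintrin.h>
typedef __m128i Vector8;
typedef __m128i Vector32;		/* same underlying type as Vector8 */
#elif defined(USE_NEON)
#include <arm_neon.h>
typedef uint8x16_t Vector8;		/* assumed, not shown in the quoted patch */
typedef uint32x4_t Vector32;	/* distinct type, hence the cast */
#endif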

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Mon, Aug 29, 2022 at 11:25 AM John Naylor wrote:
> +static inline bool
> +vector32_is_highbit_set(const Vector32 v)
> +{
> +#ifdef USE_SSE2
> + return (_mm_movemask_epi8(v) & 0x8888) != 0;
> +#endif
> +}
>
> I'm not sure why we need this function -- AFAICS it just adds more
> work on x86 for zero benefit. For our present application, can we just
> cast to Vector8 (for Arm's sake) and call the 8-bit version?

It turns out MSVC animal drongo doesn't like this cast -- on x86 they
are the same underlying type. Will look into that as more results come
in.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-29 Thread John Naylor
On Mon, Aug 29, 2022 at 12:44 PM Nathan Bossart wrote:
> [v6]

Pushed with a couple comment adjustments, let's see what the build
farm thinks...

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-28 Thread Nathan Bossart
On Mon, Aug 29, 2022 at 11:25:50AM +0700, John Naylor wrote:
> + uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
> + uint32 nelem_per_iteration = 4 * nelem_per_vector;
> 
> Using local #defines would be my style. I don't have a reason to
> object to this way, but adding const makes these vars more clear.

I added const.

> Speaking of const:
> 
> - const __m128i tmp1 = _mm_or_si128(result1, result2);
> - const __m128i tmp2 = _mm_or_si128(result3, result4);
> - const __m128i result = _mm_or_si128(tmp1, tmp2);
> + tmp1 = vector32_or(result1, result2);
> + tmp2 = vector32_or(result3, result4);
> + result = vector32_or(tmp1, tmp2);
> 
> Any reason to throw away the const declarations?

The only reason is because I had to move the declarations to before the
vector32_load() calls.

> +static inline bool
> +vector32_is_highbit_set(const Vector32 v)
> +{
> +#ifdef USE_SSE2
> + return (_mm_movemask_epi8(v) & 0x8888) != 0;
> +#endif
> +}
> 
> I'm not sure why we need this function -- AFAICS it just adds more
> work on x86 for zero benefit. For our present application, can we just
> cast to Vector8 (for Arm's sake) and call the 8-bit version?

Good idea.

> - * operations using bitwise operations on unsigned integers.
> + * operations using bitwise operations on unsigned integers.  Note that many
> + * of the functions in this file presently do not have non-SIMD
> + * implementations.
> 
> It's unclear to the reader whether this is a matter of 'round-to-it's.
> I'd like to document what I asserted in this thread, that it's likely
> not worthwhile to do anything with a uint64 representing two 32-bit
> ints. (It *is* demonstrably worth it for handling 8 byte-values at a
> time)

Done.

>   * Use saturating subtraction to find bytes <= c, which will present as
> - * NUL bytes in 'sub'.
> + * NUL bytes.
> 
> I'd like to point out that the reason to do it this way is to work
> around SIMD architectures' frequent lack of unsigned comparison.

Done.

> + * Return the result of subtracting the respective elements of the input
> + * vectors using saturation.
> 
> I wonder if we should explain briefly what saturating arithmetic is. I
> had never encountered it outside of a SIMD programming context.

Done.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 567c5309f3caa87c8cd7fe2de62309eea429d8c5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 25 Aug 2022 22:18:30 -0700
Subject: [PATCH v6 1/2] abstract architecture-specific implementation details
 from pg_lfind32()

---
 src/include/port/pg_lfind.h | 55 ---
 src/include/port/simd.h | 88 +++--
 2 files changed, 104 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index a4e13dffec..1d9be4eb36 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -91,16 +91,19 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-#ifdef USE_SSE2
+#ifndef USE_NO_SIMD
 
 	/*
-	 * A 16-byte register only has four 4-byte lanes. For better
-	 * instruction-level parallelism, each loop iteration operates on a block
-	 * of four registers. Testing has showed this is ~40% faster than using a
-	 * block of two registers.
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
+	 * faster than using a block of two registers.
 	 */
-	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+	/* round down to multiple of elements per iteration */
+	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -116,31 +119,33 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < iterations; i += 16)
+	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
-		/* load the next block into 4 registers holding 4 values each */
-		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
-		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
-		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
-		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+		Vector32	vals1, vals2, vals3, vals4,
+					result1, result2, result3, result4,
+					tmp1, tmp2, result;
+
+		/* load the next block into 4 registers */
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + nelem_per_vector]);
+		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
+		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
 
 		/* compare each value to the key */
-		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const		__m128i result2 = 

Re: use ARM intrinsics in pg_lfind32() where available

2022-08-28 Thread Tom Lane
John Naylor writes:
> I wonder if we should explain briefly what saturating arithmetic is. I
> had never encountered it outside of a SIMD programming context.

+1, it's at least worth a sentence to define the term.

regards, tom lane




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-28 Thread John Naylor
On Sun, Aug 28, 2022 at 10:58 AM Nathan Bossart wrote:
>
> Here is a new patch set in which I've attempted to address all feedback.

Looks in pretty good shape. Some more comments:

+ uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ uint32 nelem_per_iteration = 4 * nelem_per_vector;

Using local #defines would be my style. I don't have a reason to
object to this way, but adding const makes these vars more clear.
Speaking of const:

- const __m128i tmp1 = _mm_or_si128(result1, result2);
- const __m128i tmp2 = _mm_or_si128(result3, result4);
- const __m128i result = _mm_or_si128(tmp1, tmp2);
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);

Any reason to throw away the const declarations?

+static inline bool
+vector32_is_highbit_set(const Vector32 v)
+{
+#ifdef USE_SSE2
+ return (_mm_movemask_epi8(v) & 0x8888) != 0;
+#endif
+}

I'm not sure why we need this function -- AFAICS it just adds more
work on x86 for zero benefit. For our present application, can we just
cast to Vector8 (for Arm's sake) and call the 8-bit version?

Aside from that, I plan on rewriting some comments for commit, some of
which pre-date this patch:

- * operations using bitwise operations on unsigned integers.
+ * operations using bitwise operations on unsigned integers.  Note that many
+ * of the functions in this file presently do not have non-SIMD
+ * implementations.

It's unclear to the reader whether this is a matter of 'round-to-it's.
I'd like to document what I asserted in this thread, that it's likely
not worthwhile to do anything with a uint64 representing two 32-bit
ints. (It *is* demonstrably worth it for handling 8 byte-values at a
time)

  * Use saturating subtraction to find bytes <= c, which will present as
- * NUL bytes in 'sub'.
+ * NUL bytes.

I'd like to point out that the reason to do it this way is to work
around SIMD architectures' frequent lack of unsigned comparison.

+ * Return the result of subtracting the respective elements of the input
+ * vectors using saturation.

I wonder if we should explain briefly what saturating arithmetic is. I
had never encountered it outside of a SIMD programming context.
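
For illustration, saturating arithmetic clamps at the type's boundary
instead of wrapping around. A scalar sketch of the subtraction trick
(names are mine, not from the patch):

#include <stdbool.h>
#include <stdint.h>

/* subtraction that clamps at zero rather than wrapping around */
static inline uint8_t
saturating_sub(uint8_t a, uint8_t b)
{
	return (a > b) ? (uint8_t) (a - b) : 0;
}

/* v <= c exactly when the saturating difference is zero */
static inline bool
byte_is_le(uint8_t v, uint8_t c)
{
	return saturating_sub(v, c) == 0;
}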

--
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Nathan Bossart
Here is a new patch set in which I've attempted to address all feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a5f381097819db05b6e47418597cd56bab411fad Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 25 Aug 2022 22:18:30 -0700
Subject: [PATCH v5 1/2] abstract architecture-specific implementation details
 from pg_lfind32()

---
 src/include/port/pg_lfind.h | 55 +++--
 src/include/port/simd.h | 96 +++--
 2 files changed, 112 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index a4e13dffec..2f8413b59e 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -91,16 +91,19 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-#ifdef USE_SSE2
+#ifndef USE_NO_SIMD
 
 	/*
-	 * A 16-byte register only has four 4-byte lanes. For better
-	 * instruction-level parallelism, each loop iteration operates on a block
-	 * of four registers. Testing has showed this is ~40% faster than using a
-	 * block of two registers.
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
+	 * faster than using a block of two registers.
 	 */
-	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
+	uint32		nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	uint32		nelem_per_iteration = 4 * nelem_per_vector;
+
+	/* round down to multiple of elements per iteration */
+	uint32		tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -116,31 +119,33 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < iterations; i += 16)
+	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
-		/* load the next block into 4 registers holding 4 values each */
-		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
-		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
-		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
-		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+		Vector32	vals1, vals2, vals3, vals4,
+					result1, result2, result3, result4,
+					tmp1, tmp2, result;
+
+		/* load the next block into 4 registers */
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + nelem_per_vector]);
+		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
+		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
 
 		/* compare each value to the key */
-		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+		result1 = vector32_eq(keys, vals1);
+		result2 = vector32_eq(keys, vals2);
+		result3 = vector32_eq(keys, vals3);
+		result4 = vector32_eq(keys, vals4);
 
 		/* combine the results into a single variable */
-		const		__m128i tmp1 = _mm_or_si128(result1, result2);
-		const		__m128i tmp2 = _mm_or_si128(result3, result4);
-		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+		tmp1 = vector32_or(result1, result2);
+		tmp2 = vector32_or(result3, result4);
+		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (_mm_movemask_epi8(result) != 0)
+		if (vector32_is_highbit_set(result))
 		{
-#if defined(USE_ASSERT_CHECKING)
 			Assert(assert_result == true);
-#endif
 			return true;
 		}
 	}
@@ -151,14 +156,14 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	{
 		if (key == base[i])
 		{
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 			Assert(assert_result == true);
 #endif
 			return true;
 		}
 	}
 
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 	Assert(assert_result == false);
 #endif
 	return false;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a425cd887b..58b5f5ed86 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -31,39 +31,52 @@
 #include <emmintrin.h>
 #define USE_SSE2
 typedef __m128i Vector8;
+typedef __m128i Vector32;
 
 #else
 /*
  * If no SIMD instructions are available, we can in some cases emulate vector
- * operations using bitwise operations on unsigned integers.
+ * operations using bitwise operations on unsigned integers.  Note that many
+ * of the functions in this file presently do not have non-SIMD
+ * implementations.
  */
 #define USE_NO_SIMD
 typedef uint64 Vector8;
 #endif
 
-
 /* load/store operations */
 static inline void vector8_load(Vector8 *v, const uint8 *s);
+#ifndef USE_NO_SIMD
+static inline void vector32_load(Vector32 *v, const uint32 *s);
+#endif
 
 /* assignment operations */
 static inline Vector8 vector8_broadcast(const uint8 c);
+#ifndef 

Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Nathan Bossart
On Sun, Aug 28, 2022 at 10:39:09AM +1200, Thomas Munro wrote:
> On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart wrote:
>> Yup.  The problem is that AFAICT there's no equivalent to
>> _mm_movemask_epi8() on aarch64, so you end up with something like
>>
>> vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0
>>
>> But for pg_lfind32(), we really just want to know if any lane is set, which
>> only requires a call to vmaxvq_u32().  I haven't had a chance to look too
>> closely, but my guess is that this ultimately results in an extra AND
>> operation in the aarch64 path, so maybe it doesn't impact performance too
>> much.  The other option would be to open-code the intrinsic function calls
>> into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
>> thing to do for now...  What do you think?
> 
> Ahh, this gives me a flashback to John's UTF-8 validation thread[1]
> (the beginner NEON hackery in there was just a learning exercise,
> sadly not followed up with real patches...).  He had
> _mm_movemask_epi8(v) != 0 which I first translated to
> to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
> vmaxvq_u8(v) > 0x7F has the right effect without the and.

I knew there had to be an easier way!  I'll give this a try.  Thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Thomas Munro
On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart wrote:
> Yup.  The problem is that AFAICT there's no equivalent to
> _mm_movemask_epi8() on aarch64, so you end up with something like
>
> vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0
>
> But for pg_lfind32(), we really just want to know if any lane is set, which
> only requires a call to vmaxvq_u32().  I haven't had a chance to look too
> closely, but my guess is that this ultimately results in an extra AND
> operation in the aarch64 path, so maybe it doesn't impact performance too
> much.  The other option would be to open-code the intrinsic function calls
> into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
> thing to do for now...  What do you think?

Ahh, this gives me a flashback to John's UTF-8 validation thread[1]
(the beginner NEON hackery in there was just a learning exercise,
sadly not followed up with real patches...).  He had
_mm_movemask_epi8(v) != 0 which I first translated to
to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
vmaxvq_u8(v) > 0x7F has the right effect without the and.
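
For illustration, the two checks spelled out as self-contained Neon
helpers (names are mine, not from the patches):

#include <arm_neon.h>
#include <stdbool.h>

/* true if any byte's high bit is set; no AND with 0x80 needed */
static inline bool
vector8_is_highbit_set_neon(uint8x16_t v)
{
	return vmaxvq_u8(v) > 0x7F;
}

/* true if any 32-bit lane is nonzero; enough for vceqq_u32 results */
static inline bool
vector32_any_lane_set_neon(uint32x4_t v)
{
	return vmaxvq_u32(v) != 0;
}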

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJjyXvS6W05kRVpH6Kng50%3DuOGxyiyjgPKm707JxQYHCg%40mail.gmail.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Nathan Bossart
On Sat, Aug 27, 2022 at 05:18:34PM -0400, Tom Lane wrote:
> In short, I think the critical part of 0002 needs to look more like
> this:
> 
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +/*
> + * We use the Neon instructions if the compiler provides access to them
> + * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
> + * technically optional for aarch64, it appears that all available 64-bit
> + * hardware does have it.  Neon exists in some 32-bit hardware too, but
> + * we could not realistically use it there without a run-time check,
> + * which seems not worth the trouble for now.
> + */
> +#include <arm_neon.h>
> +#define USE_NEON
> ...

Thank you for the analysis!  I'll do it this way in the next patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Nathan Bossart
Thanks for taking a look.

On Sat, Aug 27, 2022 at 01:59:06PM +0700, John Naylor wrote:
> I don't foresee any use of emulating vector registers with uint64 if
> they only hold two ints. I wonder if it'd be better if all vector32
> functions were guarded with #ifndef USE_NO_SIMD. (I wonder if
> declarations without definitions cause warnings...)

Yeah.  I was a bit worried about the readability of this file with so many
#ifndefs, but after trying it out, I suppose it doesn't look _too_ bad.

> + * NB: This function assumes that each lane in the given vector either has 
> all
> + * bits set or all bits zeroed, as it is mainly intended for use with
> + * operations that produce such vectors (e.g., vector32_eq()).  If this
> + * assumption is not true, this function's behavior is undefined.
> + */
> 
> Hmm?

Yup.  The problem is that AFAICT there's no equivalent to
_mm_movemask_epi8() on aarch64, so you end up with something like

vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0

But for pg_lfind32(), we really just want to know if any lane is set, which
only requires a call to vmaxvq_u32().  I haven't had a chance to look too
closely, but my guess is that this ultimately results in an extra AND
operation in the aarch64 path, so maybe it doesn't impact performance too
much.  The other option would be to open-code the intrinsic function calls
into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
thing to do for now...  What do you think?

> -#elif defined(USE_SSE2)
> +#elif defined(USE_SSE2) || defined(USE_NEON)
> 
> I think we can just say #else.

Yes.

> -#if defined(USE_SSE2)
> - __m128i sub;
> +#ifndef USE_NO_SIMD
> + Vector8 sub;
> 
> +#elif defined(USE_NEON)
> +
> + /* use the same approach as the USE_SSE2 block above */
> + sub = vqsubq_u8(v, vector8_broadcast(c));
> + result = vector8_has_zero(sub);
> 
> I think we should invent a helper that does saturating subtraction and
> call that, inlining the sub var so we don't need to mess with it
> further.

Good idea, will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread Tom Lane
I spent a bit more time researching the portability implications of
this patch.  I think that we should check __ARM_NEON before #including
<arm_neon.h>; there is authoritative documentation out there telling
you to, eg [1], and I can see no upside at all to not checking.
We cannot check *only* __ARM_NEON, though.  I found it to get defined
by clang 8.0.0 in my Fedora 30 32-bit image, although that does not
provide all the instructions we want (I see "undefined function"
complaints for vmaxvq_u8 etc if I try to make it use the patch).
Looking into that installation's <arm_neon.h>, those functions are
defined conditionally if "__ARM_FP & 2", which is kind of interesting
--- per [1], that bit indicates support for 16-bit floating point,
which seems a mite unrelated.

It appears from the info at [2] that there are at least some 32-bit
ARM platforms that set that bit, implying (if the clang authors are
well informed) that they have the instructions we want.  But we
could not realistically make 32-bit builds that try to use those
instructions without a run-time test; such a build would fail for
too many people.  I doubt that a run-time test is worth the trouble,
so I concur with the idea of selecting NEON on aarch64 only and hoping
to thereby avoid a runtime test.

In short, I think the critical part of 0002 needs to look more like
this:

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+/*
+ * We use the Neon instructions if the compiler provides access to them
+ * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
+ * technically optional for aarch64, it appears that all available 64-bit
+ * hardware does have it.  Neon exists in some 32-bit hardware too, but
+ * we could not realistically use it there without a run-time check,
+ * which seems not worth the trouble for now.
+ */
+#include <arm_neon.h>
+#define USE_NEON
...

Coding like this appears to work on both my Apple M1 and my Raspberry
Pi, with several different OSes checked on the latter.

regards, tom lane

[1] 
https://developer.arm.com/documentation/101754/0618/armclang-Reference/Other-Compiler-specific-Features/Predefined-macros
[2] http://micro-os-plus.github.io/develop/predefined-macros/




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-27 Thread John Naylor
On Sat, Aug 27, 2022 at 1:24 AM Nathan Bossart wrote:
>
> Here is a rebased patch set that applies to HEAD.

0001:

 #define USE_NO_SIMD
 typedef uint64 Vector8;
+typedef uint64 Vector32;
 #endif

I don't foresee any use of emulating vector registers with uint64 if
they only hold two ints. I wonder if it'd be better if all vector32
functions were guarded with #ifndef USE_NO_SIMD. (I wonder if
declarations without definitions cause warnings...)

+ * NB: This function assumes that each lane in the given vector either has all
+ * bits set or all bits zeroed, as it is mainly intended for use with
+ * operations that produce such vectors (e.g., vector32_eq()).  If this
+ * assumption is not true, this function's behavior is undefined.
+ */

Hmm?

Also, is_highbit_set() already uses the same intrinsic and has the
same intended effect, since we only care about the boolean result.

0002:

-#elif defined(USE_SSE2)
+#elif defined(USE_SSE2) || defined(USE_NEON)

I think we can just say #else.

-#if defined(USE_SSE2)
- __m128i sub;
+#ifndef USE_NO_SIMD
+ Vector8 sub;

+#elif defined(USE_NEON)
+
+ /* use the same approach as the USE_SSE2 block above */
+ sub = vqsubq_u8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);

I think we should invent a helper that does saturating subtraction and
call that, inlining the sub var so we don't need to mess with it
further.
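
A sketch of such a helper; the SSE2 side assumes _mm_subs_epu8, the Neon
side uses the vqsubq_u8 call quoted above (non-SIMD fallback omitted):

static inline Vector8
vector8_ssub(const Vector8 v1, const Vector8 v2)
{
#ifdef USE_SSE2
	return _mm_subs_epu8(v1, v2);	/* per-byte unsigned saturating subtract */
#elif defined(USE_NEON)
	return vqsubq_u8(v1, v2);
#endif
}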

Otherwise seems fine.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-26 Thread Nathan Bossart
On Thu, Aug 25, 2022 at 11:13:47PM -0700, Nathan Bossart wrote:
> Here is a new patch set that applies on top of v9-0001 in the
> json_lex_string patch set [0] and v3 of the is_valid_ascii patch [1].

Here is a rebased patch set that applies to HEAD.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 8d8afe70bccec20cd381934fae5e11e155d78129 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 25 Aug 2022 22:18:30 -0700
Subject: [PATCH v4 1/2] abstract architecture-specific implementation details
 from pg_lfind32()

---
 src/include/port/pg_lfind.h | 55 +---
 src/include/port/simd.h | 63 +
 2 files changed, 93 insertions(+), 25 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index a4e13dffec..7a851ea42c 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -91,16 +91,19 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-#ifdef USE_SSE2
+#ifndef USE_NO_SIMD
 
 	/*
-	 * A 16-byte register only has four 4-byte lanes. For better
-	 * instruction-level parallelism, each loop iteration operates on a block
-	 * of four registers. Testing has showed this is ~40% faster than using a
-	 * block of two registers.
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
+	 * faster than using a block of two registers.
 	 */
-	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
+	uint32		nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	uint32		nelem_per_iteration = 4 * nelem_per_vector;
+
+	/* round down to multiple of elements per iteration */
+	uint32		tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -116,31 +119,33 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < iterations; i += 16)
+	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
-		/* load the next block into 4 registers holding 4 values each */
-		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
-		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
-		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
-		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+		Vector32	vals1, vals2, vals3, vals4,
+					result1, result2, result3, result4,
+					tmp1, tmp2, result;
+
+		/* load the next block into 4 registers */
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + nelem_per_vector]);
+		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
+		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
 
 		/* compare each value to the key */
-		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+		result1 = vector32_eq(keys, vals1);
+		result2 = vector32_eq(keys, vals2);
+		result3 = vector32_eq(keys, vals3);
+		result4 = vector32_eq(keys, vals4);
 
 		/* combine the results into a single variable */
-		const		__m128i tmp1 = _mm_or_si128(result1, result2);
-		const		__m128i tmp2 = _mm_or_si128(result3, result4);
-		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+		tmp1 = vector32_or(result1, result2);
+		tmp2 = vector32_or(result3, result4);
+		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (_mm_movemask_epi8(result) != 0)
+		if (vector32_any_lane_set(result))
 		{
-#if defined(USE_ASSERT_CHECKING)
 			Assert(assert_result == true);
-#endif
 			return true;
 		}
 	}
@@ -151,14 +156,14 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	{
 		if (key == base[i])
 		{
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 			Assert(assert_result == true);
 #endif
 			return true;
 		}
 	}
 
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 	Assert(assert_result == false);
 #endif
 	return false;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a425cd887b..c42dccf784 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -31,6 +31,7 @@
 #include <emmintrin.h>
 #define USE_SSE2
 typedef __m128i Vector8;
+typedef __m128i Vector32;
 
 #else
 /*
@@ -39,14 +40,17 @@ typedef __m128i Vector8;
  */
 #define USE_NO_SIMD
 typedef uint64 Vector8;
+typedef uint64 Vector32;
 #endif
 
 
 /* load/store operations */
 static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline void vector32_load(Vector32 *v, const uint32 *s);
 
 /* assignment operations */
 static inline Vector8 vector8_broadcast(const uint8 c);
+static inline Vector32 vector32_broadcast(const uint32 c);
 
 /* element-wise comparisons to a 

Re: use ARM intrinsics in pg_lfind32() where available

2022-08-26 Thread Nathan Bossart
Here is a new patch set that applies on top of v9-0001 in the
json_lex_string patch set [0] and v3 of the is_valid_ascii patch [1].

[0] 
https://postgr.es/m/CAFBsxsFV4v802idV0-Bo%3DV7wLMHRbOZ4er0hgposhyGCikmVGA%40mail.gmail.com
[1] 
https://postgr.es/m/CAFBsxsFFAZ6acUfyUALiem4DpCW%3DApXbF02zrc0G0oT9CPof0Q%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5f973a39d67a744d514ee80e05a1c7f40bc0ebc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 25 Aug 2022 22:18:30 -0700
Subject: [PATCH v3 1/2] abstract architecture-specific implementation details
 from pg_lfind32()

---
 src/include/port/pg_lfind.h | 55 ++
 src/include/port/simd.h | 60 +
 2 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index a4e13dffec..7a851ea42c 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -91,16 +91,19 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-#ifdef USE_SSE2
+#ifndef USE_NO_SIMD
 
 	/*
-	 * A 16-byte register only has four 4-byte lanes. For better
-	 * instruction-level parallelism, each loop iteration operates on a block
-	 * of four registers. Testing has showed this is ~40% faster than using a
-	 * block of two registers.
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
+	 * faster than using a block of two registers.
 	 */
-	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
+	uint32		nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	uint32		nelem_per_iteration = 4 * nelem_per_vector;
+
+	/* round down to multiple of elements per iteration */
+	uint32		tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -116,31 +119,33 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
-	for (i = 0; i < iterations; i += 16)
+	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
-		/* load the next block into 4 registers holding 4 values each */
-		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
-		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
-		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
-		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+		Vector32	vals1, vals2, vals3, vals4,
+					result1, result2, result3, result4,
+					tmp1, tmp2, result;
+
+		/* load the next block into 4 registers */
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + nelem_per_vector]);
+		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
+		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);
 
 		/* compare each value to the key */
-		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+		result1 = vector32_eq(keys, vals1);
+		result2 = vector32_eq(keys, vals2);
+		result3 = vector32_eq(keys, vals3);
+		result4 = vector32_eq(keys, vals4);
 
 		/* combine the results into a single variable */
-		const		__m128i tmp1 = _mm_or_si128(result1, result2);
-		const		__m128i tmp2 = _mm_or_si128(result3, result4);
-		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+		tmp1 = vector32_or(result1, result2);
+		tmp2 = vector32_or(result3, result4);
+		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (_mm_movemask_epi8(result) != 0)
+		if (vector32_any_lane_set(result))
 		{
-#if defined(USE_ASSERT_CHECKING)
 			Assert(assert_result == true);
-#endif
 			return true;
 		}
 	}
@@ -151,14 +156,14 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	{
 		if (key == base[i])
 		{
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 			Assert(assert_result == true);
 #endif
 			return true;
 		}
 	}
 
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+#ifndef USE_NO_SIMD
 	Assert(assert_result == false);
 #endif
 	return false;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 8f85153110..bd4f1a3f39 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -32,6 +32,7 @@
 #include <emmintrin.h>
 #define USE_SSE2
 typedef __m128i Vector8;
+typedef __m128i Vector32;
 
 #else
 /*
@@ -40,18 +41,24 @@ typedef __m128i Vector8;
  */
 #define USE_NO_SIMD
 typedef uint64 Vector8;
+typedef uint64 Vector32;
 #endif
 
 
 static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline void vector32_load(Vector32 *v, const uint32 *s);
 static inline Vector8 vector8_broadcast(const uint8 c);
+static inline Vector32 vector32_broadcast(const uint32 c);
 static inline bool 

Re: use ARM intrinsics in pg_lfind32() where available

2022-08-25 Thread Nathan Bossart
On Fri, Aug 26, 2022 at 10:45:10AM +0700, John Naylor wrote:
> On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart wrote:
>> The ARM literature appears to indicate that Neon support is pretty standard
>> on aarch64, and AFAICT it's pretty common to just assume it's available.
> 
> This doesn't exactly rise to the level of "find out for sure", so I
> went looking myself. This is the language I found [1]:
> 
> "Both floating-point and NEON are required in all standard ARMv8
> implementations. However, implementations targeting specialized
> markets may support the following combinations:
> 
> No NEON or floating-point.
> Full floating-point and SIMD support with exception trapping.
> Full floating-point and SIMD support without exception trapping."

Sorry, I should've linked to the documentation I found.  I saw similar
language in a couple of manuals, which is what led me to the conclusion
that Neon support is relatively standard.

> Since we assume floating-point, I see no reason not to assume NEON,
> but a case could be made for documenting that we require NEON on
> aarch64, in addition to exception trapping (for CRC runtime check) and
> floating point on any Arm. Or even just say "standard". I don't
> believe anyone will want to run Postgres on specialized hardware
> lacking these features, so maybe it's a moot point.

I'm okay with assuming Neon support for now.  It's probably easier to add
the __ARM_NEON check if/when someone complains than it is to justify
removing it once it's there.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-25 Thread John Naylor
On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart wrote:
>
> On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:
> > On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart wrote:
> >> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> >> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
> >> > to execute NEON instructions?
> >>
> >> IIUC yes, although I'm not sure how likely it is in practice.
> >
> > Given the quoted part above, it doesn't seem likely, but we should try
> > to find out for sure, because a runtime fault is surely not acceptable
> > even on a toy system.
>
> The ARM literature appears to indicate that Neon support is pretty standard
> on aarch64, and AFAICT it's pretty common to just assume it's available.

This doesn't exactly rise to the level of "find out for sure", so I
went looking myself. This is the language I found [1]:

"Both floating-point and NEON are required in all standard ARMv8
implementations. However, implementations targeting specialized
markets may support the following combinations:

No NEON or floating-point.
Full floating-point and SIMD support with exception trapping.
Full floating-point and SIMD support without exception trapping."

Since we assume floating-point, I see no reason not to assume NEON,
but a case could be made for documenting that we require NEON on
aarch64, in addition to exception trapping (for CRC runtime check) and
floating point on any Arm. Or even just say "standard". I don't
believe anyone will want to run Postgres on specialized hardware
lacking these features, so maybe it's a moot point.

[1] 
https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-24 Thread Nathan Bossart
On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:
> On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart wrote:
>> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
>> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
>> > to execute NEON instructions?
>>
>> IIUC yes, although I'm not sure how likely it is in practice.
> 
> Given the quoted part above, it doesn't seem likely, but we should try
> to find out for sure, because a runtime fault is surely not acceptable
> even on a toy system.

The ARM literature appears to indicate that Neon support is pretty standard
on aarch64, and AFAICT it's pretty common to just assume it's available.
As originally suspected, I believe that simply checking for __aarch64__
would be sufficient, but I don't think it would be unreasonable to also
check for __ARM_NEON to be safe.

>> Interestingly, Clang still defines __ARM_NEON__ even when
>> +nosimd is specified.
> 
> POLA violation, but if no one has complained to them, it's a good bet
> the instructions are always available.

Sorry, I should've been more specific.  In my testing, I could include or
omit __ARM_NEON using +[no]simd, but __ARM_NEON__ (with two underscores at
the end) was always there.  My brief research seems to indicate this might
be unique to Darwin, but in the end, it looks like __ARM_NEON (without the
trailing underscores) is the most widely used.
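
One way to see which macro a given set of flags defines is a deliberately
failing probe like this (illustration only):

/* compiling this reports which Neon macro is visible via the error */
#if defined(__ARM_NEON)
#error "__ARM_NEON is defined"
#elif defined(__ARM_NEON__)
#error "only __ARM_NEON__ is defined"
#else
#error "neither Neon macro is defined"
#endif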

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-24 Thread John Naylor
On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart wrote:
>
> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> > The important thing is: if we compile with __aarch64__ as a target:
> > - Will the compiler emit the intended instructions from the intrinsics
> > without extra flags?
>
> My testing with GCC and Clang did not require any extra flags.  GCC appears
> to enable it by default for aarch64 [0].  AFAICT this is the case for Clang
> as well, but that is based on the code and my testing (I couldn't find any
> documentation for this).

I guess you meant this part: "‘simd’ Enable Advanced SIMD
instructions. This also enables floating-point instructions. This is
on by default for all possible values for options -march and -mcpu."

> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
> > to execute NEON instructions?
>
> IIUC yes, although I'm not sure how likely it is in practice.

Given the quoted part above, it doesn't seem likely, but we should try
to find out for sure, because a runtime fault is surely not acceptable
even on a toy system.

> > "I have been able to compile for
> > __aarch64__ without __ARM_NEON" doesn't really answer that question --
> > what exactly did this entail?
>
> Compiling with something like -march=armv8-a+nosimd prevents defining
> __ARM_NEON.

Okay, that's unsurprising.

> Interestingly, Clang still defines __ARM_NEON__ even when
> +nosimd is specified.

POLA violation, but if no one has complained to them, it's a good bet
the instructions are always available.

> > I took a quick look around at Debian code search, *BSD, Apple, and a
> > few other places, and I can't find it. Then, I looked at the
> > discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
> > support to s_lock.h", and the proposed patch [1] only had __aarch64__
> > . When it was committed, the platform was vaporware and I suppose we
> > included "__aarch64" as a prophylactic measure because no other reason
> > was given. It doesn't seem to exist anywhere, so unless someone can
> > demonstrate otherwise, I'm going to rip it out soon.
>
> This is what I found, too, so +1.  I've attached a patch for this.

Thanks, I'll push this soon. I wondered if the same reasoning applies
to __arm__ / __arm nowadays, but a quick search does indicate that
__arm exists (existed?), at least.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-24 Thread Nathan Bossart
On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> The important thing is: if we compile with __aarch64__ as a target:
> - Will the compiler emit the intended instructions from the intrinsics
> without extra flags?

My testing with GCC and Clang did not require any extra flags.  GCC appears
to enable it by default for aarch64 [0].  AFAICT this is the case for Clang
as well, but that is based on the code and my testing (I couldn't find any
documentation for this).

> - Can a user on ARM64 ever get a runtime fault if the machine attempts
> to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

> "I have been able to compile for
> __aarch64__ without __ARM_NEON" doesn't really answer that question --
> what exactly did this entail?

Compiling with something like -march=armv8-a+nosimd prevents defining
__ARM_NEON.  Interestingly, Clang still defines __ARM_NEON__ even when
+nosimd is specified.

> I took a quick look around at Debian code search, *BSD, Apple, and a
> few other places, and I can't find it. Then, I looked at the
> discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
> support to s_lock.h", and the proposed patch [1] only had __aarch64__
> . When it was committed, the platform was vaporware and I suppose we
> included "__aarch64" as a prophylactic measure because no other reason
> was given. It doesn't seem to exist anywhere, so unless someone can
> demonstrate otherwise, I'm going to rip it out soon.

This is what I found, too, so +1.  I've attached a patch for this.

[0] https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index f7cd0f6f20..b14ce832bf 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -63,8 +63,7 @@
  * compiler barrier.
  *
  */
-#if defined(__arm__) || defined(__arm) || \
-	defined(__aarch64__) || defined(__aarch64)
+#if defined(__arm__) || defined(__arm) || defined(__aarch64__)
 #include "port/atomics/arch-arm.h"
 #elif defined(__i386__) || defined(__i386) || defined(__x86_64__)
 #include "port/atomics/arch-x86.h"
diff --git a/src/include/port/atomics/arch-arm.h b/src/include/port/atomics/arch-arm.h
index 9fe8f1b95f..7449f8404a 100644
--- a/src/include/port/atomics/arch-arm.h
+++ b/src/include/port/atomics/arch-arm.h
@@ -21,7 +21,7 @@
  * 64 bit atomics on ARM32 are implemented using kernel fallbacks and thus
  * might be slow, so disable entirely. On ARM64 that problem doesn't exist.
  */
-#if !defined(__aarch64__) && !defined(__aarch64)
+#if !defined(__aarch64__)
 #define PG_DISABLE_64_BIT_ATOMICS
 #else
 /*
@@ -29,4 +29,4 @@
  * general purpose register is atomic.
  */
 #define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
-#endif  /* __aarch64__ || __aarch64 */
+#endif  /* __aarch64__ */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index cc83d561b2..65aa66c598 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -256,7 +256,7 @@ spin_delay(void)
  * We use the int-width variant of the builtin because it works on more chips
  * than other widths.
  */
-#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
+#if defined(__arm__) || defined(__arm) || defined(__aarch64__)
 #ifdef HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
@@ -277,7 +277,7 @@ tas(volatile slock_t *lock)
  * high-core-count ARM64 processors.  It seems mostly a wash for smaller gear,
  * and ISB doesn't exist at all on pre-v7 ARM chips.
  */
-#if defined(__aarch64__) || defined(__aarch64)
+#if defined(__aarch64__)
 
 #define SPIN_DELAY() spin_delay()
 
@@ -288,9 +288,9 @@ spin_delay(void)
 		" isb;\n");
 }
 
-#endif	 /* __aarch64__ || __aarch64 */
+#endif	 /* __aarch64__ */
 #endif	 /* HAVE_GCC__SYNC_INT32_TAS */
-#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
+#endif	 /* __arm__ || __arm || __aarch64__ */
 
 
 /*


Re: use ARM intrinsics in pg_lfind32() where available

2022-08-23 Thread John Naylor
On Tue, Aug 23, 2022 at 4:15 AM Nathan Bossart wrote:
>
> On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:

> > Is this also ever defined on 32-bit? If so, is it safe, meaning the
> > compiler will not emit these instructions without additional flags?
> > I'm wondering if  __aarch64__ would be clearer on that, and if we get
> > windows-on-arm support as has been proposed, could also add _M_ARM64.
>
> I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
> possible, we should probably add an __aarch64__ check since functions like
> vmaxvq_u32() do not appear to be available on 32-bit.  I have been able to
> compile for __aarch64__ without __ARM_NEON, so it might still be a good
> idea to check for __ARM_NEON.

The important thing is: if we compile with __aarch64__ as a target:
- Will the compiler emit the intended instructions from the intrinsics
without extra flags?
- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions? "I have been able to compile for
__aarch64__ without __ARM_NEON" doesn't really answer that question --
what exactly did this entail?

> > I also see #if defined(__aarch64__) || defined(__aarch64) in our
> > codebase already, but I'm not sure what recognizes the latter.
>
> I'm not sure what uses the latter, either.

I took a quick look around at Debian code search, *BSD, Apple, and a
few other places, and I can't find it. Then, I looked at the
discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
support to s_lock.h", and the proposed patch [1] only had __aarch64__
. When it was committed, the platform was vaporware and I suppose we
included "__aarch64" as a prophylactic measure because no other reason
was given. It doesn't seem to exist anywhere, so unless someone can
demonstrate otherwise, I'm going to rip it out soon.

[1] 
https://www.postgresql.org/message-id/flat/1368448758.23422.12.camel%40t520.redhat.com

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-22 Thread Nathan Bossart
On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:
> On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart wrote:
>> Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
>> define __ARM_NEON, so here is a patch that uses that instead.
> 
> Is this also ever defined on 32-bit? If so, is it safe, meaning the
> compiler will not emit these instructions without additional flags?
> I'm wondering if  __aarch64__ would be clearer on that, and if we get
> windows-on-arm support as has been proposed, could also add _M_ARM64.

I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
possible, we should probably add an __aarch64__ check since functions like
vmaxvq_u32() do not appear to be available on 32-bit.  I have been able to
compile for __aarch64__ without __ARM_NEON, so it might still be a good
idea to check for __ARM_NEON.  So, to be safe, perhaps we should use
something like the following:

#if (defined(__aarch64__) || defined(__aarch64)) && defined(__ARM_NEON)

> I also see #if defined(__aarch64__) || defined(__aarch64) in our
> codebase already, but I'm not sure what recognizes the latter.

I'm not sure what uses the latter, either.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-21 Thread John Naylor
On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart wrote:
>
> On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:
> > Are you sure there's not an appropriate define for us to use here instead of a
> > configure test? E.g.
> >
> > echo|cc -dM -P -E -|grep -iE 'arm|aarch'
> > ...
> > #define __AARCH64_SIMD__ 1
> > ...
> > #define __ARM_NEON 1
> > #define __ARM_NEON_FP 0xE
> > #define __ARM_NEON__ 1
> > ..
> >
> > It strikes me as non-scalable to explicitly test all the simd instructions we'd
> > use.
>
> Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
> define __ARM_NEON, so here is a patch that uses that instead.

Is this also ever defined on 32-bit? If so, is it safe, meaning the
compiler will not emit these instructions without additional flags?
I'm wondering if  __aarch64__ would be clearer on that, and if we get
windows-on-arm support as has been proposed, could also add _M_ARM64.

I also see #if defined(__aarch64__) || defined(__aarch64) in our
codebase already, but I'm not sure what recognizes the latter.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use ARM intrinsics in pg_lfind32() where available

2022-08-19 Thread Nathan Bossart
On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:
> Are you sure there's not an appropriate define for us to use here instead of a
> configure test? E.g.
> 
> echo|cc -dM -P -E -|grep -iE 'arm|aarch'
> ...
> #define __AARCH64_SIMD__ 1
> ...
> #define __ARM_NEON 1
> #define __ARM_NEON_FP 0xE
> #define __ARM_NEON__ 1
> ..
> 
> It strikes me as non-scalable to explicitly test all the simd instructions we'd
> use.

Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
define __ARM_NEON, so here is a patch that uses that instead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5f068010d30c2a92003e43fa655eab5db8ab7ec2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Aug 2022 15:23:09 -0700
Subject: [PATCH v2 1/1] Use ARM Advanced SIMD intrinsic functions in
 pg_lfind32().

Use ARM Advanced SIMD intrinsic functions to speed up the search,
where available.  Otherwise, use a simple 'for' loop as before.  As
with b6ef167, this speeds up XidInMVCCSnapshot(), but any uses of
pg_lfind32() will also benefit.

Author: Nathan Bossart
---
 src/include/port/pg_lfind.h | 34 ++
 src/include/port/simd.h |  8 
 2 files changed, 42 insertions(+)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..41a371681d 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -82,6 +82,40 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif			/* USE_SSE2 */
 
+#ifdef __ARM_NEON
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers.
+	 */
+	const		uint32x4_t keys = vdupq_n_u32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;  /* round down to multiple of 16 */
+
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		uint32x4_t vals1 = vld1q_u32((const uint32 *) & base[i]);
+		const		uint32x4_t vals2 = vld1q_u32((const uint32 *) & base[i + 4]);
+		const		uint32x4_t vals3 = vld1q_u32((const uint32 *) & base[i + 8]);
+		const		uint32x4_t vals4 = vld1q_u32((const uint32 *) & base[i + 12]);
+
+		/* compare each value to the key */
+		const		uint32x4_t result1 = vceqq_u32(keys, vals1);
+		const		uint32x4_t result2 = vceqq_u32(keys, vals2);
+		const		uint32x4_t result3 = vceqq_u32(keys, vals3);
+		const		uint32x4_t result4 = vceqq_u32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		uint32x4_t tmp1 = vorrq_u32(result1, result2);
+		const		uint32x4_t tmp2 = vorrq_u32(result3, result4);
+		const		uint32x4_t result = vorrq_u32(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (vmaxvq_u32(result) != 0)
+			return true;
+	}
+#endif			/* __ARM_NEON */
+
 	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..67df6ef439 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -27,4 +27,12 @@
 #define USE_SSE2
 #endif
 
+/*
+ * Include arm_neon.h if the compiler is targeting an architecture that
+ * supports ARM Advanced SIMD (Neon) intrinsics.
+ */
+#ifdef __ARM_NEON
+#include <arm_neon.h>
+#endif
+
 #endif			/* SIMD_H */
-- 
2.25.1



Re: use ARM intrinsics in pg_lfind32() where available

2022-08-19 Thread Andres Freund
Hi,

On 2022-08-19 13:08:29 -0700, Nathan Bossart wrote:
> I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
> (Graviton2), and I've confirmed that the instructions aren't used on a
> Linux/Intel machine.  I did add a new configure check to see if the
> relevant intrinsics are available, but I didn't add a runtime check like
> there is for the CRC instructions since the compilers I used support these
> intrinsics by default.  (I don't think a runtime check would work very well
> with the inline function, anyway.)  AFAICT these intrinsics are pretty
> standard on aarch64, although IIUC the spec indicates that they are
> technically optional.  I suspect that a simple check for "aarch64" would be
> sufficient, but I haven't investigated the level of compiler support yet.

Are you sure there's not an appropriate define for us to use here instead of a
configure test? E.g.

echo|cc -dM -P -E -|grep -iE 'arm|aarch'
...
#define __AARCH64_SIMD__ 1
...
#define __ARM_NEON 1
#define __ARM_NEON_FP 0xE
#define __ARM_NEON__ 1
..

It strikes me as non-scalable to explicitly test all the simd instructions we'd
use.


The story for the CRC checks is different because those instructions often
aren't available with the default compilation flags and aren't guaranteed to
be available at runtime.

Regards,

Andres




use ARM intrinsics in pg_lfind32() where available

2022-08-19 Thread Nathan Bossart
Hi hackers,

This is a follow-up for recent changes that optimized [sub]xip lookups in
XidInMVCCSnapshot() on Intel hardware [0] [1].  I've attached a patch that
uses ARM Advanced SIMD (Neon) intrinsic functions where available to speed
up the search.  The approach is nearly identical to the SSE2 version, and
the usual benchmark [2] shows similar improvements.

  writers  head  simd
  8        866   836
  16       849   833
  32       782   822
  64       846   833
  128      805   821
  256      722   739
  512      529   674
  768      374   608
  1024     268   522
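
For reference, the function being optimized has the signature
pg_lfind32(uint32 key, uint32 *base, uint32 nelem); a minimal caller
sketch (the wrapper name is mine, for illustration):

#include "port/pg_lfind.h"

/* true iff xid appears among the first nxip entries of xip */
static bool
xid_in_array(uint32 xid, uint32 *xip, uint32 nxip)
{
	return pg_lfind32(xid, xip, nxip);
}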

I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
(Graviton2), and I've confirmed that the instructions aren't used on a
Linux/Intel machine.  I did add a new configure check to see if the
relevant intrinsics are available, but I didn't add a runtime check like
there is for the CRC instructions since the compilers I used support these
intrinsics by default.  (I don't think a runtime check would work very well
with the inline function, anyway.)  AFAICT these intrinsics are pretty
standard on aarch64, although IIUC the spec indicates that they are
technically optional.  I suspect that a simple check for "aarch64" would be
sufficient, but I haven't investigated the level of compiler support yet.

Thoughts?

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b6ef167
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37a6e5d
[2] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9...@2ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 1295f4d6eedabec1d850893d3bc86180bd33c932 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Aug 2022 10:41:07 -0700
Subject: [PATCH v1 1/1] Use ARM Advanced SIMD intrinsic functions in
 pg_lfind32().

Use ARM Advanced SIMD intrinsic functions to speed up the search,
where available.  Otherwise, use a simple 'for' loop as before.  As
with b6ef167, this speeds up XidInMVCCSnapshot(), but any uses of
pg_lfind32() will also benefit.

Author: Nathan Bossart
---
 config/c-compiler.m4| 25 +++
 configure   | 40 +
 configure.ac|  2 ++
 src/include/pg_config.h.in  |  3 +++
 src/include/port/pg_lfind.h | 35 
 src/include/port/simd.h |  4 
 6 files changed, 109 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 69efc5bb10..e8931d7059 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -650,3 +650,28 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_ARM_ADVANCED_SIMD_INTRINSICS
+# -
+# Check if the compiler supports the vdupq_n_u32, vld1q_u32, vceqq_u32,
+# vorrq_u32, and vmaxvq_u32 intrinsic functions.  These instructions were first
+# introduced in ARMv7.
+AC_DEFUN([PGAC_ARM_ADVANCED_SIMD_INTRINSICS],
+[AC_CACHE_CHECK([for vdupq_n_u32, vld1q_u32, vceqq_u32, vorrq_u32, and vmaxvq_u32],
+pgac_cv_arm_advanced_simd_intrinsics,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_neon.h>],
+  [unsigned int val[[]] = {1, 2, 3, 4, 5, 6, 7, 8};
+   uint32x4_t keys = vdupq_n_u32(7);
+   uint32x4_t vals1 = vld1q_u32(val);
+   uint32x4_t vals2 = vld1q_u32(&val[[4]]);
+   uint32x4_t tmp1 = vceqq_u32(keys, vals1);
+   uint32x4_t tmp2 = vceqq_u32(keys, vals2);
+   uint32x4_t result = vorrq_u32(tmp1, tmp2);
+   /* return computed value to prevent the above from being optimized away */
+   return vmaxvq_u32(result) != 0;])],
+[pgac_cv_arm_advanced_simd_intrinsics=yes],
+[pgac_cv_arm_advanced_simd_intrinsics=no])])
+if test x"$pgac_cv_arm_advanced_simd_intrinsics" = xyes ; then
+AC_DEFINE(USE_ARM_ADVANCED_SIMD_INTRINSICS, 1,
+  [Define to 1 to use ARM Advanced SIMD (Neon) intrinsics.])
+fi])# PGAC_ARM_ADVANCED_SIMD_INTRINSICS
diff --git a/configure b/configure
index b28fccbc47..0924e5ae8f 100755
--- a/configure
+++ b/configure
@@ -18230,6 +18230,46 @@ $as_echo "slicing-by-8" >&6; }
 fi
 
 
+# Check for ARM Advanced SIMD intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for vdupq_n_u32, vld1q_u32, vceqq_u32, vorrq_u32, and vmaxvq_u32" >&5
+$as_echo_n "checking for vdupq_n_u32, vld1q_u32, vceqq_u32, vorrq_u32, and vmaxvq_u32... " >&6; }
+if ${pgac_cv_arm_advanced_simd_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_neon.h>
+int
+main ()
+{
+unsigned int val[] = {1, 2, 3, 4, 5, 6, 7, 8};
+   uint32x4_t keys = vdupq_n_u32(7);
+   uint32x4_t vals1 = vld1q_u32(val);
+   uint32x4_t vals2 = vld1q_u32(&val[4]);
+   uint32x4_t tmp1 = vceqq_u32(keys, vals1);
+   uint32x4_t tmp2 = vceqq_u32(keys, vals2);
+   uint32x4_t result = vorrq_u32(tmp1, tmp2);
+   /* return computed value to prevent the above from being optimized away */
+   return vmaxvq_u32(result) != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if