Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-19 Thread chiranmoy.bhattacha...@fujitsu.com
It seems that the patch doesn't compile on macOS: the compiler is unable to map
'i' and 'len', which are of type 'size_t', to 'uint64'. This appears to be a
macOS-specific issue. The latest patch should resolve this by casting 'size_t'
to 'uint64' before passing it to 'svwhilelt_b8'.
[11:04:07.478] ../src/backend/utils/adt/encode.c:356:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478] 356 | pred = svwhilelt_b8(i, len);
[11:04:07.478] | ^~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] ../src/backend/utils/adt/encode.c:433:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478] 433 | pred = svwhilelt_b8(i / 2, len / 2);
[11:04:07.478] | ^~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] 2 errors generated.

---
Chiranmoy


v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-03 Thread chiranmoy.bhattacha...@fujitsu.com
Inlined the hex encode/decode functions in "src/include/utils/builtins.h"
similar to pg_popcount() in pg_bitutils.h.

---
Chiranmoy


v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-25 Thread Nathan Bossart
On Wed, Jan 22, 2025 at 10:58:09AM +, chiranmoy.bhattacha...@fujitsu.com 
wrote:
>> The functions that test the length before potentially calling a function
>> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
>> I wouldn't be surprised if some compilers are inlining this stuff
>> already, but it's probably worth being explicit about it.
> 
> Should we implement an inline function in "utils/builtins.h", similar to
> pg_popcount()? Currently, we have not modified the header file, everything
> is statically implemented in encode.c.

Yeah, that's what I'm currently thinking we should do.

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-25 Thread Nathan Bossart
On Wed, Jan 22, 2025 at 11:10:10AM +, chiranmoy.bhattacha...@fujitsu.com 
wrote:
> I realized I didn't attach the patch.

Thanks.  Would you mind creating a commitfest entry for this one?

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
I realized I didn't attach the patch.


v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
> The approach looks generally reasonable to me, but IMHO the code needs
> much more commentary to explain how it works.

Added comments to explain the SVE implementation.

> I would be interested to see how your bytea test compares with the
> improvements added in commit e24d770 and with sending the data in binary.

The following are the bytea test results with commit e24d770.
The same query and tables were used.

With commit e24d770:
Query exec time: 2.324 sec
hex_encode function time: 0.72 sec

Pre-commit e24d770:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE patch:
Query exec time: 1.654 sec
hex_encode_sve function time: 0.085 sec

> The functions that test the length before potentially calling a function
> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
> I wouldn't be surprised if some compilers are inlining this stuff
> already, but it's probably worth being explicit about it.

Should we implement an inline function in "utils/builtins.h", similar to
pg_popcount()? Currently, we have not modified the header file, everything
is statically implemented in encode.c.

---
Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-17 Thread Nathan Bossart
With commit e24d770 in place, I took a closer look at hex_decode(), and I
concluded that doing anything better without intrinsics would likely
require either a huge lookup table or something with complexity rivalling
the instrinsics approach (while also not rivalling its performance).  So, I
took a closer look at the instrinsics patches and had the following
thoughts:

* The approach looks generally reasonable to me, but IMHO the code needs
  much more commentary to explain how it works.

* The functions that test the length before potentially calling a function
  pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
  I wouldn't be surprised if some compilers are inlining this stuff
  already, but it's probably worth being explicit about it.

* Finally, I think we should ensure we've established a really strong case
  for this optimization.  IME these intrinsics patches require a ton of
  time and energy, and the code is often extremely complex.  I would be
  interested to see how your bytea test compares with the improvements
  added in commit e24d770 and with sending the data in binary.

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-15 Thread Tom Lane
David Rowley  writes:
> I agree that the evidence you (John) gathered is enough reason to use
> memcpy().

Okay ... doesn't quite match my intuition, but intuition is a poor
guide to such things.

regards, tom lane




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-15 Thread David Rowley
On Wed, 15 Jan 2025 at 23:57, John Naylor  wrote:
>
> On Wed, Jan 15, 2025 at 2:14 PM Tom Lane  wrote:
> > Compilers that inline memcpy() may arrive at the same machine code,
> > but why rely on the compiler to make that optimization?  If the
> > compiler fails to do so, an out-of-line memcpy() call will surely
> > be a loser.
>
> See measurements at the end. As for compilers, gcc 3.4.6 and clang
> 3.0.0 can inline the memcpy. The manual copy above only gets combined
> to a single word starting with gcc 12 and clang 15, and latest MSVC
> still can't do it (4A in the godbolt link below). Are there any
> buildfarm animals around that may not inline memcpy for word-sized
> input?
>
> > A variant could be
> >
> > +   const char *hexptr = &hextbl[2 * usrc];
> > +   *dst++ = hexptr[0];
> > +   *dst++ = hexptr[1];

I'd personally much rather see us using memcpy() for this sort of
stuff. If the compiler is too braindead to inline tiny
constant-and-power-of-two-sized memcpys then we'd probably also have
plenty of other performance issues with that compiler already. I don't
think contorting the code into something less human-readable and
something the compiler may struggle even more to optimise is a good
idea.  The naive way to implement the above requires two MOVs of
single bytes and two increments of dst. I imagine it's easier for the
compiler to inline a small constant-sized memcpy() than to figure out
that it's safe to implement the above with a single word-sized MOV
rather than two byte-sized MOVs due to the "dst++" in between the two.

I agree that the evidence you (John) gathered is enough reason to use memcpy().

David




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-15 Thread Ranier Vilela
Hi.

Em qua., 15 de jan. de 2025 às 07:57, John Naylor 
escreveu:

> On Wed, Jan 15, 2025 at 2:14 PM Tom Lane  wrote:
>
> > Couple of thoughts:
> >
> > 1. I was actually hoping for a comment on the constant's definition,
> > perhaps along the lines of
> >
> > /*
> >  * The hex expansion of each possible byte value (two chars per value).
> >  */
>
> Works for me. With that, did you mean we then wouldn't need a comment
> in the code?
>
> > 2. Since "src" is defined as "const char *", I'm pretty sure that
> > pickier compilers will complain that
> >
> > +   unsigned char usrc = *((unsigned char *) src);
> >
> > results in casting away const.  Recommend
> >
> > +   unsigned char usrc = *((const unsigned char *) src);
>
> Thanks for the reminder!
>
> > 3.  I really wonder if
> >
> > +   memcpy(dst, &hextbl[2 * usrc], 2);
> >
> > is faster than copying the two bytes manually, along the lines of
> >
> > +   *dst++ = hextbl[2 * usrc];
> > +   *dst++ = hextbl[2 * usrc + 1];
> >
> > Compilers that inline memcpy() may arrive at the same machine code,
> > but why rely on the compiler to make that optimization?  If the
> > compiler fails to do so, an out-of-line memcpy() call will surely
> > be a loser.
>
> See measurements at the end. As for compilers, gcc 3.4.6 and clang
> 3.0.0 can inline the memcpy. The manual copy above only gets combined
> to a single word starting with gcc 12 and clang 15, and latest MSVC
> still can't do it (4A in the godbolt link below). Are there any
> buildfarm animals around that may not inline memcpy for word-sized
> input?
>
> > A variant could be
> >
> > +   const char *hexptr = &hextbl[2 * usrc];
> > +   *dst++ = hexptr[0];
> > +   *dst++ = hexptr[1];
> >
> > but this supposes that the compiler fails to see the common
> > subexpression in the other formulation, which I believe
> > most modern compilers will see.
>
> This combines to a single word starting with clang 5, but does not
> work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
> on my machine bytewise load/stores are somewhere in the middle:
>
> master1158.969 ms
> v3 776.791 ms
> variant 4A 775.777 ms
> variant 4B 969.945 ms
>
> https://godbolt.org/z/ajToordKq

Your example from godbolt has an important difference, which changes the
resulting assembler.

-static const char hextbl[] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;
+static const char hextbl[512] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;


best regards,
Ranier Vilela


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-15 Thread John Naylor
On Wed, Jan 15, 2025 at 2:14 PM Tom Lane  wrote:

> Couple of thoughts:
>
> 1. I was actually hoping for a comment on the constant's definition,
> perhaps along the lines of
>
> /*
>  * The hex expansion of each possible byte value (two chars per value).
>  */

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

> 2. Since "src" is defined as "const char *", I'm pretty sure that
> pickier compilers will complain that
>
> +   unsigned char usrc = *((unsigned char *) src);
>
> results in casting away const.  Recommend
>
> +   unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

> 3.  I really wonder if
>
> +   memcpy(dst, &hextbl[2 * usrc], 2);
>
> is faster than copying the two bytes manually, along the lines of
>
> +   *dst++ = hextbl[2 * usrc];
> +   *dst++ = hextbl[2 * usrc + 1];
>
> Compilers that inline memcpy() may arrive at the same machine code,
> but why rely on the compiler to make that optimization?  If the
> compiler fails to do so, an out-of-line memcpy() call will surely
> be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

> A variant could be
>
> +   const char *hexptr = &hextbl[2 * usrc];
> +   *dst++ = hexptr[0];
> +   *dst++ = hexptr[1];
>
> but this supposes that the compiler fails to see the common
> subexpression in the other formulation, which I believe
> most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master1158.969 ms
v3 776.791 ms
variant 4A 775.777 ms
variant 4B 969.945 ms

https://godbolt.org/z/ajToordKq

--
John Naylor
Amazon Web Services




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-14 Thread Tom Lane
John Naylor  writes:
> Okay, I added a comment. I also agree with Michael that my quick
> one-off was a bit hard to read so I've cleaned it up a bit. I plan to
> commit the attached by Friday, along with any bikeshedding that
> happens by then.

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
 * The hex expansion of each possible byte value (two chars per value).
 */

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+   unsigned char usrc = *((unsigned char *) src);

results in casting away const.  Recommend

+   unsigned char usrc = *((const unsigned char *) src);

3.  I really wonder if

+   memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+   *dst++ = hextbl[2 * usrc];
+   *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization?  If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

A variant could be

+   const char *hexptr = &hextbl[2 * usrc];
+   *dst++ = hexptr[0];
+   *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

regards, tom lane




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-14 Thread John Naylor
On Tue, Jan 14, 2025 at 11:57 PM Nathan Bossart
 wrote:
>
> On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:
> > John Naylor  writes:
> >> We can do about as well simply by changing the nibble lookup to a byte
> >> lookup, which works on every compiler and architecture:
>
> Nice.  I tried enabling auto-vectorization and loop unrolling on top of
> this patch, and the numbers looked the same.  I think we'd need CPU
> intrinsics or an even bigger lookup table to do any better.

Thanks for looking further! Yeah, I like that the table is still only 512 bytes.

> > I didn't attempt to verify your patch, but I do prefer addressing
> > this issue in a machine-independent fashion.  I also like the brevity
> > of the patch (though it could do with some comments perhaps, not that
> > the existing code has any).
>
> +1

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

--
John Naylor
Amazon Web Services
From a62aea5fdbfbd215435ddc4c294897caa292b6f7 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 15 Jan 2025 13:28:26 +0700
Subject: [PATCH v3] Speed up hex_encode with bytewise lookup

Previously, hex_encode looked up each nibble of the input
separately. We now use a larger lookup table containing the two-byte
encoding of every possible input byte, resulting in a 1/3 reduction
in encoding time.

Reviewed by Michael Paquier, Tom Lane, and Nathan Bossart

Discussion: https://postgr.es/m/CANWCAZZvXuJMgqMN4u068Yqa19CEjS31tQKZp_qFFFbgYfaXqQ%40mail.gmail.com
---
 src/backend/utils/adt/encode.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..7fee154b0d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,23 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] =
+"000102030405060708090a0b0c0d0e0f"
+"101112131415161718191a1b1c1d1e1f"
+"202122232425262728292a2b2c2d2e2f"
+"303132333435363738393a3b3c3d3e3f"
+"404142434445464748494a4b4c4d4e4f"
+"505152535455565758595a5b5c5d5e5f"
+"606162636465666768696a6b6c6d6e6f"
+"707172737475767778797a7b7c7d7e7f"
+"808182838485868788898a8b8c8d8e8f"
+"909192939495969798999a9b9c9d9e9f"
+"a0a1a2a3a4a5a6a7a8a9aaabacadaeaf"
+"b0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
+"c0c1c2c3c4c5c6c7c8c9cacbcccdcecf"
+"d0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
+"e0e1e2e3e4e5e6e7e8e9eaebecedeeef"
+"f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +181,16 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
+		unsigned char usrc = *((unsigned char *) src);
+
+		/*
+		 * Each input byte results in two output bytes, so we use the unsigned
+		 * input byte multiplied by two as the lookup key.
+		 */
+		memcpy(dst, &hextbl[2 * usrc], 2);
+
 		src++;
+		dst += 2;
 	}
 	return (uint64) len * 2;
 }
-- 
2.47.1



Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-14 Thread Nathan Bossart
On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:
> John Naylor  writes:
>> We can do about as well simply by changing the nibble lookup to a byte
>> lookup, which works on every compiler and architecture:

Nice.  I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same.  I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

> I didn't attempt to verify your patch, but I do prefer addressing
> this issue in a machine-independent fashion.  I also like the brevity
> of the patch (though it could do with some comments perhaps, not that
> the existing code has any).

+1

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Tom Lane
John Naylor  writes:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion.  I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

regards, tom lane




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Michael Paquier
On Tue, Jan 14, 2025 at 12:27:30PM +0700, John Naylor wrote:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:
> 
> select hex_encode_test(100, 1024);
> master:
> Time: 1158.700 ms
> v2:
> Time:  777.443 ms
> 
> If we need to do much better than this, it seems better to send the
> data to the client as binary, if possible.

That's pretty cool.  Complex to parse, still really cool.
--
Michael




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread John Naylor
On Sat, Jan 11, 2025 at 3:46 AM Nathan Bossart  wrote:
>
> I was able to get auto-vectorization to take effect on Apple clang 16 with
> the following addition to src/backend/utils/adt/Makefile:
>
> encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8
>
> This gave the following results with your hex_encode_test() function:
>
>   buf   |  HEAD | patch | % diff
> --------+-------+-------+--------
>      16 |    21 |    16 |     24
>      64 |    54 |    41 |     24
>     256 |   138 |   100 |     28
>    1024 |   441 |   300 |     32
>    4096 |  1671 |  1106 |     34
>   16384 |  6890 |  4570 |     34
>   65536 | 27393 | 18054 |     34

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(100, 1024);
master:
Time: 1158.700 ms
v2:
Time:  777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.

-- 
John Naylor
Amazon Web Services
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..8b059bc834 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,7 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +165,8 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
-		src++;
+		memcpy(dst, &hextbl[(* ((unsigned char *) src)) * 2], 2);
+		src++; dst+=2;
 	}
 	return (uint64) len * 2;
 }


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread Nathan Bossart
On Mon, Jan 13, 2025 at 03:48:49PM +, chiranmoy.bhattacha...@fujitsu.com 
wrote:
> There is a 30% improvement using auto-vectorization.

It might be worth enabling auto-vectorization independently of any patches
that use intrinsics, then.

> Currently, it is assumed that all aarch64 machine support NEON, but for
> newer advanced SIMD like SVE (and AVX512 for x86) this assumption may not
> hold. We need a runtime check to be sure. Using src/include/port/simd.h
> to abstract away these advanced SIMD implementations may be difficult.

Yeah, moving simd.h to anything beyond Neon/SSE2 might be tricky at the
moment.  Besides the need for additional runtime checks, using wider
registers can mean that you need more data before an optimization takes
effect, which is effectively a regression.  I ran into this when I tried to
add AVX2 support to simd.h [0].  My question about using simd.h was
ultimately about abstracting the relevant Neon/SSE2 instructions and using
those for hex_encode/decode().  If that's possible, I think it'd be
interesting to see how that compares to the SVE version.

[0] https://postgr.es/m/20231129171526.GA857928%40nathanxps13

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread chiranmoy.bhattacha...@fujitsu.com
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization is working now with the following addition on Graviton 3 
(m7g.4xlarge) with GCC 11.4, and the results match yours. Previously, 
auto-vectorization had no effect because we missed the -march=native option.

  encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

  buf   | default | auto_vec |  SVE
--------+---------+----------+------
     16 |      16 |       12 |    8
     64 |      58 |       40 |    9
    256 |     223 |      152 |   18
   1024 |     934 |      613 |   54
   4096 |    3533 |     2430 |  202
  16384 |   14081 |     9831 |  800
  65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control 
flow.

-
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all 
available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machine support NEON, but for newer 
advanced SIMD like SVE (and AVX512 for x86) this assumption may not hold. We 
need a runtime check to be sure. Using src/include/port/simd.h to abstract 
away these advanced SIMD implementations may be difficult.

We will update the thread once a solution is found.

-
Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-10 Thread Nathan Bossart
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> On Fri, Jan 10, 2025 at 11:10:03AM +, chiranmoy.bhattacha...@fujitsu.com 
> wrote:
>> We tried auto-vectorization and observed no performance improvement.
> 
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

  buf   |  HEAD | patch | % diff
--------+-------+-------+--------
     16 |    21 |    16 |     24
     64 |    54 |    41 |     24
    256 |   138 |   100 |     28
   1024 |   441 |   300 |     32
   4096 |  1671 |  1106 |     34
  16384 |  6890 |  4570 |     34
  65536 | 27393 | 18054 |     34

This doesn't compare with the gains you are claiming to see with
intrinsics, but it's not bad for a one line change.  I bet there are ways
to adjust the code so that the auto-vectorization is more effective, too.

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-10 Thread Nathan Bossart
On Fri, Jan 10, 2025 at 11:10:03AM +, chiranmoy.bhattacha...@fujitsu.com 
wrote:
> We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

> The instructions in src/include/port/simd.h are based on older SIMD
> architectures like NEON, whereas the patch uses the newer SVE, so some of
> the instructions used in the patch may not have direct equivalents in
> NEON. We will check the feasibility of integrating SVE in
> "src/include/port/simd.h" and get back to you.

Thanks!

-- 
nathan




Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-10 Thread chiranmoy.bhattacha...@fujitsu.com
Hello Nathan,

We tried auto-vectorization and observed no performance improvement.
The instructions in src/include/port/simd.h are based on older SIMD 
architectures like NEON, whereas the patch uses the newer SVE, so some of the 
instructions used in the patch may not have direct equivalents in NEON. We will 
check the feasibility of integrating SVE in "src/include/port/simd.h" and get 
back to you.

The actual encoding/decoding implementation takes less than 100 lines. The rest 
of the code is related to config and the "choose" logic. One option is to move 
the implementation to a new file, making src/backend/utils/adt/encode.c less 
bloated.

Thanks,
Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-09 Thread Nathan Bossart
On Thu, Jan 09, 2025 at 11:22:05AM +, devanga.susmi...@fujitsu.com wrote:
> This email aims to discuss the contribution of optimized hex_encode and
> hex_decode functions for ARM (aarch64) machines. These functions are
> widely used for encoding and decoding binary data in the bytea data type.

Thank you for sharing this work!  I'm not able to review this in depth at
the moment, but I am curious if you considered trying to enable
auto-vectorization on the code or using the higher-level SIMD support in
src/include/port/simd.h.  Those may not show as impressive of gains as your
patch, but they would likely require much less code and apply to a wider
set of architectures.

-- 
nathan