Re: [PATCH] Hex-coding optimizations using SVE on ARM.
It seems that the patch doesn't compile on macOS: the call to 'svwhilelt_b8' is ambiguous because 'i' and 'len' are of type 'size_t' rather than 'uint64'. This appears to be a macOS-specific issue. The latest patch should resolve it by casting 'size_t' to 'uint64' before passing the values to 'svwhilelt_b8'.

[11:04:07.478] ../src/backend/utils/adt/encode.c:356:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478]   356 |         pred = svwhilelt_b8(i, len);
[11:04:07.478]       |                ^~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478]       |          ^
[11:04:07.478] ../src/backend/utils/adt/encode.c:433:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478]   433 |         pred = svwhilelt_b8(i / 2, len / 2);
[11:04:07.478]       |                ^~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478]       |          ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478]       |          ^
[11:04:07.478] 2 errors generated.

---
Chiranmoy

v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch
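For readers following along, the fix is essentially an explicit cast so that the uint64_t overload of svwhilelt_b8() is selected. A minimal sketch, not the patch itself; the helper name is made up for illustration and an SVE-enabled compiler target is assumed:

#include <stddef.h>
#include <stdint.h>
#include <arm_sve.h>            /* SVE ACLE intrinsics; requires an SVE target */

/*
 * Sketch only: build an SVE byte predicate covering [i, len).  Casting both
 * arguments to uint64_t picks the (uint64_t, uint64_t) candidate explicitly,
 * which avoids the ambiguity seen with size_t arguments on Apple clang 16.
 */
static inline svbool_t
hex_whilelt_pred(size_t i, size_t len)
{
	return svwhilelt_b8((uint64_t) i, (uint64_t) len);
}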
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Inlined the hex encode/decode functions in "src/include/utils/builtins.h", similar to pg_popcount() in pg_bitutils.h.

---
Chiranmoy

v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, Jan 22, 2025 at 10:58:09AM +, chiranmoy.bhattacha...@fujitsu.com wrote:
>> The functions that test the length before potentially calling a function
>> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
>> I wouldn't be surprised if some compilers are inlining this stuff
>> already, but it's probably worth being explicit about it.
>
> Should we implement an inline function in "utils/builtins.h", similar to
> pg_popcount()? Currently, we have not modified the header file, everything
> is statically implemented in encode.c.

Yeah, that's what I'm currently thinking we should do.

-- 
nathan
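For context, the pg_popcount()-style arrangement being discussed is a small always-inline wrapper in a header that handles short inputs directly and only reaches the out-of-line (possibly SVE-dispatched) implementation for larger ones. A rough sketch with made-up names; hex_encode_optimized, hex_encode_inline, and the 32-byte threshold are illustrative, not taken from the patch:

#include <stddef.h>
#include <stdint.h>

/* Out-of-line implementation; in practice this could be a function pointer
 * chosen at startup based on a CPU-feature check. */
extern uint64_t hex_encode_optimized(const char *src, size_t len, char *dst);

/*
 * Sketch of a header-level wrapper: for tiny inputs a plain scalar loop is
 * cheaper than an indirect call, so handle them inline and defer the rest.
 */
static inline uint64_t
hex_encode_inline(const char *src, size_t len, char *dst)
{
	static const char tbl[] = "0123456789abcdef";

	if (len < 32)				/* threshold is illustrative */
	{
		for (size_t i = 0; i < len; i++)
		{
			unsigned char c = (unsigned char) src[i];

			dst[2 * i] = tbl[c >> 4];
			dst[2 * i + 1] = tbl[c & 0xF];
		}
		return (uint64_t) len * 2;
	}
	return hex_encode_optimized(src, len, dst);
}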
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, Jan 22, 2025 at 11:10:10AM +, chiranmoy.bhattacha...@fujitsu.com wrote:
> I realized I didn't attach the patch.

Thanks.  Would you mind creating a commitfest entry for this one?

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
I realized I didn't attach the patch.

v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
> The approach looks generally reasonable to me, but IMHO the code needs
> much more commentary to explain how it works.

Added comments to explain the SVE implementation.

> I would be interested to see how your bytea test compares with the
> improvements added in commit e24d770 and with sending the data in binary.

The following are the bytea test results with commit e24d770. The same query
and tables were used.

With commit e24d770:
    Query exec time: 2.324 sec
    hex_encode function time: 0.72 sec

Pre-commit e24d770:
    Query exec time: 2.858 sec
    hex_encode function time: 1.228 sec

SVE patch:
    Query exec time: 1.654 sec
    hex_encode_sve function time: 0.085 sec

> The functions that test the length before potentially calling a function
> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
> I wouldn't be surprised if some compilers are inlining this stuff
> already, but it's probably worth being explicit about it.

Should we implement an inline function in "utils/builtins.h", similar to
pg_popcount()? Currently, we have not modified the header file, everything
is statically implemented in encode.c.

---
Chiranmoy
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
With commit e24d770 in place, I took a closer look at hex_decode(), and I
concluded that doing anything better without intrinsics would likely require
either a huge lookup table or something with complexity rivalling the
intrinsics approach (while also not rivalling its performance). So, I took a
closer look at the intrinsics patches and had the following thoughts:

* The approach looks generally reasonable to me, but IMHO the code needs
  much more commentary to explain how it works.

* The functions that test the length before potentially calling a function
  pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
  I wouldn't be surprised if some compilers are inlining this stuff already,
  but it's probably worth being explicit about it.

* Finally, I think we should ensure we've established a really strong case
  for this optimization. IME these intrinsics patches require a ton of time
  and energy, and the code is often extremely complex. I would be interested
  to see how your bytea test compares with the improvements added in commit
  e24d770 and with sending the data in binary.

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
David Rowley writes:
> I agree that the evidence you (John) gathered is enough reason to use
> memcpy().

Okay ... doesn't quite match my intuition, but intuition is a poor guide
to such things.

regards, tom lane
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, 15 Jan 2025 at 23:57, John Naylor wrote:
>
> On Wed, Jan 15, 2025 at 2:14 PM Tom Lane wrote:
> > Compilers that inline memcpy() may arrive at the same machine code,
> > but why rely on the compiler to make that optimization?  If the
> > compiler fails to do so, an out-of-line memcpy() call will surely
> > be a loser.
>
> See measurements at the end. As for compilers, gcc 3.4.6 and clang
> 3.0.0 can inline the memcpy. The manual copy above only gets combined
> to a single word starting with gcc 12 and clang 15, and latest MSVC
> still can't do it (4A in the godbolt link below). Are there any
> buildfarm animals around that may not inline memcpy for word-sized
> input?
>
> > A variant could be
> >
> > +       const char *hexptr = &hextbl[2 * usrc];
> > +       *dst++ = hexptr[0];
> > +       *dst++ = hexptr[1];

I'd personally much rather see us using memcpy() for this sort of stuff.
If the compiler is too braindead to inline tiny constant-and-power-of-two-sized
memcpys then we'd probably also have plenty of other performance issues with
that compiler already. I don't think contorting the code into something less
human-readable and something the compiler may struggle even more to optimise
is a good idea.

The naive way to implement the above requires two MOVs of single bytes and
two increments of dst. I imagine it's easier for the compiler to inline a
small constant-sized memcpy() than to figure out that it's safe to implement
the above with a single word-sized MOV rather than two byte-sized MOVs due to
the "dst++" in between the two.

I agree that the evidence you (John) gathered is enough reason to use
memcpy().

David
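To make the comparison concrete, here is a small standalone sketch of the two shapes being discussed, suitable for pasting into a compiler explorer. The function names are made up, and the table initializer is abbreviated here purely for illustration; the real patch in this thread uses the full 512-byte table, so this sketch is not a complete encoder:

#include <stddef.h>
#include <string.h>

/* Abbreviated two-characters-per-byte table; the full version covers all 256 values. */
static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f";

/* memcpy form: a constant 2-byte copy that compilers typically fold into one store. */
static void
encode_memcpy(const unsigned char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		memcpy(dst, &hextbl[2 * src[i]], 2);
		dst += 2;
	}
}

/* byte-wise form: two single-byte stores with a pointer increment in between. */
static void
encode_bytewise(const unsigned char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		unsigned char usrc = src[i];

		*dst++ = hextbl[2 * usrc];
		*dst++ = hextbl[2 * usrc + 1];
	}
}

The point of interest is whether the compiler turns the body of each loop into a single word-sized store, which is what the godbolt link in the quoted message examines.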
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Hi.

On Wed, Jan 15, 2025 at 07:57, John Naylor wrote:
> On Wed, Jan 15, 2025 at 2:14 PM Tom Lane wrote:
> >
> > Couple of thoughts:
> >
> > 1. I was actually hoping for a comment on the constant's definition,
> > perhaps along the lines of
> >
> > /*
> >  * The hex expansion of each possible byte value (two chars per value).
> >  */
>
> Works for me. With that, did you mean we then wouldn't need a comment
> in the code?
>
> > 2. Since "src" is defined as "const char *", I'm pretty sure that
> > pickier compilers will complain that
> >
> > +       unsigned char usrc = *((unsigned char *) src);
> >
> > results in casting away const.  Recommend
> >
> > +       unsigned char usrc = *((const unsigned char *) src);
>
> Thanks for the reminder!
>
> > 3. I really wonder if
> >
> > +       memcpy(dst, &hextbl[2 * usrc], 2);
> >
> > is faster than copying the two bytes manually, along the lines of
> >
> > +       *dst++ = hextbl[2 * usrc];
> > +       *dst++ = hextbl[2 * usrc + 1];
> >
> > Compilers that inline memcpy() may arrive at the same machine code,
> > but why rely on the compiler to make that optimization?  If the
> > compiler fails to do so, an out-of-line memcpy() call will surely
> > be a loser.
>
> See measurements at the end. As for compilers, gcc 3.4.6 and clang
> 3.0.0 can inline the memcpy. The manual copy above only gets combined
> to a single word starting with gcc 12 and clang 15, and latest MSVC
> still can't do it (4A in the godbolt link below). Are there any
> buildfarm animals around that may not inline memcpy for word-sized
> input?
>
> > A variant could be
> >
> > +       const char *hexptr = &hextbl[2 * usrc];
> > +       *dst++ = hexptr[0];
> > +       *dst++ = hexptr[1];
> >
> > but this supposes that the compiler fails to see the common
> > subexpression in the other formulation, which I believe
> > most modern compilers will see.
>
> This combines to a single word starting with clang 5, but does not
> work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
> on my machine bytewise load/stores are somewhere in the middle:
>
> master       1158.969 ms
> v3            776.791 ms
> variant 4A    775.777 ms
> variant 4B    969.945 ms
>
> https://godbolt.org/z/ajToordKq

Your example from godbolt has an important difference, which changes the
generated assembly:

-static const char hextbl[] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
+static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";

best regards,
Ranier Vilela
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Wed, Jan 15, 2025 at 2:14 PM Tom Lane wrote:
> Couple of thoughts:
>
> 1. I was actually hoping for a comment on the constant's definition,
> perhaps along the lines of
>
> /*
>  * The hex expansion of each possible byte value (two chars per value).
>  */

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

> 2. Since "src" is defined as "const char *", I'm pretty sure that
> pickier compilers will complain that
>
> +       unsigned char usrc = *((unsigned char *) src);
>
> results in casting away const.  Recommend
>
> +       unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

> 3. I really wonder if
>
> +       memcpy(dst, &hextbl[2 * usrc], 2);
>
> is faster than copying the two bytes manually, along the lines of
>
> +       *dst++ = hextbl[2 * usrc];
> +       *dst++ = hextbl[2 * usrc + 1];
>
> Compilers that inline memcpy() may arrive at the same machine code,
> but why rely on the compiler to make that optimization?  If the
> compiler fails to do so, an out-of-line memcpy() call will surely
> be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

> A variant could be
>
> +       const char *hexptr = &hextbl[2 * usrc];
> +       *dst++ = hexptr[0];
> +       *dst++ = hexptr[1];
>
> but this supposes that the compiler fails to see the common
> subexpression in the other formulation, which I believe
> most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master       1158.969 ms
v3            776.791 ms
variant 4A    775.777 ms
variant 4B    969.945 ms

https://godbolt.org/z/ajToordKq

-- 
John Naylor
Amazon Web Services
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
John Naylor writes:
> Okay, I added a comment. I also agree with Michael that my quick
> one-off was a bit hard to read so I've cleaned it up a bit. I plan to
> commit the attached by Friday, along with any bikeshedding that
> happens by then.

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
 * The hex expansion of each possible byte value (two chars per value).
 */

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+       unsigned char usrc = *((unsigned char *) src);

results in casting away const.  Recommend

+       unsigned char usrc = *((const unsigned char *) src);

3. I really wonder if

+       memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+       *dst++ = hextbl[2 * usrc];
+       *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization?  If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

A variant could be

+       const char *hexptr = &hextbl[2 * usrc];
+       *dst++ = hexptr[0];
+       *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

regards, tom lane
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Tue, Jan 14, 2025 at 11:57 PM Nathan Bossart wrote:
>
> On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:
> > John Naylor writes:
> >> We can do about as well simply by changing the nibble lookup to a byte
> >> lookup, which works on every compiler and architecture:
>
> Nice.  I tried enabling auto-vectorization and loop unrolling on top of
> this patch, and the numbers looked the same.  I think we'd need CPU
> intrinsics or an even bigger lookup table to do any better.

Thanks for looking further! Yeah, I like that the table is still only
512 bytes.

> > I didn't attempt to verify your patch, but I do prefer addressing
> > this issue in a machine-independent fashion.  I also like the brevity
> > of the patch (though it could do with some comments perhaps, not that
> > the existing code has any).
>
> +1

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

-- 
John Naylor
Amazon Web Services

From a62aea5fdbfbd215435ddc4c294897caa292b6f7 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Wed, 15 Jan 2025 13:28:26 +0700
Subject: [PATCH v3] Speed up hex_encode with bytewise lookup

Previously, hex_encode looked up each nibble of the input separately.
We now use a larger lookup table containing the two-byte encoding of
every possible input byte, resulting in a 1/3 reduction in encoding time.

Reviewed by Michael Paquier, Tom Lane, and Nathan Bossart
Discussion: https://postgr.es/m/CANWCAZZvXuJMgqMN4u068Yqa19CEjS31tQKZp_qFFFbgYfaXqQ%40mail.gmail.com
---
 src/backend/utils/adt/encode.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..7fee154b0d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,23 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] =
+"000102030405060708090a0b0c0d0e0f"
+"101112131415161718191a1b1c1d1e1f"
+"202122232425262728292a2b2c2d2e2f"
+"303132333435363738393a3b3c3d3e3f"
+"404142434445464748494a4b4c4d4e4f"
+"505152535455565758595a5b5c5d5e5f"
+"606162636465666768696a6b6c6d6e6f"
+"707172737475767778797a7b7c7d7e7f"
+"808182838485868788898a8b8c8d8e8f"
+"909192939495969798999a9b9c9d9e9f"
+"a0a1a2a3a4a5a6a7a8a9aaabacadaeaf"
+"b0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
+"c0c1c2c3c4c5c6c7c8c9cacbcccdcecf"
+"d0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
+"e0e1e2e3e4e5e6e7e8e9eaebecedeeef"
+"f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +181,16 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
+		unsigned char usrc = *((unsigned char *) src);
+
+		/*
+		 * Each input byte results in two output bytes, so we use the unsigned
+		 * input byte multiplied by two as the lookup key.
+		 */
+		memcpy(dst, &hextbl[2 * usrc], 2);
+
 		src++;
+		dst += 2;
 	}
 	return (uint64) len * 2;
 }
-- 
2.47.1
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:
> John Naylor writes:
>> We can do about as well simply by changing the nibble lookup to a byte
>> lookup, which works on every compiler and architecture:

Nice.  I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same.  I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

> I didn't attempt to verify your patch, but I do prefer addressing
> this issue in a machine-independent fashion.  I also like the brevity
> of the patch (though it could do with some comments perhaps, not that
> the existing code has any).

+1

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
John Naylor writes:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion.  I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

regards, tom lane
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Tue, Jan 14, 2025 at 12:27:30PM +0700, John Naylor wrote:
> We can do about as well simply by changing the nibble lookup to a byte
> lookup, which works on every compiler and architecture:
>
> select hex_encode_test(100, 1024);
>
> master:
> Time: 1158.700 ms
> v2:
> Time: 777.443 ms
>
> If we need to do much better than this, it seems better to send the
> data to the client as binary, if possible.

That's pretty cool.  Complex to parse, still really cool.

--
Michael

signature.asc
Description: PGP signature
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Sat, Jan 11, 2025 at 3:46 AM Nathan Bossart wrote:
>
> I was able to get auto-vectorization to take effect on Apple clang 16 with
> the following addition to src/backend/utils/adt/Makefile:
>
>     encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8
>
> This gave the following results with your hex_encode_test() function:
>
>    buf  | HEAD  | patch | % diff
>  -------+-------+-------+--------
>      16 |    21 |    16 |     24
>      64 |    54 |    41 |     24
>     256 |   138 |   100 |     28
>    1024 |   441 |   300 |     32
>    4096 |  1671 |  1106 |     34
>   16384 |  6890 |  4570 |     34
>   65536 | 27393 | 18054 |     34

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(100, 1024);

master:
Time: 1158.700 ms
v2:
Time: 777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.

-- 
John Naylor
Amazon Web Services

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..8b059bc834 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,7 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +165,8 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
-		src++;
+		memcpy(dst, &hextbl[(* ((unsigned char *) src)) * 2], 2);
+		src++; dst+=2;
 	}
 	return (uint64) len * 2;
 }
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Mon, Jan 13, 2025 at 03:48:49PM +, chiranmoy.bhattacha...@fujitsu.com wrote:
> There is a 30% improvement using auto-vectorization.

It might be worth enabling auto-vectorization independently of any patches
that use intrinsics, then.

> Currently, it is assumed that all aarch64 machines support NEON, but for
> newer advanced SIMD extensions like SVE (and AVX512 for x86) this
> assumption may not hold. We need a runtime check to be sure. Using
> src/include/port/simd.h to abstract away these advanced SIMD
> implementations may be difficult.

Yeah, moving simd.h to anything beyond Neon/SSE2 might be tricky at the
moment.  Besides the need for additional runtime checks, using wider
registers can mean that you need more data before an optimization takes
effect, which is effectively a regression.  I ran into this when I tried
to add AVX2 support to simd.h [0].

My question about using simd.h was ultimately about abstracting the
relevant Neon/SSE2 instructions and using those for hex_encode/decode().
If that's possible, I think it'd be interesting to see how that compares
to the SVE version.

[0] https://postgr.es/m/20231129171526.GA857928%40nathanxps13

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization is working now with the following addition on Graviton 3
(m7g.4xlarge) with GCC 11.4, and the results match yours. Previously,
auto-vectorization had no effect because we missed the -march=native option.

    encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

   buf  | default | auto_vec |  SVE
 -------+---------+----------+------
     16 |      16 |       12 |    8
     64 |      58 |       40 |    9
    256 |     223 |      152 |   18
   1024 |     934 |      613 |   54
   4096 |    3533 |     2430 |  202
  16384 |   14081 |     9831 |  800
  65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of
control flow.

-----

Here is a comment snippet from src/include/port/simd.h:

    "While Neon support is technically optional for aarch64, it appears
    that all available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for
newer advanced SIMD extensions like SVE (and AVX512 for x86) this assumption
may not hold. We need a runtime check to be sure. Using
src/include/port/simd.h to abstract away these advanced SIMD implementations
may be difficult. We will update the thread once a solution is found.

-
Chiranmoy
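Regarding the runtime check mentioned above: on Linux, SVE availability can be probed through the hwcap bits, roughly as in the sketch below. This is Linux-specific and the function name is purely illustrative; other platforms need different mechanisms, and a real patch would integrate this with however the project chooses its implementation:

#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>		/* getauxval() */
#include <asm/hwcap.h>		/* HWCAP_SVE */
#endif

/*
 * Sketch only: report whether the CPU and kernel expose SVE.  A real
 * implementation would cache the result and use it in a "choose" step that
 * sets a function pointer to either the SVE or the scalar code path.
 */
static bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return false;
#endif
}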
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> On Fri, Jan 10, 2025 at 11:10:03AM +, chiranmoy.bhattacha...@fujitsu.com wrote:
>> We tried auto-vectorization and observed no performance improvement.
>
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

    encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

   buf  | HEAD  | patch | % diff
 -------+-------+-------+--------
     16 |    21 |    16 |     24
     64 |    54 |    41 |     24
    256 |   138 |   100 |     28
   1024 |   441 |   300 |     32
   4096 |  1671 |  1106 |     34
  16384 |  6890 |  4570 |     34
  65536 | 27393 | 18054 |     34

This doesn't compare with the gains you are claiming to see with intrinsics,
but it's not bad for a one line change.  I bet there are ways to adjust the
code so that the auto-vectorization is more effective, too.

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Fri, Jan 10, 2025 at 11:10:03AM +, chiranmoy.bhattacha...@fujitsu.com wrote:
> We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

> The instructions in src/include/port/simd.h are based on older SIMD
> architectures like NEON, whereas the patch uses the newer SVE, so some of
> the instructions used in the patch may not have direct equivalents in
> NEON. We will check the feasibility of integrating SVE in
> "src/include/port/simd.h" and get back to you.

Thanks!

-- 
nathan
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Hello Nathan,

We tried auto-vectorization and observed no performance improvement.

The instructions in src/include/port/simd.h are based on older SIMD
architectures like NEON, whereas the patch uses the newer SVE, so some of
the instructions used in the patch may not have direct equivalents in NEON.
We will check the feasibility of integrating SVE in
"src/include/port/simd.h" and get back to you.

The actual encoding/decoding implementation takes less than 100 lines. The
rest of the code is related to config and the "choose" logic. One option is
to move the implementation to a new file, making
src/backend/utils/adt/encode.c less bloated.

Thanks,
Chiranmoy
Re: [PATCH] Hex-coding optimizations using SVE on ARM.
On Thu, Jan 09, 2025 at 11:22:05AM +, devanga.susmi...@fujitsu.com wrote:
> This email aims to discuss the contribution of optimized hex_encode and
> hex_decode functions for ARM (aarch64) machines. These functions are
> widely used for encoding and decoding binary data in the bytea data type.

Thank you for sharing this work!  I'm not able to review this in depth at
the moment, but I am curious if you considered trying to enable
auto-vectorization on the code or using the higher-level SIMD support in
src/include/port/simd.h.  Those may not show as impressive of gains as your
patch, but they would likely require much less code and apply to a wider
set of architectures.

-- 
nathan