Re: Optimize Arm64 crc32c implementation in Postgresql
Andrew Gierthwrites: > "Tom" == Tom Lane writes: > Tom> I also noticed that we'd been sloppy about making the file safe to > Tom> compile for both frontend and backend, so I cleaned that up. > In a frontend, wouldn't it be more kosher to restore the previous SIGILL > handler rather than unconditionally reset it to SIG_DFL? If we had any other code that was setting the SIGILL trap, I might worry about that, but we don't. The whole thing is really a bit questionable to run in arbitrary environments -- for instance, it'd be pretty unsafe inside a threaded application. So if we had code in libpq or ecpg that computed CRCs, I'd be worrying about this approach quite a bit more. But it seems all right for current and foreseen uses. regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
> "Tom" == Tom Lanewrites: Tom> I also noticed that we'd been sloppy about making the file safe to Tom> compile for both frontend and backend, so I cleaned that up. In a frontend, wouldn't it be more kosher to restore the previous SIGILL handler rather than unconditionally reset it to SIG_DFL? -- Andrew (irc:RhodiumToad)
Re: Optimize Arm64 crc32c implementation in Postgresql
Thomas Munrowrites: > Let me try that again with that stupid typo (crc2) fixed... I didn't like that too much as-is, because it was capable of calling elog(ERROR) without having reset the SIGILL trap first. That's just trouble waiting to happen, so I rearranged to avoid it. I also noticed that we'd been sloppy about making the file safe to compile for both frontend and backend, so I cleaned that up. Also, I had thought that maybe the postmaster should do something to ensure that it sets up the function pointer, so that child processes inherit the correct pointer via fork() and don't need to repeat the test (and then possibly spam the postmaster log). On closer inspection, no new code is needed because ReadControlFile runs a CRC check, but I felt it was worth documenting that. regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, May 3, 2018 at 5:18 PM, Thomas Munrowrote: > 2018-05-03 05:07:25.904 UTC [19677] DEBUG: using armv8 crc2 hardware = 1 Let me try that again with that stupid typo (crc2) fixed... -- Thomas Munro http://www.enterprisedb.com 0001-Fix-endianness-bug-in-ARMv8-CRC32-detection-v2.patch Description: Binary data
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, May 3, 2018 at 4:48 PM, Tom Lanewrote: > Thomas Munro writes: >> On Thu, May 3, 2018 at 4:04 PM, Tom Lane wrote: >>> It strikes me also that, at least for debugging purposes, it's seriously >>> awful that you can't tell from outside what result this function got. > >> I don't think *broken* CPUs are something we need to handle, are they? > > I'm not worried so much about broken hardware as about scenarios like > "Munro got the magic constant wrong and nobody ever noticed", or more > likely "somebody broke it later and we didn't notice". We absolutely > do not expect the code path with function-returns-the-wrong-answer to be > taken, and I think it would be appropriate to complain loudly if it is. Ok. Here is a patch that compares hw and sw results and calls elog(ERROR) if they don't match. It also does elog(DEBUG1) with its result just before returning. Here's what I see at startup on my ARMv8 machine when I set log_min_messages = debug1 in my .conf (it's the very first line emitted): 2018-05-03 05:07:25.904 UTC [19677] DEBUG: using armv8 crc2 hardware = 1 Here's what I see if I hack the _armv8() function to do kill(getpid(), SIGILL): 2018-05-03 05:09:47.012 UTC [21079] DEBUG: using armv8 crc2 hardware = 0 Here's what I see if I hack the _armv8() function to add 1 to its result: 2018-05-03 05:11:07.366 UTC [22218] FATAL: crc32 hardware and software results disagree 2018-05-03 05:11:07.367 UTC [22218] LOG: database system is shut down -- Thomas Munro http://www.enterprisedb.com 0001-Fix-endianness-bug-in-ARMv8-CRC32-detection.patch Description: Binary data
Re: Optimize Arm64 crc32c implementation in Postgresql
Thomas Munrowrites: > On Thu, May 3, 2018 at 4:04 PM, Tom Lane wrote: >> It strikes me also that, at least for debugging purposes, it's seriously >> awful that you can't tell from outside what result this function got. > I don't think *broken* CPUs are something we need to handle, are they? I'm not worried so much about broken hardware as about scenarios like "Munro got the magic constant wrong and nobody ever noticed", or more likely "somebody broke it later and we didn't notice". We absolutely do not expect the code path with function-returns-the-wrong-answer to be taken, and I think it would be appropriate to complain loudly if it is. regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, May 3, 2018 at 4:04 PM, Tom Lanewrote: > I wrote: >> Andrew Gierth writes: >>> Isn't there a hidden assumption about endianness there? Right, thanks. >> Yeah, I was wondering about that too, but does anyone actually run >> ARMs big-endian? I think all the distros I follow dropped big endian support in recent years, but that's no excuse. > After a bit more thought ... we could remove the magic constant, > along with any assumptions about endianness, by doing it like this: > > result = (pg_comp_crc32c_armv8(0, , sizeof(data)) == > pg_comp_crc32c_sb8(0, , sizeof(data))); > > Of course we'd eat a few more cycles during the test, but it's hard > to see that mattering. I was thinking of proposing this: - uint64 data = 42; + uint64 data = UINT64CONST(0x4242424242424242); ... and: - result = (pg_comp_crc32c_armv8(0, , sizeof(data)) == 0xdd439b0d); + result = (pg_comp_crc32c_armv8(0, , sizeof(data)) == 0x54eb145b); No strong preference though. > It strikes me also that, at least for debugging purposes, it's seriously > awful that you can't tell from outside what result this function got. > Certainly the outcome that "pg_comp_crc32c_armv8 executed but returned > the wrong answer" is not something that ought to go unremarked. > We could elog the results, but I'm not very sure what log level is > appropriate --- also, I doubt we want another log entry from every > launched process, so how to prevent that? I don't think *broken* CPUs are something we need to handle, are they? Hmm, I suppose there might be a hypothetical operating system or ARM chip that somehow doesn't raise SIGILL and returns garbage, and then it's nice to fall back to the software version (unless you're 1/2^32 unlucky). For debugging I was just putting a temporary elog in. Leaving one in there at one of the DEBUG levels seems reasonable though. -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
I wrote: > Andrew Gierthwrites: >> Isn't there a hidden assumption about endianness there? > Yeah, I was wondering about that too, but does anyone actually run > ARMs big-endian? After a bit more thought ... we could remove the magic constant, along with any assumptions about endianness, by doing it like this: result = (pg_comp_crc32c_armv8(0, , sizeof(data)) == pg_comp_crc32c_sb8(0, , sizeof(data))); Of course we'd eat a few more cycles during the test, but it's hard to see that mattering. It strikes me also that, at least for debugging purposes, it's seriously awful that you can't tell from outside what result this function got. Certainly the outcome that "pg_comp_crc32c_armv8 executed but returned the wrong answer" is not something that ought to go unremarked. We could elog the results, but I'm not very sure what log level is appropriate --- also, I doubt we want another log entry from every launched process, so how to prevent that? regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
Andrew Gierthwrites: > "Thomas" == Thomas Munro writes: > + uint64 data = 42; > Isn't there a hidden assumption about endianness there? Yeah, I was wondering about that too, but does anyone actually run ARMs big-endian? regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
> "Thomas" == Thomas Munrowrites: + uint64 data = 42; Isn't there a hidden assumption about endianness there? -- Andrew (irc:RhodiumToad)
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, May 3, 2018 at 10:08 AM, Tom Lanewrote: > I don't have any way to test this, but it looks plausible, so pushed. > > (I note you forgot to run autoheader, btw.) Thanks! -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
Thomas Munrowrites: > On Thu, May 3, 2018 at 2:30 AM, Tom Lane wrote: >> Do you really need the pg_crc32c_armv8_choose_dummy global variable? > True. Of course I needed an interesting length too... I don't have any way to test this, but it looks plausible, so pushed. (I note you forgot to run autoheader, btw.) regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
Thomas Munrowrites: > Ahh, OpenSSL's armcap.c shows how to do this. You need to > siglongjmp() out of there. Here's a patch that does it that way. > Isn't this better? Do you really need the pg_crc32c_armv8_choose_dummy global variable? That seems pretty ugly. If you're concerned about the compiler optimizing away the call to the crc function, you could write it like result = (pg_comp_crc32c_armv8(0, 0, 0) == expected-value); which'd provide a bit of extra checking that the code's not broken, too. regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, Apr 5, 2018 at 12:00 AM, Thomas Munrowrote: > ... trying to figure out how to detect the instruction portably using SIGILL > ... Ahh, OpenSSL's armcap.c shows how to do this. You need to siglongjmp() out of there. Here's a patch that does it that way. Isn't this better? I tested this on a Linux ARM system that has the instruction, and I put a kill(getpid(), SIGILL) in there to test the negative case because I don't have access to an ARM system without the instruction. I don't have a FreeBSD/ARM system to test on either but I checked that the flow control technique works fine on FreeBSD on another architecture when it hits an instruction it doesn't support. -- Thomas Munro http://www.enterprisedb.com 0001-Use-a-portable-way-to-detect-ARMv8-CRC32-hardware.patch Description: Binary data
Re: Optimize Arm64 crc32c implementation in Postgresql
On Wed, Apr 4, 2018 at 11:47 PM, Thomas Munrowrote: > BTW I did some googling just now and found out that some libraries use > a technique they call "CPU probing": just try it and see if you get > SIGILL. Is that a bad idea for some reason? Here is a quick hack -- > anyone got an ARM system without crc that they could test it on? Hmm. I suppose there must be some horrendous non-portable 'step over the instruction' bit missing in the signal handler. Probably not a great idea. -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
On Wed, Apr 4, 2018 at 11:21 PM, Heikki Linnakangaswrote: > Yep. I got the code before the main loop, to handle the first 1-7 unaligned > bytes, wrong. Apparently those are the only tests that call the CRC function > with very short and unaligned input. BTW I did some googling just now and found out that some libraries use a technique they call "CPU probing": just try it and see if you get SIGILL. Is that a bad idea for some reason? Here is a quick hack -- anyone got an ARM system without crc that they could test it on? -- Thomas Munro http://www.enterprisedb.com cpu-probe-hack.patch Description: Binary data
Re: Optimize Arm64 crc32c implementation in Postgresql
On 04/04/18 14:13, Thomas Munro wrote: On Wed, Apr 4, 2018 at 9:23 PM, Heikki Linnakangaswrote: Pushed, thanks everyone! On eelpout two test_decoding tests failed: test ddl ... FAILED test rewrite ... FAILED The output has several cases where pg_logical_slot_get_changes() is invoked and then fails like this: SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: incorrect resource manager data checksum in record at 0/1BD6600 Not sure why it would break when called by pg_logical_slot_get_changes()... https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout=2018-04-04%2009%3A58%3A56 Yep. I got the code before the main loop, to handle the first 1-7 unaligned bytes, wrong. Apparently those are the only tests that call the CRC function with very short and unaligned input. I've got a fix ready, and I'm running "make check-world" on my ARM box now, to make sure there aren't any more failures. I'll push the fix once that's finished. Should've run the full test suite before pushing in the first place.. - Heikki
Re: Optimize Arm64 crc32c implementation in Postgresql
On Wed, Apr 4, 2018 at 9:23 PM, Heikki Linnakangaswrote: > Pushed, thanks everyone! On eelpout two test_decoding tests failed: test ddl ... FAILED test rewrite ... FAILED The output has several cases where pg_logical_slot_get_changes() is invoked and then fails like this: SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: incorrect resource manager data checksum in record at 0/1BD6600 Not sure why it would break when called by pg_logical_slot_get_changes()... https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout=2018-04-04%2009%3A58%3A56 -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
On 03/04/18 19:43, Andres Freund wrote: Architecture manual time? They're available freely IIRC and should answer this. Yeah. The best reference I could find was "ARM Cortex-A Series Programmer’s Guide for ARMv8-A" (http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/ch08s01.html). In the "Porting to A64" section, it says: Data and code must be aligned to appropriate boundaries. The alignment of accesses can affect performance on ARM cores and can represent a portability problem when moving code from an earlier architecture to ARMv8-A. It is worth being aware of alignment issues for performance reasons, or when porting code that makes assumptions about pointers or 32-bit and 64-bit integer variables. I was a bit surprised by the "must be aligned to appropriate boundaries" statement. Googling around, the strict alignment requirement was removed in ARMv7, and since then, unaligned access works similarly to Intel. I think there are some special instructions, like atomic ops, that require alignment though. Perhaps that's what that sentence refers to. On 03/04/18 20:47, Tom Lane wrote: I'm pretty sure that some ARM platforms emulate unaligned access through kernel trap handlers, which would certainly make this a lot slower than handling the unaligned bytes manually. Maybe that doesn't apply to any ARM CPU that has this instruction ... but as you said, it'd be better to consider the presence of the instruction as orthogonal to other CPU features. I did some quick testing, and found that unaligned access is about 2x slower than aligned. I don't think it's being trapped by the kernel, I think that would be even slower, but clearly there is an effect there. So I added code to process the first 1-7 bytes separately, so that the main loop runs on 8-byte aligned addresses. Pushed, thanks everyone! - Heikki
Re: Optimize Arm64 crc32c implementation in Postgresql
On 03/04/18 21:56, Daniel Gustafsson wrote: The following line should say SSE and not SSSE (and the same typo is in src/include/pg_config.h.in and src/include/pg_config.h.win32). While not introduced in this patch, it’s adjacent to the patched codepath and on topic so may well be fixed while in there. AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.]) I pushed that as a separate commit, as that goes back to 9.5. Also, I noticed that the description of USE_SLICING_BY_8_CRC32C was completely wrong, fixed that too. Thanks! - Heikki
Re: Optimize Arm64 crc32c implementation in Postgresql
On Wed, Apr 4, 2018 at 4:38 AM, Heikki Linnakangaswrote: > [armv8-crc32c-2.patch] This tests OK on my Debian buster aarch64 system (the machine that runs "eelpout" in the buildfarm), configure tests as expected and I confirmed that pg_comp_crc32c_armv8 is reached at runtime. I hope we can figure out a more portable way to detect the instructions, or failing that a way to detect them on FreeBSD in userspace. I'll try to send a patch for PG12 if I get a chance. No opinion on the unaligned memory access question. -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
> On 03 Apr 2018, at 18:38, Heikki Linnakangaswrote: > > On 03/04/18 19:09, Andres Freund wrote: >> Hi, >> On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote: >>> On 01/04/18 20:32, Andres Freund wrote: On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote: > * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine > that I had available (not an emulator, but a VM on a shared ARM64 server). Have you seen actual postgres performance benefits with the patch? >>> >>> I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen >>> ran with the Slicing-by-8 and Intel SSE patches, when we added those >>> (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org). >>> I ran pgbench, with scale factor 5, until it had generated about 1 GB of >>> WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took >>> about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed >>> that the CRC computation took about 30% of the CPU time before, and about >>> 12% after, which sounds about right. That's not as big a speedup as we saw >>> with the corresponding Intel SSE instructions back in 2014, but still quite >>> worthwhile. >> Cool. Based on a skim the patch looks reasonable. > > Thanks. > > I bikeshedded with myself on the naming of things, and decided to use "ARMv8" > in the variable and file names, instead of ARM64 or ARMCE or ARM64CE. The CRC > instructions were introduced in ARM v8 (as an optional feature), it's not > really related to the 64-bitness, even though the 64-bit instruction set was > also introduced in ARM v8. Other than that, and some comment fixes, this is > the same as the previous patch version. Noticed two minor documentation issues in a quick skim of the attached patch: The following line should say SSE and not SSSE (and the same typo is in src/include/pg_config.h.in and src/include/pg_config.h.win32). While not introduced in this patch, it’s adjacent to the patched codepath and on topic so may well be fixed while in there. AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.]) The documentation in configure.in use “runtime” rather than "run time” in all lines except this one: +# uses the CRC instructions, compile both, and select at run time. cheers ./daniel
Re: Optimize Arm64 crc32c implementation in Postgresql
Heikki Linnakangaswrites: > I was just about to commit this, when I started to wonder: Do we need to > worry about alignment? As the patch stands, it will merrily do unaligned > 8-byte loads. Is that OK on ARM? It seems to work on the system I've > been testing on, but I don't know. And even if it's OK, would it perform > better if we did 1-byte loads in the beginning, until we reach the > 8-byte boundary? I'm pretty sure that some ARM platforms emulate unaligned access through kernel trap handlers, which would certainly make this a lot slower than handling the unaligned bytes manually. Maybe that doesn't apply to any ARM CPU that has this instruction ... but as you said, it'd be better to consider the presence of the instruction as orthogonal to other CPU features. regards, tom lane
Re: Optimize Arm64 crc32c implementation in Postgresql
Hi, On 2018-04-03 19:38:42 +0300, Heikki Linnakangas wrote: > I was just about to commit this, when I started to wonder: Do we need to > worry about alignment? As the patch stands, it will merrily do unaligned > 8-byte loads. Is that OK on ARM? It seems to work on the system I've been > testing on, but I don't know. And even if it's OK, would it perform better > if we did 1-byte loads in the beginning, until we reach the 8-byte boundary? Architecture manual time? They're available freely IIRC and should answer this. - Andres
Re: Optimize Arm64 crc32c implementation in Postgresql
On 03/04/18 19:09, Andres Freund wrote: Hi, On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote: On 01/04/18 20:32, Andres Freund wrote: On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote: * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine that I had available (not an emulator, but a VM on a shared ARM64 server). Have you seen actual postgres performance benefits with the patch? I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen ran with the Slicing-by-8 and Intel SSE patches, when we added those (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org). I ran pgbench, with scale factor 5, until it had generated about 1 GB of WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed that the CRC computation took about 30% of the CPU time before, and about 12% after, which sounds about right. That's not as big a speedup as we saw with the corresponding Intel SSE instructions back in 2014, but still quite worthwhile. Cool. Based on a skim the patch looks reasonable. Thanks. I bikeshedded with myself on the naming of things, and decided to use "ARMv8" in the variable and file names, instead of ARM64 or ARMCE or ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional feature), it's not really related to the 64-bitness, even though the 64-bit instruction set was also introduced in ARM v8. Other than that, and some comment fixes, this is the same as the previous patch version. I was just about to commit this, when I started to wonder: Do we need to worry about alignment? As the patch stands, it will merrily do unaligned 8-byte loads. Is that OK on ARM? It seems to work on the system I've been testing on, but I don't know. And even if it's OK, would it perform better if we did 1-byte loads in the beginning, until we reach the 8-byte boundary? - Heikki diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index b518851441..ba5c40db01 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -667,3 +667,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS + + +# PGAC_ARMV8_CRC32C_INTRINSICS +# --- +# Check if the compiler supports the CRC32C instructions using the __crc32cb, +# __crc32ch, __crc32cw, and __crc32cd intrinsic functions. These instructions +# were first introduced in ARMv8 in the optional CRC Extension, and became +# mandatory in ARMv8.1. +# +# An optional compiler flag can be passed as argument (e.g. +# -march=armv8-a+crc). If the intrinsics are supported, sets +# pgac_armv8_crc32c_intrinsics, and CFLAGS_ARMV8_CRC32C. +AC_DEFUN([PGAC_ARMV8_CRC32C_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_armv8_crc32c_intrinsics_$1])])dnl +AC_CACHE_CHECK([for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [unsigned int crc = 0; + crc = __crc32cb(crc, 0); + crc = __crc32ch(crc, 0); + crc = __crc32cw(crc, 0); + crc = __crc32cd(crc, 0); + /* return computed value, to prevent the above being optimized away */ + return crc == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_ARMV8_CRC32C="$1" + pgac_armv8_crc32c_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_ARMV8_CRC32C_INTRINSICS diff --git a/configure b/configure index 5c56f21282..561b5c419e 100755 --- a/configure +++ b/configure @@ -646,6 +646,7 @@ MSGMERGE MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS +CFLAGS_ARMV8_CRC32C CFLAGS_SSE42 have_win32_dbghelp HAVE_IPV6 @@ -17254,28 +17255,175 @@ if ac_fn_c_try_compile "$LINENO"; then : fi rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +# Check for ARMv8 CRC Extension intrinsics to do CRC calculations. +# +# First check if __crc32c* intrinsics can be used with the default compiler +# flags. If not, check if adding -march=armv8-a+crc flag helps. +# CFLAGS_ARMV8_CRC32C is set if the extra flag is required. +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=" >&5 +$as_echo_n "checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=... " >&6; } +if ${pgac_cv_armv8_crc32c_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int crc = 0; + crc = __crc32cb(crc, 0); + crc = __crc32ch(crc, 0); + crc = __crc32cw(crc, 0); + crc = __crc32cd(crc, 0); + /* return computed value, to prevent the above being optimized away */ + return crc == 0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : +
Re: Optimize Arm64 crc32c implementation in Postgresql
Hi, On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote: > On 01/04/18 20:32, Andres Freund wrote: > > On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote: > > > * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine > > > that I had available (not an emulator, but a VM on a shared ARM64 server). > > > > Have you seen actual postgres performance benefits with the patch? > > I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen > ran with the Slicing-by-8 and Intel SSE patches, when we added those > (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org). > I ran pgbench, with scale factor 5, until it had generated about 1 GB of > WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took > about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed > that the CRC computation took about 30% of the CPU time before, and about > 12% after, which sounds about right. That's not as big a speedup as we saw > with the corresponding Intel SSE instructions back in 2014, but still quite > worthwhile. Cool. Based on a skim the patch looks reasonable. It's a bit sad that it's effecively linux specific. But I'm not sure we can do anything about that atm, given the state of the "discovery" APIs on various platforms. Greetings, Andres Freund
Re: Optimize Arm64 crc32c implementation in Postgresql
On 01/04/18 20:32, Andres Freund wrote: On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote: * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine that I had available (not an emulator, but a VM on a shared ARM64 server). Have you seen actual postgres performance benefits with the patch? I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen ran with the Slicing-by-8 and Intel SSE patches, when we added those (https://www.postgresql.org/message-id/20141119155811.GA32492%40toroid.org). I ran pgbench, with scale factor 5, until it had generated about 1 GB of WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed that the CRC computation took about 30% of the CPU time before, and about 12% after, which sounds about right. That's not as big a speedup as we saw with the corresponding Intel SSE instructions back in 2014, but still quite worthwhile. - Heikki
Re: Optimize Arm64 crc32c implementation in Postgresql
Hi, On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote: > On 02/03/18 06:42, Andres Freund wrote: > > On 2018-03-02 11:37:52 +1300, Thomas Munro wrote: > > > So... that stuff probably needs either a configure check for the > > > getauxval function and/or those headers, or an OS check? > > > > It'd probably be better to not rely on os specific headers, and instead > > directly access the capabilities. > > Anyone got an idea on how to do that? I googled around a bit, but couldn't > find any examples. Similar... > * Use compiler intrinsics instead of inline assembly. +many > * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine > that I had available (not an emulator, but a VM on a shared ARM64 server). Have you seen actual postgres performance benefits with the patch? - Andres
Re: Optimize Arm64 crc32c implementation in Postgresql
On 02/03/18 06:42, Andres Freund wrote: On 2018-03-02 11:37:52 +1300, Thomas Munro wrote: So... that stuff probably needs either a configure check for the getauxval function and/or those headers, or an OS check? It'd probably be better to not rely on os specific headers, and instead directly access the capabilities. Anyone got an idea on how to do that? I googled around a bit, but couldn't find any examples. All the examples I could find very Linux-specific, and used getauxval(), except for this in the FreeBSD kernel itself: https://github.com/freebsd/freebsd/blob/master/sys/libkern/crc32.c#L775. I'm no expert on FreeBSD, but that doesn't seem suitable for use in a user program. In any case, I reworked this patch to follow the example of the existing code more closely. Notable changes: * Use compiler intrinsics instead of inline assembly. * If the target architecture has them, use the CRC instructions without a runtime check. You'll get that if you use "CFLAGS=armv8.1-a", for example, as the CRC Extension was made mandatory in ARM v8.1. This should work even on FreeBSD or other non-Linux systems, where getauxval() is not available. * I removed the loop to handle two uint64's at a time, using the LDP instruction. I couldn't find a compiler intrinsic for that, and it was actually slower, at least on the system I have access to, than a straightforward loop that processes 8 bytes at a time. * I tested this on Linux, with gcc and clang, on an ARM64 virtual machine that I had available (not an emulator, but a VM on a shared ARM64 server). - Heikki diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 689bb7f181..d530cf92c0 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -627,3 +627,34 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS + + +# PGAC_ARM64CE_CRC32C_INTRINSICS +# --- +# Check if the compiler supports the ARM64CE CRC32C instructions added in XXX +# using the __crc32cb, __crc32ch, __crc32cw, and __crc32cd intrinsic functions. +# +# An optional compiler flag can be passed as argument (e.g. -march=+crc). If the +# intrinsics are supported, sets pgac_arm64ce_crc32c_intrinsics, and CFLAGS_ARM64CE_CRC32C. +AC_DEFUN([PGAC_ARM64CE_CRC32C_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm64ce_crc32c_intrinsics_$1])])dnl +AC_CACHE_CHECK([for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [unsigned int crc = 0; + crc = __crc32cb(crc, 0); + crc = __crc32ch(crc, 0); + crc = __crc32cw(crc, 0); + crc = __crc32cd(crc, 0); + /* return computed value, to prevent the above being optimized away */ + return crc == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_ARM64CE_CRC32C="$1" + pgac_arm64ce_crc32c_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_ARM64CE_CRC32C_INTRINSICS diff --git a/configure b/configure index 1242e310b4..9b1389df92 100755 --- a/configure +++ b/configure @@ -646,6 +646,7 @@ MSGMERGE MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS +CFLAGS_ARM64CE_CRC32C CFLAGS_SSE42 have_win32_dbghelp HAVE_IPV6 @@ -15509,28 +15510,175 @@ if ac_fn_c_try_compile "$LINENO"; then : fi rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +# Check for ARM64 CRC Extensions intrinsics to do CRC calculations. +# +# First check if __crc32c* intrinsics can be used with the default compiler +# flags. If not, check if adding -march=v8-a+crc flag helps. +# CFLAGS_ARM64CE_CRC32C is set if that's required. +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=" >&5 +$as_echo_n "checking for __crc32cb, __crc32ch, __crc32cw, and __crc32cd with CFLAGS=... " >&6; } +if ${pgac_cv_arm64ce_crc32c_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int crc = 0; + crc = __crc32cb(crc, 0); + crc = __crc32ch(crc, 0); + crc = __crc32cw(crc, 0); + crc = __crc32cd(crc, 0); + /* return computed value, to prevent the above being optimized away */ + return crc == 0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_arm64ce_crc32c_intrinsics_=yes +else + pgac_cv_arm64ce_crc32c_intrinsics_=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +CFLAGS="$pgac_save_CFLAGS" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm64ce_crc32c_intrinsics_" >&5 +$as_echo "$pgac_cv_arm64ce_crc32c_intrinsics_" >&6; } +if test x"$pgac_cv_arm64ce_crc32c_intrinsics_" = x"yes"; then + CFLAGS_ARM64CE_CRC32C="" + pgac_arm64ce_crc32c_intrinsics=yes +fi + +if
Re: Optimize Arm64 crc32c implementation in Postgresql
Hi, On 2018-03-02 09:42:22 +, Yuqi Gu wrote: > > Could you show whether it actually improves performance? Usually bulk > > loading data with parallel COPYs is a good way to hit this code path. > > The mini benchmark code: I'd be more interested in a benchmark using postgres itself... > > What's the availability of these headers and functions on non-linux > > platforms? > > This Arm64 optimization code only supports linux os so far. That's not ok for postgres, unfortunately... > What does it mean "stock autoconf" ? > > Should the configure script be made by specific version autoconf ? Yes, it's the version from the autoconf project, rather than with modifications by distributions. Don't worry about it, the committer can take care of that. - Andres
RE: Optimize Arm64 crc32c implementation in Postgresql
Hi, Thanks for all your comments. > Could you show whether it actually improves performance? Usually bulk loading > data with parallel COPYs is a good way to hit this code path. The mini benchmark code: #include #include #include #include #include "port/pg_crc32c.h" long int GetTickCount() { struct timeval tv; gettimeofday(, NULL); return tv.tv_sec * 100 + tv.tv_usec; } int main() { static const uint64_t kSize = 1024 * 1024 + 29; uint8_t* buf = (uint8_t *)malloc(sizeof(uint8_t) * kSize); uint32_t i; srand(0); for (i = 0; i < kSize; i++) { buf[i] = (uint8_t)(rand() % 256u); } uint32_t kLoop = 1024; long int start, end; uint32_t crc = 0x; start = GetTickCount(); for (i = 0; i < kLoop; i++) { COMP_CRC32C(crc, buf, kSize); } end = GetTickCount(); crc ^= 0x; if (kSize < 20) { for (i = 0; i < kSize; i++) { printf("%3u,", (uint32_t)buf[i]); } printf("\n"); } printf("crc result = %x, time cost per loop:%f ms\n", crc, (double)(end - start) / kLoop); free(buf); return 0; } The result shows that optimization get x4.5 speedups on Cortex-A72. > What's the availability of these headers and functions on non-linux platforms? This Arm64 optimization code only supports linux os so far. >asm(".arch_extension crc"); >#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : [c]"+r"(*) >: [v]"r"(+value)) It just adds crc flag for Arm gnu gcc compiler. 'crc32cb' might not be recognized in windows or xxxBSD. So, is it reasonable to add an OS check for Arm64 crc32 optimization. It means the application calls Arm64 crc32 interface only in linux based OS. > From the department of trivialities, our coding style has braces like this: > > if (length & sizeof(uint16)) > { > CRC32CH(crc32_c, *(uint16*)p_buf); >p_buf += sizeof(uint16); > } > if (length & sizeof(uint8)) > CRC32CB(crc32_c, *p_buf); Got it. I'll modify it. > + --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run] > > What is this for? >Those probably are damage from using a distribution autoconf rather than stock >autoconf. What does it mean "stock autoconf" ? Should the configure script be made by specific version autoconf ? Thanks! BRs, Yuqi -Original Message- From: Andres Freund [mailto:and...@anarazel.de] Sent: Friday, March 2, 2018 5:36 AM To: Yuqi Gu <yuqi...@arm.com> Cc: pgsql-hack...@postgresql.org Subject: Re: Optimize Arm64 crc32c implementation in Postgresql Hi, On 2018-01-10 05:58:19 +, Yuqi Gu wrote: > Currently PostgreSQL only implements hardware support for CRC32 checksums for > the x86_64 architecture. > Some ARMv8 (AArch64) CPUs implement the CRC32 extension which is > implemented by inline assembly, so they can also benefit from hardware > acceleration in IO-intensive workloads. > > I would like to propose the patch to optimize crc32c calculation with Arm64 > specific instructions. > The hardware-specific code implementation is used under #if defined > USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK. > And the performance is improved on platforms: cortex-A57, cortex-A72, > cortex-A73, etc. > > I'll create a CommitFests ticket for this submission. > Any comments or feedback are welcome. Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this codepath. > + > +AC_DEFUN([PGAC_ARM64CE_CRC32_INTRINSICS], > +[AC_CACHE_CHECK([for Arm64ce CRC32], > +[pgac_cv_arm64ce_crc32_intrinsics], > +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], > + [unsigned int arm_flag = 0; > +#if defined(__ARM_ARCH) && (__ARM_ARCH > 7) > + arm_flag = 1; > +#endif > + return arm_flag == 1;])], > + [pgac_cv_arm64ce_crc32_intrinsics="yes"], > + [pgac_cv_arm64ce_crc32_intrinsics="no"])]) > +if test x"$pgac_cv_arm64ce_crc32_intrinsics" = x"yes"; then > + pgac_arm64ce_crc32_intrinsics=yes > +fi > +])# PGAC_ARM64CE_CRC32_INTRINSICS I don't understand what this check is supposed to be doing? AC_LINK_IFELSE doesn't run the program, so I don't see this test achieving anything at all? > * Use slicing-by-8 algorithm. > diff --git a/src/port/pg_crc32c_choose.c b/src/port/pg_crc32c_choose.c > index 40bee67..d3682ad 100644 > --- a/src/port/pg_crc32c_choose.
Re: Optimize Arm64 crc32c implementation in Postgresql
On Thu, Mar 01, 2018 at 01:36:23PM -0800, Andres Freund wrote: > On 2018-01-10 15:09:16 +0900, Michael Paquier wrote: >> There are not enough patches for ARM. > > Nah, not needing arch specific patches is good ;) You know already that I am always impressed by your skills in normalizing things where needed. -- Michael signature.asc Description: PGP signature
Re: Optimize Arm64 crc32c implementation in Postgresql
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote: > So... that stuff probably needs either a configure check for the > getauxval function and/or those headers, or an OS check? It'd probably be better to not rely on os specific headers, and instead directly access the capabilities. > While I'm looking at this: > > -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) > +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << > 31)) > > Why? Doesn't something << 62 have the same value and type as > (something << 31) << 31? > + --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run] > > What is this for? Those probably are damage from using a distribution autoconf rather than stock autoconf. Greetings, Andres Freund
Re: Optimize Arm64 crc32c implementation in Postgresql
On Fri, Mar 2, 2018 at 10:36 AM, Andres Freundwrote: > On 2018-01-10 05:58:19 +, Yuqi Gu wrote: >> +#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK >> +#include >> +#include >> +#ifndef HWCAP_CRC32 >> +#define HWCAP_CRC32 (1 << 7) >> +#endif > >> +static bool >> +pg_crc32c_arm64ce_available(void) { >> + unsigned long auxv = getauxval(AT_HWCAP); >> + return (auxv & HWCAP_CRC32) != 0; >> +} >> + >> +#else > > What's the availability of these headers and functions on non-linux platforms? FWIW I don't think that'll work on FreeBSD. I don't have an arm64 system to test on right now, but I can see that there is no getauxval() like glibc's. FreeBSD *might* provide the same sort of information via procstat_getauxv() from libprocstat, but I think maybe not because I don't see any trace of HWCAP_CRC32 in the tree and I see a different approach to testing the CPU ID registers in eg libkern/crc32.c. So... that stuff probably needs either a configure check for the getauxval function and/or those headers, or an OS check? While I'm looking at this: -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) +#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31)) Why? Doesn't something << 62 have the same value and type as (something << 31) << 31? + --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run] What is this for? +if (length & sizeof(uint16)) { +CRC32CH(crc32_c, *(uint16*)p_buf); +p_buf += sizeof(uint16); +} + +if (length & sizeof(uint8)) { +CRC32CB(crc32_c, *p_buf); +} >From the department of trivialities, our coding style has braces like this: if (length & sizeof(uint16)) { CRC32CH(crc32_c, *(uint16*)p_buf); p_buf += sizeof(uint16); } if (length & sizeof(uint8)) CRC32CB(crc32_c, *p_buf); -- Thomas Munro http://www.enterprisedb.com
Re: Optimize Arm64 crc32c implementation in Postgresql
Hi, On 2018-01-10 05:58:19 +, Yuqi Gu wrote: > Currently PostgreSQL only implements hardware support for CRC32 checksums for > the x86_64 architecture. > Some ARMv8 (AArch64) CPUs implement the CRC32 extension which is implemented > by inline assembly, > so they can also benefit from hardware acceleration in IO-intensive workloads. > > I would like to propose the patch to optimize crc32c calculation with Arm64 > specific instructions. > The hardware-specific code implementation is used under #if defined > USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK. > And the performance is improved on platforms: cortex-A57, cortex-A72, > cortex-A73, etc. > > I'll create a CommitFests ticket for this submission. > Any comments or feedback are welcome. Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this codepath. > + > +AC_DEFUN([PGAC_ARM64CE_CRC32_INTRINSICS], > +[AC_CACHE_CHECK([for Arm64ce CRC32], [pgac_cv_arm64ce_crc32_intrinsics], > +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], > + [unsigned int arm_flag = 0; > +#if defined(__ARM_ARCH) && (__ARM_ARCH > 7) > + arm_flag = 1; > +#endif > + return arm_flag == 1;])], > + [pgac_cv_arm64ce_crc32_intrinsics="yes"], > + [pgac_cv_arm64ce_crc32_intrinsics="no"])]) > +if test x"$pgac_cv_arm64ce_crc32_intrinsics" = x"yes"; then > + pgac_arm64ce_crc32_intrinsics=yes > +fi > +])# PGAC_ARM64CE_CRC32_INTRINSICS I don't understand what this check is supposed to be doing? AC_LINK_IFELSE doesn't run the program, so I don't see this test achieving anything at all? > * Use slicing-by-8 algorithm. > diff --git a/src/port/pg_crc32c_choose.c b/src/port/pg_crc32c_choose.c > index 40bee67..d3682ad 100644 > --- a/src/port/pg_crc32c_choose.c > +++ b/src/port/pg_crc32c_choose.c > @@ -29,6 +29,20 @@ > > #include "port/pg_crc32c.h" > > +#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK > +#include > +#include > +#ifndef HWCAP_CRC32 > +#define HWCAP_CRC32 (1 << 7) > +#endif > +static bool > +pg_crc32c_arm64ce_available(void) { > + unsigned long auxv = getauxval(AT_HWCAP); > + return (auxv & HWCAP_CRC32) != 0; > +} > + > +#else What's the availability of these headers and functions on non-linux platforms? > +#if defined(USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK) > +asm(".arch_extension crc"); So this emitted globally? > +#define LDP(x,y,p) asm("ldp %x[a], %x[b], [%x[c]], #16" : > [a]"=r"(x),[b]"=r"(y),[c]"+r"(p)) > +/* CRC32C: Castagnoli polynomial 0x1EDC6F41 */ > +#define CRC32CX(crc,value) asm("crc32cx %w[c], %w[c], %x[v]" : > [c]"+r"(*) : [v]"r"(+value)) > +#define CRC32CW(crc,value) asm("crc32cw %w[c], %w[c], %w[v]" : > [c]"+r"(*) : [v]"r"(+value)) > +#define CRC32CH(crc,value) asm("crc32ch %w[c], %w[c], %w[v]" : > [c]"+r"(*) : [v]"r"(+value)) > +#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : > [c]"+r"(*) : [v]"r"(+value)) > + > +pg_crc32c > +pg_comp_crc32c_arm64(pg_crc32c crc, const void* data, size_t len) { > + uint64 p0, p1; > + pg_crc32c crc32_c = crc; > + long length = len; > + const unsigned char *p_buf = data; > + > + /* Allow crc instructions in asm */ > + asm(".cpu generic+crc"); Hm, this switches it for the rest of the function, program, ...? Greetings, Andres Freund
Re: Optimize Arm64 crc32c implementation in Postgresql
On Wed, Jan 10, 2018 at 05:58:19AM +, Yuqi Gu wrote: > I would like to propose the patch to optimize crc32c calculation with > Arm64 specific instructions. The hardware-specific code > implementation is used under #if defined > USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK. And the performance is improved > on platforms: cortex-A57, cortex-A72, cortex-A73, etc. > > I'll create a CommitFests ticket for this submission. > Any comments or feedback are welcome. Nice! I have not looked at your patch, but +1. There are not enough patches for ARM. -- Michael signature.asc Description: PGP signature