Re: Popcount optimization using AVX512

2024-04-23 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > Makes sense, thanks. I'm planning to commit this fix sometime early next > week. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 10:11:08PM +, Devulapalli, Raghuveer wrote: >> On that note, is it necessary to also check for avx512f? At the moment, >> we are assuming that's supported if the other AVX-512 instructions are >> available. > > No, it's not needed. There are no CPU's with

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> On that note, is it necessary to also check for avx512f? At the moment, we > are assuming that's supported if the other AVX-512 instructions are available. No, it's not needed. There are no CPU's with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 09:29:55PM +, Devulapalli, Raghuveer wrote: > (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise > zmm_regs_available() will return false.. Yes, that's a mistake. I fixed that in v3. > (2) Nitpick: avx512_popcnt_available and avx512_bw_available()

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> Thanks for the feedback. I've attached an updated patch. (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false. (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 08:24:03PM +, Devulapalli, Raghuveer wrote: >> This seems to contradict the note about doing step 3 at any point, and >> given step 1 is the OSXSAVE check, I'm not following what this means, >> anyway. > > It is recommended that you run the xgetbv code before you check

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 06:12:22PM +, Shankaran, Akash wrote: > Good find. I confirmed after speaking with an intel expert, and from the > intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From > the manual: > > "Prior to using Intel AVX, the application must identify

RE: Popcount optimization using AVX512

2024-04-18 Thread Shankaran, Akash
> It was brought to my attention [0] that we probably should be checking for > the OSXSAVE bit instead of the XSAVE bit when determining whether there's > support for the XGETBV instruction. IIUC that should indicate that both the > OS and the processor have XGETBV support (not just the

Re: Popcount optimization using AVX512

2024-04-17 Thread Nathan Bossart
It was brought to my attention [0] that we probably should be checking for the OSXSAVE bit instead of the XSAVE bit when determining whether there's support for the XGETBV instruction. IIUC that should indicate that both the OS and the processor have XGETBV support (not just the processor). I've

Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart writes: > On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: >> The Intel documentation for _mm256_undefined_si256() [0] >> indicates that it is intended to return "undefined elements," so it seems >> like the use of an uninitialized variable might be intentional. >

Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: > The Intel documentation for _mm256_undefined_si256() [0] > indicates that it is intended to return "undefined elements," so it seems > like the use of an uninitialized variable might be intentional. See also

Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote: > Today's Coverity run produced this warning, which seemingly was > triggered by one of these commits, but I can't make much sense > of it: > > *** CID 1596255: Uninitialized variables (UNINIT) >

Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart writes: > Here is what I have staged for commit, which I intend to do shortly. Today's Coverity run produced this warning, which seemingly was triggered by one of these commits, but I can't make much sense of it: *** CID 1596255: Uninitialized variables (UNINIT)

Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote: > Here is what I have staged for commit, which I intend to do shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote: > On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: >> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: >> > Won't Valgrind complain about this? >> > >> > +pg_popcount_avx512(const char *buf, int bytes) >> > >> > + buf =

Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: > > On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > > Won't Valgrind complain about this? > > > > +pg_popcount_avx512(const char *buf, int bytes) > > > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > > > + val =

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > Won't Valgrind complain about this? > > +pg_popcount_avx512(const char *buf, int bytes) > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); I haven't

Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart wrote: > This seems to provide a small performance boost, so I've incorporated it > into v27. Won't Valgrind complain about this? +pg_popcount_avx512(const char *buf, int bytes) + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + val =

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: >> The main issue I saw was that clang was able to peel off the first >> iteration of the loop and then eliminate the mask assignment and >> replace masked load with a

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: > The main issue I saw was that clang was able to peel off the first > iteration of the loop and then eliminate the mask assignment and > replace masked load with a memory operand for vpopcnt. I was not able > to convince gcc to do that

Re: Popcount optimization using AVX512

2024-04-05 Thread Ants Aasma
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart wrote: > Here is an updated patch set. IMHO this is in decent shape and is > approaching committable. I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
Here is an updated patch set. IMHO this is in decent shape and is approaching committable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject:

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote: > Speaking of which, what does bumping up the inlined version threshold > to 16 do with and without AVX-512 available? Linearly extrapolating > the 2 and 4 byte numbers it might just come ahead in both cases, > making the choice easy.

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote: > On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: >> If we can verify this approach won't cause segfaults and can stomach the >> regression between 8 and 16 bytes, I'd happily pivot to this approach so >> that we can avoid the

Re: Popcount optimization using AVX512

2024-04-04 Thread Ants Aasma
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. The approach I posted does not

Re: Popcount optimization using AVX512

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. > > Thoughts? If we're worried

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote: > I committed v23-0001. Here is a rebased version of the remaining patches. > I intend to test the masking idea from Ants next. 0002 was missing a cast that is needed for the 32-bit builds. I've fixed that in v25. -- Nathan

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
I committed v23-0001. Here is a rebased version of the remaining patches. I intend to test the masking idea from Ants next. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote: > Sorry for the noise. I noticed a couple of silly mistakes immediately > after sending v21. Sigh... I missed a line while rebasing these patches, which seems to have grossly offended cfbot. Apologies again for the noise. --

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote: > In v21, 0001 is just the above inlining idea, which seems worth doing > independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've > modified similarly to 0001, i.e., I've inlined the "fast" version in the >

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: >> I don't like the double evaluation of the macro argument. Seems like >> you could get the same results more safely with >> >> static inline uint64 >>

Re: Popcount optimization using AVX512

2024-04-02 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2024-Apr-02, Nathan Bossart wrote: >>> Another idea I had is to turn pg_popcount() into a macro that just uses the >>> pg_number_of_ones array when called for few bytes: >>> >>> static inline uint64 >>>

Re: Popcount optimization using AVX512

2024-04-02 Thread Tom Lane
Alvaro Herrera writes: > On 2024-Apr-02, Nathan Bossart wrote: >> Another idea I had is to turn pg_popcount() into a macro that just uses the >> pg_number_of_ones array when called for few bytes: >> >> static inline uint64 >> pg_popcount_inline(const char *buf, int bytes) >> { >>

Re: Popcount optimization using AVX512

2024-04-02 Thread Alvaro Herrera
On 2024-Apr-02, Nathan Bossart wrote: > Another idea I had is to turn pg_popcount() into a macro that just uses the > pg_number_of_ones array when called for few bytes: > > static inline uint64 > pg_popcount_inline(const char *buf, int bytes) > { > uint64

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote: > Here is a v19 of the patch set. I moved out the refactoring of the > function pointer selection code to 0001. I think this is a good change > independent of $SUBJECT, and I plan to commit this soon. In 0002, I > changed the

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is

Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception.

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is

Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be

Re: Popcount optimization using AVX512

2024-04-01 Thread Alvaro Herrera
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const

Re: Popcount optimization using AVX512

2024-03-31 Thread Nathan Bossart
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote: > My current plan is to add some new tests for > pg_popcount() with many bytes, and then I'll give it a few more days for > any additional feedback before committing. Here is a v18 with a couple of new tests. Otherwise, it is the

Re: Popcount optimization using AVX512

2024-03-30 Thread Nathan Bossart
I used John Naylor's test_popcount module [0] to put together the attached graphs (note that the "small arrays" one is semi-logarithmic). For both graphs, the X-axis is the number of 64-bit words in the array, and Y-axis is the amount of time in milliseconds to run pg_popcount() on it 100,000

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Here's a v17 of the patch. This one has configure checks for everything (i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT availability, and we call XGETBV to ensure the ZMM registers are enabled). I

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote: >> +#if defined(HAVE__GET_CPUID) >> +__get_cpuid_count(7, 0, [0], [1], [2], [3]); >> +#elif defined(HAVE__CPUID) >> +__cpuidex(exx, 7, 0); > > Is there any reason we can't use __get_cpuid() and __cpuid() here, given > the

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote: > * If the compiler understands AVX512 intrinsics, we assume that it also > knows about the required CPUID and XGETBV intrinsics, and we assume that > the conditions for TRY_POPCNT_FAST are true. Bleh, cfbot's 32-bit build is

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Okay, here is a slightly different approach that I've dubbed the "maximum assumption" approach. In short, I wanted to see how much we could simplify the patch by making all possibly-reasonable assumptions about the compiler and CPU. These include: * If the compiler understands AVX512

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> A counterexample is the CRC32C code. AFAICT we assume the presence of > CPUID in that code (and #error otherwise). I imagine its probably safe to > assume the compiler understands CPUID if it understands AVX512 intrinsics, > but that is still mostly a guess. If AVX-512 intrinsics are

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > > We don't do MSVC via autoconf/Make. We used to have a special build > > framework for MSVC which parsed Makefiles to produce "solution" files, > > but it was removed as soon as Meson was mature enough to build. See > > commit

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote: > Nathan Bossart writes: >>> I see google web references to the xgetbv instruction as far back as 2009 >>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >>> _xgetbv() MSVC built-in. How far back do you need to go?

RE: Popcount optimization using AVX512

2024-03-29 Thread Shankaran, Akash
> From: Nathan Bossart > Sent: Friday, March 29, 2024 9:17 AM > To: Amonson, Paul D > On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the >> XGETBV instruction is. Unless I can assume that all x86_64 systems

Re: Popcount optimization using AVX512

2024-03-29 Thread Tom Lane
Nathan Bossart writes: >> I see google web references to the xgetbv instruction as far back as 2009 >> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >> _xgetbv() MSVC built-in. How far back do you need to go? > Hm. It seems unlikely that a compiler would understand

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote: > It might be nice if we conditionally built pg_popcount_avx512.o in autoconf > builds, too, but AFAICT we still need to wrap most of that code with > macros, so I'm not sure it's worth the trouble. I'll take another look at >

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the XGETBV >> instruction is. Unless I can assume that all x86_64 systems and compilers >> support that instruction, we might need an additional configure check and/or

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> -Original Message- > > Cool. I think we should run the benchmarks again to be safe, though. Ok, sure go ahead. :) > >> I forgot to mention that I also want to understand whether we can > >> actually assume availability of XGETBV when CPUID says we support > >> AVX512: > > > > You

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote: > I see in the meson.build you added the new file twice? > > @@ -7,6 +7,7 @@ pgport_sources = [ >'noblock.c', >'path.c', >'pg_bitutils.c', > + 'pg_popcount_avx512.c', >'pg_strong_random.c', >'pgcheckdir.c', >

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > We don't do MSVC via autoconf/Make. We used to have a special build > framework for MSVC which parsed Makefiles to produce "solution" files, > but it was removed as soon as Meson was mature enough to build. See > commit

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote: >> * I think we need to verify there isn't a huge performance regression for >> smaller arrays. IIUC those will still require an AVX512 instruction or >> two as well as a function call, which might add some noticeable overhead.

RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message- > From: Amonson, Paul D > Sent: Thursday, March 28, 2024 3:03 PM > To: Nathan Bossart > ... > I will review the new patch to see if there are anything that jumps out at me. I see in the meson.build you added the new file twice? @@ -7,6 +7,7 @@ pgport_sources = [

Re: Popcount optimization using AVX512

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amonson, Paul D wrote: > > -Original Message- > > From: Nathan Bossart > > Sent: Thursday, March 28, 2024 2:39 PM > > To: Amonson, Paul D > > > > * The latest patch set from Paul Amonson appeared to support MSVC in the > > meson build, but not the autoconf one. I

RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Thursday, March 28, 2024 2:39 PM > To: Amonson, Paul D > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote: > Here is a v14 of the patch that I think is beginning to approach something > committable. Besides general review and testing, there are two things that > I'd like to bring up: > > * The latest patch set from Paul Amonson appeared

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
Here is a v14 of the patch that I think is beginning to approach something committable. Besides general review and testing, there are two things that I'd like to bring up: * The latest patch set from Paul Amonson appeared to support MSVC in the meson build, but not the autoconf one. I don't

RE: Popcount optimization using AVX512

2024-03-27 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Wednesday, March 27, 2024 3:00 PM > To: Amonson, Paul D > > ... (I realize that I'm essentially > recanting much of my previous feedback, which I apologize for.) It happens. LOL As long as the algorithm for AVX-512 is not altered I

Re: Popcount optimization using AVX512

2024-03-27 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote: > On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: >> Ok, CI turned green after my re-post of the patches. Can this please get >> merged? > > Thanks for the new patches. I intend to take another look soon. Thanks

Re: Popcount optimization using AVX512

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: > Ok, CI turned green after my re-post of the patches. Can this please get > merged? Thanks for the new patches. I intend to take another look soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
s.postgresql.org > Subject: RE: Popcount optimization using AVX512 > Ok, CI turned green after my re-post of the patches. Can this please get merged? Thanks, Paul

Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway
On 3/25/24 11:12, Tom Lane wrote: "Amonson, Paul D" writes: I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current

RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message- > From: Tom Lane > Sent: Monday, March 25, 2024 8:12 AM > To: Amonson, Paul D > Cc: David Rowley ; Nathan Bossart > Subject: Re: Popcount optimization using AVX512 >... > Just for a note --- the cfbot will re-test existing patches every so of

Re: Popcount optimization using AVX512

2024-03-25 Thread Tom Lane
"Amonson, Paul D" writes: > I am re-posting the patches as CI for Mac failed (CI error not code/test > error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current cycle period seems to be about

RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
> -Original Message- > From: Amonson, Paul D > Sent: Thursday, March 21, 2024 12:18 PM > To: David Rowley > Cc: Nathan Bossart ; Andres Freund I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Thanks, Paul

RE: Popcount optimization using AVX512

2024-03-21 Thread Amonson, Paul D
> -Original Message- > From: David Rowley > Sent: Wednesday, March 20, 2024 5:28 PM > To: Amonson, Paul D > Cc: Nathan Bossart ; Andres Freund > > I'm not sure about this "extern negates inline" comment. It seems to me the > compiler is perfectly free to inline a static function into an

Re: Popcount optimization using AVX512

2024-03-20 Thread David Rowley
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D wrote: > Changed in this patch set. > > * Rebased. > * Direct *slow* calls via macros as shown in example patch. > * Changed the choose filename to be platform specific as suggested. > * Falls back to intermediate "Fast" methods if AVX512 is not

RE: Popcount optimization using AVX512

2024-03-20 Thread Amonson, Paul D
> -Original Message- > From: David Rowley > Sent: Tuesday, March 19, 2024 9:26 PM > To: Amonson, Paul D > > AMD's Zen4 also has AVX512, so it's misleading to indicate it's an Intel only > instruction. Also, writing the date isn't necessary as we have "git blame" Fixed. Thanks, Paul

Re: Popcount optimization using AVX512

2024-03-19 Thread David Rowley
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D wrote: > Changed in this patch set. Thanks for rebasing. I don't think there's any need to mention Intel in each of the following comments: +# Check for Intel AVX512 intrinsics to do POPCNT calculations. +# Newer Intel processors can use AVX-512

RE: Popcount optimization using AVX512

2024-03-19 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > > Committed. Thanks for the suggestion and for reviewing! > > Paul, I suspect your patches will need to be rebased after commit cc4826d. > Would you mind doing so? Changed in this patch set. * Rebased. * Direct *slow* calls via macros as

Re: Popcount optimization using AVX512

2024-03-19 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 12:30:50PM +1300, David Rowley wrote: > Looks good. Committed. Thanks for the suggestion and for reviewing! Paul, I suspect your patches will need to be rebased after commit cc4826d. Would you mind doing so? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 11:08, Nathan Bossart wrote: > > On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote: > > Agreed. Will send an updated patch shortly. > > As promised... Looks good. David

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote: > Agreed. Will send an updated patch shortly. As promised... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From b673663b1d1344549cbd0912220f96ba1712afc6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 18

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:27:58AM +1300, David Rowley wrote: > On Tue, 19 Mar 2024 at 10:08, Nathan Bossart wrote: >> On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote: >> > The only thing I'd question in the patch is in pg_popcount_fast(). It >> > looks like you've opted to not do

Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 10:08, Nathan Bossart wrote: > > On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote: > > The only thing I'd question in the patch is in pg_popcount_fast(). It > > looks like you've opted to not do the 32-bit processing on 32-bit > > machines. I think that's likely

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:22:43PM +, Amonson, Paul D wrote: >> The only reason I left it out was because I couldn't convince myself that it >> wasn't dead code, given we assume that popcntq is available in >> pg_popcount64_fast() today. But I don't see any harm in adding that just in >>

RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Monday, March 18, 2024 2:08 PM > To: David Rowley > Cc: Amonson, Paul D ; Andres Freund >... > > The only reason I left it out was because I couldn't convince myself that it > wasn't dead code, given we assume that popcntq is available

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote: > I looked at your latest patch and tried out the performance on a Zen4 > running windows and a Zen2 running on Linux. As follows: Thanks for taking a look. > The only thing I'd question in the patch is in pg_popcount_fast(). It >

Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 06:30, Nathan Bossart wrote: > Here is a more fleshed-out version of what I believe David is proposing. > On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the > test_popcount benchmark). I assume this is because this patch turns > pg_popcount() into

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 12:30:04PM -0500, Nathan Bossart wrote: > Here is a more fleshed-out version of what I believe David is proposing. > On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the > test_popcount benchmark). I assume this is because this patch turns >

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 05:28:32PM +, Amonson, Paul D wrote: > Question: I applied the patch for the drive_popcount* functions and > rebuilt. The resultant server complains that the function is missing. > What is the trick to make this work? You probably need to install the test_popcount

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 11:20:18AM -0500, Nathan Bossart wrote: > I don't think David was suggesting that we need to remove the runtime > checks for AVX512. IIUC he was pointing out that most of the performance > gain is from removing the function call overhead, which your v8-0002 patch > already

RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Monday, March 18, 2024 9:20 AM > ... > I don't think David was suggesting that we need to remove the runtime checks > for AVX512. IIUC he was pointing out that most of the performance gain is > from removing the function call overhead,

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:07:40PM +, Amonson, Paul D wrote: > Won't I still need the runtime checks? If I compile with a compiler > supporting the HW "feature" but run on HW without that feature, I will > want to avoid faults due to illegal operations. Won't that also affect > performance?

RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
Bossart > Sent: Monday, March 18, 2024 8:29 AM > To: David Rowley > Cc: Amonson, Paul D ; Andres Freund > ; Alvaro Herrera ; Shankaran, > Akash ; Noah Misch ; > Tom Lane ; Matthias van de Meent > ; pgsql-hackers@lists.postgresql.org > Subject: Re: Popcount optimization using A

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote: > Maybe it's worth exploring something along the lines of the attached > before doing the AVX512 stuff. It seems like a pretty good speed-up > and will apply for CPUs without AVX512 support. +1 -- Nathan Bossart Amazon Web Services:

Re: Popcount optimization using AVX512

2024-03-17 Thread David Rowley
On Sat, 16 Mar 2024 at 04:06, Nathan Bossart wrote: > I ran John Naylor's test_popcount module [0] with the following command on > an i7-1195G7: > > time psql postgres -c 'select drive_popcount(1000, 1024)' > > Without your patches, this seems to take somewhere around 8.8 seconds. >

RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
> -Original Message- > From: Amonson, Paul D > Sent: Friday, March 15, 2024 8:31 AM > To: Nathan Bossart ... > When I tested the code outside postgres in a micro benchmark I got 200- > 300% improvements. Your results are interesting, as it implies more than > 300% improvement. Let me do

RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
org > Subject: Re: Popcount optimization using AVX512 > > Which test suite did you run? Those numbers seem potentially > indistinguishable from noise, which probably isn't great for such a large > patch > set. I ran... psql -c "select bitcount(column) from table;&qu

Re: Popcount optimization using AVX512

2024-03-15 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 07:50:46PM +, Amonson, Paul D wrote: > As for new performance numbers: I just ran a full suite like I did > earlier in the process. My latest results an equivalent to a pgbench > scale factor 10 DB with the target column having varying column widths > and appropriate

RE: Popcount optimization using AVX512

2024-03-14 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Monday, March 11, 2024 6:35 PM > To: Amonson, Paul D > Thanks. There's no need to wait to post the AVX portion. I recommend using > "git format-patch" to construct the patch set for the lists. After exploring git format-patch

  1   2   >