Re: svn commit: r280279 - head/sys/sys
On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:
> Please try the attached patch.
>
> Jung-uk Kim
>
> Index: sys/amd64/amd64/pmap.c
> ===================================================================
> --- sys/amd64/amd64/pmap.c	(revision 281496)
> +++ sys/amd64/amd64/pmap.c	(working copy)
> @@ -412,7 +412,7 @@ static caddr_t crashdumpmap;
>  static void	free_pv_chunk(struct pv_chunk *pc);
>  static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
>  static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
> -static int	popcnt_pc_map_elem_pq(uint64_t elem);
> +static int	popcnt_pc_map(uint64_t *pc_map);
>  static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
>  static void	reserve_pv_entries(pmap_t pmap, int needed,
>  		    struct rwlock **lockp);
> @@ -2979,7 +2979,7 @@ retry:
>  }
>  
>  /*
> - * Returns the number of one bits within the given PV chunk map element.
> + * Returns the number of one bits within the given PV chunk map.
>   *
>   * The erratas for Intel processors state that "POPCNT Instruction May
>   * Take Longer to Execute Than Expected".  It is believed that the
> @@ -2994,12 +2994,21 @@ retry:
>   * 5th Gen Core: BDM85
>   */
>  static int
> -popcnt_pc_map_elem_pq(uint64_t elem)
> +popcnt_pc_map(uint64_t *pc_map)
>  {
> -	u_long result;
> +	u_long count, result;
> +	int field;
>  
> -	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
> -	    : "=r" (result) : "rm" (elem));
> +	result = 0;
> +	if ((cpu_feature2 & CPUID2_POPCNT) != 0)
> +		for (field = 0; field < _NPCM; field++) {
> +			__asm __volatile("xorl %k0, %k0; popcntq %1, %0"
> +			    : "=r" (count) : "m" (pc_map[field]));
> +			result += count;
> +		}
> +	else
> +		for (field = 0; field < _NPCM; field++)
> +			result += bitcount64(pc_map[field]);
>  	return (result);
>  }
>  
> @@ -3031,15 +3040,7 @@ reserve_pv_entries(pmap_t pmap, int needed, struct
>  retry:
>  	avail = 0;
>  	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
> -		if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
> -			free = bitcount64(pc->pc_map[0]);
> -			free += bitcount64(pc->pc_map[1]);
> -			free += bitcount64(pc->pc_map[2]);
> -		} else {
> -			free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
> -			free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
> -			free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
> -		}
> +		free = popcnt_pc_map(pc->pc_map);
>  		if (free == 0)
>  			break;
>  		avail += free;

Yes, this worked for me the same way as for you, the argument is taken
directly from memory, without a temporary spill.  Is this due to the
silly inliner?

Whatever the reason is, I think a comment should be added noting the
subtlety.  Otherwise, looks fine.

_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r280279 - head/sys/sys
On Mon, 20 Apr 2015, Konstantin Belousov wrote:

> On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:
>> Please try the attached patch.
>>
>> Index: sys/amd64/amd64/pmap.c
>> ===================================================================
>> --- sys/amd64/amd64/pmap.c	(revision 281496)
>> +++ sys/amd64/amd64/pmap.c	(working copy)
>> [the patch, quoted in full above, replacing popcnt_pc_map_elem_pq()
>> with a looping popcnt_pc_map() that uses an "m" constraint]
>
> Yes, this worked for me the same way as for you, the argument is taken
> directly from memory, without a temporary spill.  Is this due to the
> silly inliner?
>
> Whatever the reason is, I think a comment should be added noting the
> subtlety.  Otherwise, looks fine.

Erm, this looks silly.  It apparently works by making things too
complicated for the compiler to optimize (where one of the
optimizations actually gives pessimal spills).  Its main changes are:
- change the constraint from "rm" to "m".  This alone is not enough to
  get the compiler to use the natural memory operands without a trip
  through a register.
- un-unroll a loop.  This alone makes no difference in my simpler test
  environment.  The compiler un-un-unrolls.  (The compiler also inlines
  all the static function whether asked to or not.
gcc shouldn't do this inlining for the 3 calls, and gcc
-fno-inline-functions-called-once would never do it.  clang is missing
support for -fno-inline-functions-called-once, and gcc48 vanished with
a recent upgrade on freefall, so I couldn't test this case easily.
-fno-inline-functions-called-once should be the default for kernels.)
- some other changes apparently confused clang into doing the right
  thing.

It works better to change the constraint to "r":

X #include <sys/types.h>
X
X static __inline u_long
X popcntq(u_long mask)
X {
X 	u_long result;
X
X 	__asm __volatile("xorl %k0,%k0; popcntq %1,%0" :
X 	    "=r" (result) : "r" (mask));
X 	return (result);
X }
X
X u_long x[3];
X
X int
X test(void)
X {
X 	return (popcntq(x[0]) + popcntq(x[1]) + popcntq(x[2]));
X }
X ...
X 	.cfi_startproc
X # BB#0:                                 # %entry
X 	pushq	%rbp
X .Ltmp0:
X 	.cfi_def_cfa_offset 16
X .Ltmp1:
X 	.cfi_offset %rbp, -16
X 	movq	%rsp, %rbp
X .Ltmp2:
X 	.cfi_def_cfa_register %rbp
X 	movq	x(%rip), %rax
X #APP
X 	xorl	%ecx, %ecx
X 	popcntq	%rax, %rcx
X #NO_APP
X 	movq	x+8(%rip), %rax
X #APP
X 	xorl	%edx, %edx
X 	popcntq	%rax, %rdx
X #NO_APP
X 	addl	%ecx, %edx
X 	movq	x+16(%rip), %rcx
X #APP
X 	xorl	%eax, %eax
X 	popcntq	%rcx, %rax
X #NO_APP
X 	addl	%edx,
Re: svn commit: r280279 - head/sys/sys
On Tue, 21 Apr 2015, Bruce Evans wrote:

> On Mon, 20 Apr 2015, Konstantin Belousov wrote:
>> On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:
>>> Please try the attached patch.
>>> ...
>>> -	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
>>> -	    : "=r" (result) : "rm" (elem));
>>> ...
>>> +			__asm __volatile("xorl %k0, %k0; popcntq %1, %0"
>>> +			    : "=r" (count) : "m" (pc_map[field]));
>>> ...
>> Yes, this worked for me the same way as for you, the argument is taken
>> directly from memory, without a temporary spill.  Is this due to the
>> silly inliner?
>> Whatever the reason is, I think a comment should be added noting the
>> subtlety.  Otherwise, looks fine.
>
> Erm, this looks silly.  It apparently works by making things too
> complicated for the compiler to optimize (where one of the
> optimizations actually gives pessimal spills).  Its main changes are:
> ...
> It works better to change the constraint to "r":

It's even sillier than that.  The problem is not limited to this
function.  clang seems to prefer memory whenever you use the "rm"
constraint.

The silliest case is when you have a chain of simple asm functions.
Say the original popcntq() (without the xorl):

	return (popcntq(popcntq(popcntq(popcntq(popcntq(x))))));

gcc compiles this to 5 sequential popcntq instructions, but clang
spills the results of the first 4.  This is an old bug.  clang does
this on FreeBSD[9-11].  cc does this on FreeBSD[10-11] (not on
FreeBSD-9, since cc == gcc there).

Asms should always use "rm" if "m" works.  Ones in cpufunc.h always do,
except for lidt(), lldt() and ltr().  These 3 are fixed in my version.
So cpufunc.h almost always asks for the pessimization.

Bruce
Re: svn commit: r280279 - head/sys/sys
On Mon, Apr 13, 2015 at 12:36:50PM -0500, Alan Cox wrote:
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.  That's rather silly:
>
>     3570:	48 8b 48 18          	mov    0x18(%rax),%rcx
>     3574:	f6 04 25 00 00 00 00 	testb  $0x80,0x0
>     357b:	80
>     357c:	74 42                	je     35c0 <pmap_demote_pde_locked+0x2f0>
>     357e:	48 89 4d b8          	mov    %rcx,-0x48(%rbp)
>     3582:	31 c9                	xor    %ecx,%ecx
>     3584:	f3 48 0f b8 4d b8    	popcnt -0x48(%rbp),%rcx
>     358a:	48 8b 50 20          	mov    0x20(%rax),%rdx
>     358e:	48 89 55 b0          	mov    %rdx,-0x50(%rbp)
>     3592:	31 d2                	xor    %edx,%edx
>     3594:	f3 48 0f b8 55 b0    	popcnt -0x50(%rbp),%rdx
>     359a:	01 ca                	add    %ecx,%edx
>     359c:	48 8b 48 28          	mov    0x28(%rax),%rcx
>     35a0:	48 89 4d a8          	mov    %rcx,-0x58(%rbp)
>     35a4:	31 c9                	xor    %ecx,%ecx
>     35a6:	f3 48 0f b8 4d a8    	popcnt -0x58(%rbp),%rcx
>     35ac:	01 d1                	add    %edx,%ecx
>     35ae:	e9 12 01 00 00       	jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
>
> Caveat: I'm still using clang 3.5.  Maybe the newer clang doesn't
> spill?

3.6.0 generates similar code.
Re: svn commit: r280279 - head/sys/sys
Hi,

These CPUs are supposed to have loop unwinder / streaming hardware.  Is
it not unwinding/streaming this loop for us?

-a

On 13 April 2015 at 10:36, Alan Cox <a...@rice.edu> wrote:
On 03/30/2015 10:50, John Baldwin wrote:
On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
On Sat, 21 Mar 2015, John Baldwin wrote:
On 3/21/15 12:35 PM, Konstantin Belousov wrote:
On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
On 3/20/15 9:02 AM, Konstantin Belousov wrote:
On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:

Author: jhb
Date: Fri Mar 20 10:27:06 2015
New Revision: 280279
URL: https://svnweb.freebsd.org/changeset/base/280279

Log:
  Expand the bitcount* API to support 64-bit integers, plain ints and
  longs and create a "hidden" API that can be used in other system
  headers without adding namespace pollution.
  - If the POPCNT instruction is enabled at compile time, use
    __builtin_popcount*() to implement __bitcount*(), otherwise fall
    back to software implementations.

Are you aware of the Haswell errata HSD146?  I see the described
behaviour on machines back to SandyBridge, but not on Nehalems.

HSD146.  POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may
be delayed until previous non-dependent instructions have executed.

Jilles noted that gcc head and 4.9.2 already provides a workaround by
xoring the dst register.  I have some patch for amd64 pmap, see the end
of the message.

No, I was not aware, but I think it's hard to fix this anywhere but the
compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
    UINT16                  BitField)
{
    UINT8                   BitsSet;

    ACPI_FUNCTION_ENTRY ();

    for (BitsSet = 0; BitField; BitsSet++)
    {
        /* Zero the least significant bit that is set */

        BitField &= (UINT16) (BitField - 1);
    }
    return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system
failed to boot in older qemu because the kernel paniced with an illegal
instruction fault in this function.)

Does it do the same for the similar home made popcount in pmap?:

Yes:

807658d4:	f6 04 25 46 e2 d6 80 	testb  $0x80,0x80d6e246
807658db:	80
807658dc:	74 32                	je     80765910 <pmap_demote_pde_locked+0x4d0>
807658de:	48 89 4d b8          	mov    %rcx,-0x48(%rbp)
807658e2:	f3 48 0f b8 4d b8    	popcnt -0x48(%rbp),%rcx
807658e8:	48 8b 50 20          	mov    0x20(%rax),%rdx
807658ec:	48 89 55 b0          	mov    %rdx,-0x50(%rbp)
807658f0:	f3 48 0f b8 55 b0    	popcnt -0x50(%rbp),%rdx
807658f6:	01 ca                	add    %ecx,%edx
807658f8:	48 8b 48 28          	mov    0x28(%rax),%rcx
807658fc:	48 89 4d a8          	mov    %rcx,-0x58(%rbp)
80765900:	f3 48 0f b8 4d a8    	popcnt -0x58(%rbp),%rcx
80765906:	eb 1b                	jmp    80765923 <pmap_demote_pde_locked+0x4e3>
80765908:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
8076590f:	00
80765910:	f3 48 0f b8 c9       	popcnt %rcx,%rcx
80765915:	f3 48 0f b8 50 20    	popcnt 0x20(%rax),%rdx
8076591b:	01 ca                	add    %ecx,%edx
8076591d:	f3 48 0f b8 48 28    	popcnt 0x28(%rax),%rcx
80765923:	01 d1                	add    %edx,%ecx

It also uses popcnt for this in blist_fill() and blist_meta_fill():

    742                 /* Count the number of blocks we're about to allocate */
    743                 bitmap = scan->u.bmu_bitmap & mask;
    744                 for (nblks = 0; bitmap != 0; nblks++)
    745                         bitmap &= bitmap - 1;

Always using new API would lose the micro-optimizations given by the
runtime decision for default CFLAGS (used by distributions for
portability).  To keep them, it seems best to keep the inline asm but
replace popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount
can then be used to work around slowness in the software (that is
actually hardware) case.

I'm not sure if bitcount64() is strictly better than the loop in this
case even though it is O(1) given the claimed nature of the values in
the comment.

I checked.  Even with zeroes being more common than ones, bitcount64()
is faster than the simple loop.  Using bitcount64, reserve_pv_entries()
takes on average 4265 cycles during buildworld on my test machine.  In
contrast, with the simple loop, it takes on average 4507 cycles.  Even
though bitcount64 is a lot larger than the simple loop, we do the 3 bit
count operations many times in a loop, so the extra i-cache
Re: svn commit: r280279 - head/sys/sys
On 04/13/2015 13:36, Alan Cox wrote:
> [...]
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.  That's rather silly:
>
>     3570:	48 8b 48 18          	mov    0x18(%rax),%rcx
>     3574:	f6 04 25 00 00 00 00 	testb
Re: svn commit: r280279 - head/sys/sys
On 03/30/2015 10:50, John Baldwin wrote:
> [...]
> I'm not sure if bitcount64() is strictly better than the loop in this
> case even though it is O(1) given the claimed nature of the values in
> the comment.

I checked.  Even with zeroes being more common than ones, bitcount64()
is faster than the simple loop.  Using bitcount64, reserve_pv_entries()
takes on average 4265 cycles during buildworld on my test machine.  In
contrast, with the simple loop, it takes on average 4507 cycles.

Even though bitcount64 is a lot larger than the simple loop, we do the
3 bit count operations many times in a loop, so the extra i-cache
misses are being made up for by the repeated execution of the faster
code.

However, in the popcnt case, we are spilling the bit map to memory in
order to popcnt it.  That's rather silly:
Re: svn commit: r280279 - head/sys/sys
On Tue, Mar 31, 2015 at 03:49:28PM +1100, Bruce Evans wrote:
> It looks a bit overengineered to me.  A bit like my function pointers
> for the bcopy() family on i386.  bcopy() is a bulk operation, so in
> theory you can do it much faster by selecting the best available
> version at runtime.  In practice, the gains were not large and are
> too machine-dependent to maintain.  It is even harder to get large
> gains and maintain them by selecting individual instructions at
> runtime.

Yes, it is similar to bcopy.  The difference in motivation is that the
IFUNCs are for features, not for speed.  The existing patch already
demonstrates this WRT self-snoop and different methods of saving FPU
state.
Re: svn commit: r280279 - head/sys/sys
On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
> [...]
> Always using new API would lose the micro-optimizations given by the
> runtime decision for default CFLAGS (used by distributions for
> portability).  To keep them, it seems best to keep the inline asm but
> replace popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount
> can then be used to work around slowness in the software (that is
> actually hardware) case.

I'm not sure if bitcount64() is strictly better than the loop in this
case even though it is O(1) given the claimed nature of the values in
the comment.

-- 
John Baldwin
Re: svn commit: r280279 - head/sys/sys
On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:
> On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
>> Always using new API would lose the micro-optimizations given by the
>> runtime decision for default CFLAGS (used by distributions for
>> portability).  To keep them, it seems best to keep the inline asm but
>> replace popcnt_pc_map_elem(elem) by __bitcount64(elem).
>> -mno-popcount can then be used to work around slowness in the
>> software (that is actually hardware) case.
> So anybody has to compile his own kernel to get popcnt optimization?
> We do care about trivial things that improve time.

That is not what Bruce said.  He suggested using bitcount64() for the
fallback if the cpuid check fails.  He did not say to remove the
runtime check to use popcnt if it is available:

"Always using [bitcount64] would lose the micro-optimization... [to]
keep [it], it seems best to keep the inline asm but replace
popcnt_pc_map_elem(elem) by [bitcount64(elem)]."

> BTW, I have the following WIP change, which popcnt xorl is a piece
> of.  It emulates the ifuncs with some preprocessing mess.  It is much
> better than runtime patching, and is a prerequisite to properly
> support more things, like SMAP.  I did not publish it earlier, since
> I wanted to convert the TLB flush code to this.

This looks fine to me.  It seems to be manually converting certain
symbols to use a dynamic lookup that must be explicitly resolved before
first use?  I agree that this seems cleaner than runtime patching.

-- 
John Baldwin
Re: svn commit: r280279 - head/sys/sys
On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote: On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote: On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote: Always using new API would lose the micro-optimizations given by the runtime decision for default CFLAGS (used by distributions for portability). To keep them, it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem) by __bitcount64(elem). -mno-popcount can then be used to work around slowness in the software (that is actually hardware) case. So anybody has to compile his own kernel to get popcnt optimization ? We do care about trivial things that improve time. That is not what Bruce said. He suggested using bitcount64() for the fallback if the cpuid check fails. He did not say to remove the runtime check to use popcnt if it is available: Always using [bitcount64] would lose the micro-optimization... [to] keep [it], it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem) by [bitcount64(elem)]. Ok, thank you for the clarification. I updated the pmap patch, see the end of the message. BTW, I have the following WIP change, which popcnt xorl is a piece of. It emulates the ifuncs with some preprocessing mess. It is much better than runtime patching, and is a prerequisite to properly support more things, like SMAP. I did not published it earlier, since I wanted to convert TLB flush code to this. This looks fine to me. It seems to be manually converting certain symbols to use a dynamic lookup that must be explicitly resolved before first use? I am not sure what do you mean by dynamic lookup, but possibly it was mentioned. I can emulate the ifuncs more sincerely, by requiring a resolver function, which is called on the first real function invocation. I did not see it as very useful, but it is definitely doable. 
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c index 6a4077c..fcfba56 100644 --- a/sys/amd64/amd64/pmap.c +++ b/sys/amd64/amd64/pmap.c @@ -412,7 +416,7 @@ static caddr_t crashdumpmap; static voidfree_pv_chunk(struct pv_chunk *pc); static voidfree_pv_entry(pmap_t pmap, pv_entry_t pv); static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp); -static int popcnt_pc_map_elem(uint64_t elem); +static int popcnt_pc_map_elem_pq(uint64_t elem); static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp); static voidreserve_pv_entries(pmap_t pmap, int needed, struct rwlock **lockp); @@ -2980,20 +3002,27 @@ retry: /* * Returns the number of one bits within the given PV chunk map element. + * + * The erratas for Intel processors state that POPCNT Instruction May + * Take Longer to Execute Than Expected. It is believed that the + * issue is the spurious dependency on the destination register. + * Provide a hint to the register rename logic that the destination + * value is overwritten, by clearing it, as suggested in the + * optimization manual. It should be cheap for unaffected processors + * as well. + * + * Reference numbers for erratas are + * 4th Gen Core: HSD146 + * 5th Gen Core: BDM85 */ static int -popcnt_pc_map_elem(uint64_t elem) +popcnt_pc_map_elem_pq(uint64_t elem) { - int count; + u_long result; - /* -* This simple method of counting the one bits performs well because -* the given element typically contains more zero bits than one bits. 
-	 */
-	count = 0;
-	for (; elem != 0; elem &= elem - 1)
-		count++;
-	return (count);
+	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+	    : "=r" (result) : "rm" (elem));
+	return (result);
 }
 
 /*
@@ -3025,13 +3054,13 @@ retry:
 	avail = 0;
 	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
 		if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-			free = popcnt_pc_map_elem(pc->pc_map[0]);
-			free += popcnt_pc_map_elem(pc->pc_map[1]);
-			free += popcnt_pc_map_elem(pc->pc_map[2]);
+			free = bitcount64(pc->pc_map[0]);
+			free += bitcount64(pc->pc_map[1]);
+			free += bitcount64(pc->pc_map[2]);
 		} else {
-			free = popcntq(pc->pc_map[0]);
-			free += popcntq(pc->pc_map[1]);
-			free += popcntq(pc->pc_map[2]);
+			free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
 		}
 		if (free == 0)
 			break;

___
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
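For readers who want to exercise the xorl workaround outside the kernel, here is a hedged, self-contained sketch (the function name `popcnt64` is hypothetical; the kernel additionally gates the asm path on CPUID2_POPCNT at run time rather than at compile time):

```c
#include <stdint.h>

static inline uint64_t
popcnt64(uint64_t x)
{
#if defined(__x86_64__) && defined(__POPCNT__)
	uint64_t r;

	/*
	 * Clear the destination first so the register renamer sees it
	 * as overwritten, sidestepping the HSD146/BDM85 false
	 * dependency on the popcnt destination register.
	 */
	__asm__ __volatile__("xorl %k0,%k0; popcntq %1,%0"
	    : "=r" (r) : "rm" (x));
	return (r);
#else
	/* Portable fallback: clear the lowest set bit per iteration. */
	uint64_t n;

	for (n = 0; x != 0; x &= x - 1)
		n++;
	return (n);
#endif
}
```

Without -mpopcnt (so __POPCNT__ undefined) the portable branch compiles everywhere; with it, only the guarded asm is emitted.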
Re: svn commit: r280279 - head/sys/sys
On Monday, March 30, 2015 08:24:34 PM Konstantin Belousov wrote: That is not what Bruce said. He suggested using bitcount64() for the fallback if the cpuid check fails. He did not say to remove the runtime check to use popcnt if it is available: "Always using [bitcount64] would lose the micro-optimization... [to] keep [it], it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem) by [bitcount64(elem)]." Ok, thank you for the clarification. I updated the pmap patch, see the end of the message.

I think the pmap change looks fine. If we know which compilers include a workaround, we might also consider specifying -mno-popcount for everything except known-ok compilers in at least kern.mk.

This looks fine to me. It seems to be manually converting certain symbols to use a dynamic lookup that must be explicitly resolved before first use? I am not sure what you mean by a dynamic lookup, but possibly it was mentioned. I can emulate the ifuncs more sincerely, by requiring a resolver function, which is called on the first real function invocation. I did not see it as very useful, but it is definitely doable.

I just mean that the effect at runtime is similar to that of dynamic symbols once they are resolved (a call into a PLT entry (or is it GOT? I keep getting those confused) that does a jump to the resolved symbol).

-- John Baldwin
Re: svn commit: r280279 - head/sys/sys
On Mon, Mar 30, 2015 at 04:37:10PM -0400, John Baldwin wrote: On Monday, March 30, 2015 08:24:34 PM Konstantin Belousov wrote: That is not what Bruce said. He suggested using bitcount64() for the fallback if the cpuid check fails. He did not say to remove the runtime check to use popcnt if it is available: "Always using [bitcount64] would lose the micro-optimization... [to] keep [it], it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem) by [bitcount64(elem)]." Ok, thank you for the clarification. I updated the pmap patch, see the end of the message. I think the pmap change looks fine. If we know which compilers include a workaround we might also consider specifying -mno-popcount for everything except known-ok compilers in at least kern.mk.

Right now the compilers which implement the workaround are gcc 4.9.2 and gcc trunk, to be released as gcc 5.0. In-tree clang 3.6.0 does not try to eliminate the false dependency.

This looks fine to me. It seems to be manually converting certain symbols to use a dynamic lookup that must be explicitly resolved before first use? I am not sure what you mean by a dynamic lookup, but possibly it was mentioned. I can emulate the ifuncs more sincerely, by requiring a resolver function, which is called on the first real function invocation. I did not see it as very useful, but it is definitely doable. I just mean that the effect at runtime is similar to that of dynamic symbols once they are resolved (a call into a PLT entry (or is it GOT? I keep getting those confused) that does a jump to the resolved symbol).

PLT is slightly more expensive, since after the resolution it gives

	call	symbol@plt
	jmp	*symbol@gotpcrel

while this code results in

	call	*symbol_selector
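The "selector" dispatch Konstantin compares against a PLT can be sketched in portable C: a plain function pointer assigned once, after CPU features are known, so every later call is a single indirect call. All names here are hypothetical; the real WIP patch wraps this pattern in preprocessor macros inside the amd64 kernel:

```c
#include <stdint.h>

/* Two interchangeable implementations of the same operation. */
static int
popcnt_soft(uint64_t x)
{
	int n;

	for (n = 0; x != 0; x &= x - 1)
		n++;
	return (n);
}

static int
popcnt_hw(uint64_t x)
{
	/* Stand-in; a kernel would use the popcntq inline asm here. */
	return (__builtin_popcountll(x));
}

/*
 * The selector: assigned during early boot once the feature mask is
 * read, so callers pay only "call *popcnt_selector" -- comparable to
 * a resolved PLT entry, but one jump cheaper.
 */
static int (*popcnt_selector)(uint64_t) = popcnt_soft;

static void
popcnt_init(int cpu_has_popcnt)
{
	popcnt_selector = cpu_has_popcnt ? popcnt_hw : popcnt_soft;
}
```

A resolver-on-first-call variant, as discussed above, would instead initialize the pointer to a stub that performs the feature check, stores the real target, and tail-calls it.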
Re: svn commit: r280279 - head/sys/sys
On Mon, 30 Mar 2015, Konstantin Belousov wrote: On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote: On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote: ... So anybody has to compile his own kernel to get the popcnt optimization? We do care about trivial things that improve time.

I'm not sure we do. The microoptimizations give speed improvements in the 1-10% range, while general bloat gives speed unimprovements in the 10-500% range.

I updated the pmap patch, see the end of the message.

Seems OK, except for verbose names.

BTW, I have the following WIP change, which the popcnt xorl is a piece of. It emulates the ifuncs with some preprocessing mess. It is much better than runtime patching, and is a prerequisite to properly support more things, like SMAP. I did not publish it earlier, since I wanted to convert the TLB flush code to this. This looks fine to me. It seems to be manually converting certain symbols to use a dynamic lookup that must be explicitly resolved before first use?

It looks a bit overengineered to me. A bit like my function pointers for the bcopy() family on i386. bcopy() is a bulk operation, so in theory you can do it much faster by selecting the best available version at runtime. In practice, the gains were not large and are too machine-dependent to maintain. It is even harder to get large gains and maintain them by selecting individual instructions at runtime.

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..fcfba56 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -412,7 +416,7 @@ static caddr_t crashdumpmap;
 static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int	popcnt_pc_map_elem(uint64_t elem);
+static int	popcnt_pc_map_elem_pq(uint64_t elem);

popcnt_pc_map_elem_pq() is the most verbose name used here.
It is not necessary to encrypt a function's man page in its name.

@@ -2980,20 +3002,27 @@ retry:
 /*
  * Returns the number of one bits within the given PV chunk map element.
+ *
+ * The erratas for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for erratas are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem(uint64_t elem)
+popcnt_pc_map_elem_pq(uint64_t elem)

Here the function has a specialized use, but internally it is just the popcntq() cpufunc, except that it doesn't have the xorl workaround or the above documentation for the workaround.

 {
-	int count;
+	u_long result;
 
-	/*
-	 * This simple method of counting the one bits performs well because
-	 * the given element typically contains more zero bits than one bits.
-	 */
-	count = 0;
-	for (; elem != 0; elem &= elem - 1)
-		count++;
-	return (count);
+	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+	    : "=r" (result) : "rm" (elem));
+	return (result);
 }

Style:
- there should be a space after ';' in the asm.
- loading 0 should be in the operands. Then you don't need the %k modification. It should be able to load 0 using xorl iff that is best for the current arch. Perhaps a register is already 0, and it is best to do a null load so as to use that register.

The patch is missing removal of the popcntq() cpufunc. This function was only used here, and is now unused. cpufunc.h is supposed to contain only wrappers for single instructions, to be combined if necessary to create larger functions. popcntq() follows this rule (some other functions are broken).
popcntq() with the load of 0 wouldn't satisfy it strictly, but there is no other way to fix it since register allocation is unavailable across inline asms. Actually, if you move the load to the operands, it would follow the rule strictly enough -- many of the asms need more than a single instruction to load things.

Bruce
Re: svn commit: r280279 - head/sys/sys
On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote: On Sat, 21 Mar 2015, John Baldwin wrote: On 3/21/15 12:35 PM, Konstantin Belousov wrote:

X static int
X popcnt_pc_map_elem(uint64_t elem)
X {
X 	int count;
X
X 	/*
X 	 * This simple method of counting the one bits performs well because
X 	 * the given element typically contains more zero bits than one bits.
X 	 */
X 	count = 0;
X 	for (; elem != 0; elem &= elem - 1)
X 		count++;
X 	return (count);
X }

... Perhaps this appeared to perform well not for the reason stated in its comment, but because it was tested with excessive CFLAGS, and the compiler actually reduced it to popcntq, and the test machine didn't have the bug.

No, it probably did not, since popcntq does not exist on the generic amd64 machine.

In a game program that I wrote, efficiency of popcount() was actually important since it was used in inner loops. I used lookup tables for popcount() and a couple of application-specific functions and even for fls(). This seemed to be better than the existing bsf instruction, so it can't be bad for popcount(). The lookup was limited to bitsets of length 13 in the low bits of an int, so the table sizes were only 2**13 = 4096. Bitsets of length 64 would need multiple steps. However, if the above performs well without compiler optimizations, then it must be because usually only the lowest few bits are set. Then it would perform even better with even smaller (8-bit) tables combined with an early exit:

	count = 0;
	for (; elem != 0; elem >>= 8)
		count += lookup[elem & 0xff];
	return (count);

Many variations on this are possible. E.g., use fls() to decide where to start; or avoid all branches, and add up the results in parallel. For bitcount32, the latter is:

	bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32);
	bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16);
	bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8);
	bitcount8(x)  := lookup[x];

Compilers won't be able to optimize the lookup methods.
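Bruce's 8-bit table with early exit can be fleshed out as follows (a sketch of the idea, not his actual game-program code; the table is built at startup rather than hand-written):

```c
#include <stdint.h>

static uint8_t lookup[256];	/* lookup[i] = number of one bits in i */

static void
lookup_init(void)
{
	int i;

	/* Each value derives from the value with its low bit shifted out. */
	for (i = 1; i < 256; i++)
		lookup[i] = lookup[i / 2] + (i & 1);
}

static int
popcnt_table(uint64_t elem)
{
	int count;

	/*
	 * Early exit: the loop stops as soon as the remaining high
	 * bits are all zero, so sparse low-bit-heavy values cost only
	 * one or two table probes.
	 */
	for (count = 0; elem != 0; elem >>= 8)
		count += lookup[elem & 0xff];
	return (count);
}
```

The table costs 256 bytes; trading it up to a 16-bit table (64 KiB) halves the loop iterations but is usually a net loss from cache pressure.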
They might be able to convert the current bitcount*() to popcnt*. Last time I looked, clang but not gcc converted very complicated expressions for bswap*() to bswap* instructions.

There's no way we can preemptively locate every bit of C that clang might decide to replace with popcount and fix them to employ a workaround. I think the only viable solution is to use -mno-popcount on all compilers that aren't known to be fixed.

Why bother? It is only a short-lived and small(?) efficiency bug.

Both you and Bruce and Jilles state this, but I am unable to understand the argument about the compiler. Can somebody explain this to me, please?

It is also to avoid complications for short-lived optimizations. The 3 uses of popcntq() in amd64 pmap cost:
- 9 lines of inline asm.
- 19 lines for the C version
- 9 lines instead of 3 for the 3 uses

When popcntq is not available, the result is a pessimization since the optimizations are not complicated enough to be useful in that case.

popcnt is not available for machines older than Nehalems. Nobody cares about the last bits of the speed for such CPUs.

They give the runtime overhead of a branch to call the C version that is presumably worse than using the existing bitcount32() (possibly twice) or the compiler builtin. (jhb's cleanups didn't bother to optimize to use the builtin in all cases, since it is hard to tell if the builtin is any good if it is in software. If you remove the 19 lines for popcnt_pc_map_elem() and replace calls to it by __builtin_popcountl(), then the results would be:
- compile-time failure for old unsupported compilers that don't have this builtin
- usually, a link-time failure for gcc-4.2 through gcc-4.8. gcc generates a call to __popcountdi2 unless CFLAGS enables the hardware instruction. __popcountdi2 is in libgcc but not in libkern.
- usually, 3 inline copies of the same code that would be produced if FreeBSD's bitcount64() were used for clang. clang unrolls things excessively, giving enormous code.
Unrolling allows it to load the large constants in __bitcount64() only once, but large code is still needed to use these constants.
- runtime failures in misconfigured cases where CFLAGS doesn't match the hardware. Both gcc-4.8 and clang produce popcntq. The runtime check doesn't help since the compilers produce popcntq for the C case. Using __builtin_popcountl() asks for this directly. Using __bitcount64() asks for it indirectly, and gets it for clang. Using popcnt_pc_map_elem() may avoid getting it for clang, but apparently still gets it.)

When popcntq is available, the test to decide whether to use it is a small pessimization. It might take longer than a
Re: svn commit: r280279 - head/sys/sys
On 3/21/15 12:35 PM, Konstantin Belousov wrote: On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote: On 3/20/15 9:02 AM, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279

Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations.

Are you aware of the Haswell errata HSD146? I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message.

No, I was not aware, but I think it's hard to fix this anywhere but the compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
    UINT16                  BitField)
{
    UINT8                   BitsSet;

    ACPI_FUNCTION_ENTRY ();

    for (BitsSet = 0; BitField; BitsSet++)
    {
        /* Zero the least significant bit that is set */

        BitField &= (UINT16) (BitField - 1);
    }
    return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system failed to boot in older qemu because the kernel paniced with an illegal instruction fault in this function.) There's no way we can preemptively locate every bit of C that clang might decide to replace with popcount and fix them to employ a workaround.
I think the only viable solution is to use -mno-popcount on all compilers that aren't known to be fixed.

Both you and Bruce and Jilles state this, but I am unable to understand the argument about the compiler. Can somebody explain this to me, please?

If you compile with -mpopcount (or a march that includes it like -march=corei7) then the compiler will assume that popcount is available and will use it without doing any runtime checks. In the case of my sandybridge desktop, I have CPUTYPE set to corei7-avx in /etc/make.conf which adds -march=corei7-avx to CFLAGS, so the compiler assumes it can use POPCOUNT (as well as newer SSE and AVX). In this case the compiler recognized the pattern in the C function above as doing a population count and replaced the entire function with a POPCOUNT instruction even though the C code doesn't explicitly try to use it. However, if we know that a compiler doesn't employ the workaround for this errata, we can employ -mno-popcount so that using a custom CPUTYPE does not result in use of POPCOUNT except for known-ok compilers. Either that or we use a more elaborate set of #if checks in sys/types.h that only uses __builtin_popcount() for known-ok compilers (and possibly uses the clear-destination workaround for other compilers if -mpopcount is enabled).

Compiler cannot generate popcntq instructions unless it is directed to generate code not for amd64, but for some specific microarchitecture. Any use of bitcount in generic kernel or userspace is going to be a SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of the FreeBSD compilation environment.

Correct.

And even then, compiler must issue cpuid to fetch the feature mask, which arguably compiler cannot do due to the serialization instruction cost. Compiler could generate the call to instruction in the init code, but this is impossible for freestanding environment, since compiler does not know where to put init code.

No, the compiler does not have to check cpuid.
If you've told it to compile for a given CPU type, it has tables to hardcode the features known to be present on those CPUs and just uses them. If you use the wrong CPU type you get illegal instruction faults. :) (As I did in my qemu test case.) In regards to whether or not this should be a dynamic test instead, it seems a bit much to require a dynamic check for code that has to work in both userland and the kernel, and I'm not sure if something like CPU_COUNT() would actually benefit from that versus just always using the software solution instead. I agree. Hence why I opted to only use POPCOUNT if someone explicitly enables it via -mpopcount or -march flags. If you've told the compiler it's ok to use it without any checks, then it will use POPCOUNT, otherwise it will use the SWAR method. -- John Baldwin
Re: svn commit: r280279 - head/sys/sys
On 3/20/15 9:02 AM, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279

Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations.

Are you aware of the Haswell errata HSD146? I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message.

No, I was not aware, but I think it's hard to fix this anywhere but the compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
    UINT16                  BitField)
{
    UINT8                   BitsSet;

    ACPI_FUNCTION_ENTRY ();

    for (BitsSet = 0; BitField; BitsSet++)
    {
        /* Zero the least significant bit that is set */

        BitField &= (UINT16) (BitField - 1);
    }
    return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system failed to boot in older qemu because the kernel paniced with an illegal instruction fault in this function.) There's no way we can preemptively locate every bit of C that clang might decide to replace with popcount and fix them to employ a workaround. I think the only viable solution is to use -mno-popcount on all compilers that aren't known to be fixed.
In regards to whether or not this should be a dynamic test instead, it seems a bit much to require a dynamic check for code that has to work in both userland and the kernel, and I'm not sure if something like CPU_COUNT() would actually benefit from that versus just always using the software solution instead. -- John Baldwin
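The AcpiRsCountSetBits() loop quoted in this exchange is the classic "clear the lowest set bit" (Kernighan) idiom; stripped of the ACPI wrappers it reduces to this equivalent sketch (standard C types substituted for the ACPI typedefs):

```c
#include <stdint.h>

static uint8_t
count_set_bits(uint16_t bitfield)
{
	uint8_t bits_set;

	/*
	 * bitfield - 1 flips the least significant set bit and all
	 * bits below it; ANDing clears exactly that one bit.  The
	 * loop therefore runs once per one-bit -- cheap when the
	 * operand is sparse, which is why compilers pattern-match
	 * this whole loop into a single popcnt when allowed to.
	 */
	for (bits_set = 0; bitfield != 0; bits_set++)
		bitfield &= (uint16_t)(bitfield - 1);
	return (bits_set);
}
```

This is precisely the shape of loop that clang rewrote into POPCNT under -march=corei7-avx, triggering the illegal-instruction panic in qemu described above.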
Re: svn commit: r280279 - head/sys/sys
On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote: On 3/20/15 9:02 AM, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279

Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations.

Are you aware of the Haswell errata HSD146? I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message.

No, I was not aware, but I think it's hard to fix this anywhere but the compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
    UINT16                  BitField)
{
    UINT8                   BitsSet;

    ACPI_FUNCTION_ENTRY ();

    for (BitsSet = 0; BitField; BitsSet++)
    {
        /* Zero the least significant bit that is set */

        BitField &= (UINT16) (BitField - 1);
    }
    return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system failed to boot in older qemu because the kernel paniced with an illegal instruction fault in this function.) There's no way we can preemptively locate every bit of C that clang might decide to replace with popcount and fix them to employ a workaround. I think the only viable solution is to use -mno-popcount on all compilers that aren't known to be fixed.
Both you and Bruce and Jilles state this, but I am unable to understand the argument about the compiler. Can somebody explain this to me, please?

Compiler cannot generate popcntq instructions unless it is directed to generate code not for amd64, but for some specific microarchitecture. Any use of bitcount in generic kernel or userspace is going to be a SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of the FreeBSD compilation environment. And even then, compiler must issue cpuid to fetch the feature mask, which arguably compiler cannot do due to the serialization instruction cost. Compiler could generate the call to instruction in the init code, but this is impossible for freestanding environment, since compiler does not know where to put init code.

Only explicit uses of popcnt in the asm, with the CPU feature check around, are going to actually use the instruction in the generic code. And there, we may apply the workaround like I did for pmap.

In regards to whether or not this should be a dynamic test instead, it seems a bit much to require a dynamic check for code that has to work in both userland and the kernel, and I'm not sure if something like CPU_COUNT() would actually benefit from that versus just always using the software solution instead.

I agree.
Re: svn commit: r280279 - head/sys/sys
On Sat, Mar 21, 2015 at 09:33:22PM +1100, Bruce Evans wrote: On Sat, 21 Mar 2015, Konstantin Belousov wrote: How does the unconditional use of popcntq (in inline asm that is called from amd64 pmap) even work?

It is not unconditional. The pmap code does the following:

	if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
		free = popcnt_pc_map_elem(pc->pc_map[0]);
		...
	} else {
		free = popcntq(pc->pc_map[0]);
		...
	}

jhb's cleanups apparently missed this inline asm, while previous work should not have added it. This inline asm should never have existed, since compilers can generate the same code with less-incorrect configuration. Correct configuration would be messy. jhb's cleanups depend on a compiler predefine (__POPCNT__) to determine if the instruction can be used. But this definition is static and is usually wrong, since compilers default to plain amd64 which doesn't have the instruction. Dynamic configuration would have to use cpuid just to determine if the instruction exists.

Exactly, this is what is supposed to be done, and is done in the pmap. It is fine for somebody targeting a specific machine to change -march, but it cannot be used to optimize the generic kernel which is shipped.

Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message. I thought that that couldn't work since it uses the instruction unconditionally. But the old code already does that. IIRC, the patch never uses asm, but intentionally uses the popcount builtin to avoid complications. popcntq() in amd64/include/cpufunc.h does use asm, but is missing the complications needed for when the instruction is not available. All callers (just 3 in pmap) seem to be broken, since they are also missing the complications.

See above.

Still, it is weird to provide functions from the sys/types.h namespace, and even more weird to provide such a special-purpose function.
Not in the sys/types.h, but in the implementation namespace :-). Since no one should depend on the implementation details, we can move the details to a better place when one is known. Better places would be something like libkern for userland to hold a large collection of functions, and a Standard place for single functions. I expect the Standard place will be broken as designed for compatibility. Probably strings.h alongside ffs().

strings.h sounds less surprising than sys/types.h.

The special (inline/asm) functions in sys/types.h on amd64 are currently limited to just 3 bswap functions and some new popcnt functions.
Re: svn commit: r280279 - head/sys/sys
On Sat, 21 Mar 2015, Konstantin Belousov wrote: On Sat, Mar 21, 2015 at 09:42:51AM +1100, Bruce Evans wrote: On Fri, 20 Mar 2015, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279

Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations.

Are you aware of the Haswell errata HSD146?

I wasn't.

I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed.

If it only affects performance, then it is up to the compiler to fix it.

It affects performance on some cpu models. It is too wrong for compiler to issue cpuid before popcnt. Always issuing xorl before popcnt, as it is currently done by recent gcc, but not clang, is better, but still I think it is up to the code author to decide.

How does the unconditional use of popcntq (in inline asm that is called from amd64 pmap) even work? jhb's cleanups apparently missed this inline asm, while previous work should not have added it. This inline asm should never have existed, since compilers can generate the same code with less-incorrect configuration. Correct configuration would be messy. jhb's cleanups depend on a compiler predefine (__POPCNT__) to determine if the instruction can be used. But this definition is static and is usually wrong, since compilers default to plain amd64 which doesn't have the instruction.
Dynamic configuration would have to use cpuid just to determine if the instruction exists.

Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message.

I thought that that couldn't work since it uses the instruction unconditionally. But the old code already does that. IIRC, the patch never uses asm, but intentionally uses the popcount builtin to avoid complications. popcntq() in amd64/include/cpufunc.h does use asm, but is missing the complications needed for when the instruction is not available. All callers (just 3 in pmap) seem to be broken, since they are also missing the complications.

There is also a naming problem. The asm API is named popcntq like the builtins, but the APIs in the cleanup are named *bitcount*. We couldn't decide whether the cleanup should rename them to match the builtins.

- Use the existing bitcount16() and bitcount32() from sys/systm.h to implement the non-POPCNT __bitcount16() and __bitcount32() in sys/types.h.

Why is it in sys/types.h?

To make it easier to use, while minimizing namespace pollution and inefficiencies. Like the functions used to implement ntohl(), except the implementation is MI so it doesn't need to be in machine. (The functions used to implement ntohl() are in machine/endian.h. sys/types.h always includes that, so it makes little difference to pollution and inefficiency that the implementation is not more directly in machine/_types.h. bitcount is simpler and not burdened by compatibility, so it doesn't need a separate header.)

Still, it is weird to provide functions from the sys/types.h namespace, and even more weird to provide such a special-purpose function.

Not in the sys/types.h, but in the implementation namespace :-). Since no one should depend on the implementation details, we can move the details to a better place when one is known.
Better places would be something like libkern for userland to hold a large collection of functions, and a Standard place for single functions. I expect the Standard place will be broken as designed for compatibility. Probably strings.h alongside ffs(). The special (inline/asm) functions in sys/types.h on amd64 are currently limited to just 3 bswap functions and some new popcnt functions.

Bruce
Re: svn commit: r280279 - head/sys/sys
On Sat, 21 Mar 2015, John Baldwin wrote: On 3/21/15 12:35 PM, Konstantin Belousov wrote: On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote: On 3/20/15 9:02 AM, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279 Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations. Are you aware of the Haswell errata HSD146 ? I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message. No, I was not aware, but I think it's hard to fix this anywhere but the compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang uses POPCOUNT for this function from ACPI-CA: static UINT8 AcpiRsCountSetBits ( UINT16 BitField) { UINT8 BitsSet; ACPI_FUNCTION_ENTRY (); for (BitsSet = 0; BitField; BitsSet++) { /* Zero the least significant bit that is set */ BitField = (UINT16) (BitField - 1); } return (BitsSet); } (I ran into this accidentally because a kernel built on my system failed to boot in older qemu because the kernel paniced with an illegal instruction fault in this function.) Does it do the same for the similar home made popcount in pmap?: X /* X * Returns the number of one bits within the given PV chunk map element. 
X */ X static int X popcnt_pc_map_elem(uint64_t elem) X { X int count; X X /* X * This simple method of counting the one bits performs well because X * the given element typically contains more zero bits than one bits. X */ X count = 0; X for (; elem != 0; elem &= elem - 1) X count++; X return (count); X } This is actually the same method as in ACPI. It is just missing mounds of style bugs (mostly Windows spellings; also a bogus cast to UINT16; the cast has no effect). Perhaps this appeared to perform well not for the reason stated in its comment, but because it was tested with excessive CFLAGS, and the compiler actually reduced it to popcntq, and the test machine didn't have the bug. In a game program that I wrote, efficiency of popcount() was actually important since it was used in inner loops. I used lookup tables for popcount() and a couple of application-specific functions and even for fls(). This seemed to be better than the existing bsf instruction, so it can't be bad for popcount(). The lookup was limited to bitsets of length 13 in the low bits of an int, so the table sizes were only 2**13 = 8192. Bitsets of length 64 would need multiple steps. However, if the above performs well without compiler optimizations, then it must be because usually only the lowest few bits are set. Then it would perform even better with even smaller (8-bit) tables combined with an early exit: count = 0; for (; elem != 0; elem >>= 8) count += lookup[elem & 0xff]; return (count); Many variations on this are possible. E.g., use fls() to decide where to start; or avoid all branches, and add up the results in parallel. For bitcount64, the latter is: bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32); bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16); bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8); bitcount8(x) := lookup[x]; Compilers won't be able to optimize the lookup methods. They might be able to convert the current bitcount*() to popcnt*. 
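The 8-bit-table method with an early exit that Bruce describes above can be sketched as a small, self-contained program; this is a hedged reconstruction for illustration, not his actual game code, and the table name `lookup` and its initialization are assumptions:

```c
#include <stdint.h>

/* 256-entry table: lookup[i] holds the number of one bits in byte i. */
static uint8_t lookup[256];

static void
init_lookup(void)
{
	/* lookup[i >> 1] is already filled when we reach i, since i >> 1 < i. */
	for (int i = 1; i < 256; i++)
		lookup[i] = lookup[i >> 1] + (i & 1);
}

/*
 * Table-driven popcount: consume one byte per iteration and stop as
 * soon as the remaining high bits are all zero, which is the common
 * case Bruce mentions (mostly-empty words).
 */
static int
popcnt_table(uint64_t elem)
{
	int count;

	count = 0;
	for (; elem != 0; elem >>= 8)
		count += lookup[elem & 0xff];
	return (count);
}
```

A sparse 64-bit word with only low bits set exits after one table probe, which is why this can beat the branch-per-bit loop quoted above.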
Last time I looked, clang but not gcc converted very complicated expressions for bswap*() to bswap* instructions. There's no way we can preemptively locate every bit of C that clang might decide to replace with popcount and fix them to employ a workaround. I think the only viable solution is to use -mno-popcnt on all compilers that aren't known to be fixed. Why bother? It is only a short-lived and small(?) efficiency bug. Both you and Bruce and Jilles state this, but I am unable to understand the argument about the compiler. Can somebody explain this to me, please? It is also to avoid complications for short-lived optimizations. The 3 uses of popcntq() in amd64 pmap cost: - 9 lines of inline asm. - 19 lines for the C version - 9 lines instead of 3 for
svn commit: r280279 - head/sys/sys
Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279 Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations. - Use the existing bitcount16() and bitcount32() from sys/systm.h to implement the non-POPCNT __bitcount16() and __bitcount32() in sys/types.h. - For the non-POPCNT __bitcount64(), use a similar SWAR method on 64-bit systems. For 32-bit systems, use two __bitcount32() operations on the two halves. - Use __bitcount32() to provide a __bitcount() that operates on plain ints. - Use either __bitcount32() or __bitcount64() to provide a __bitcountl() that operates on longs. - Add public bitcount*() wrappers for __bitcount*() for use in the kernel in sys/libkern.h. - Use __bitcountl() instead of __builtin_popcountl() in BIT_COUNT(). 
Discussed with: bde Modified: head/sys/sys/bitset.h head/sys/sys/libkern.h head/sys/sys/systm.h head/sys/sys/types.h Modified: head/sys/sys/bitset.h == --- head/sys/sys/bitset.h Fri Mar 20 10:15:34 2015(r280278) +++ head/sys/sys/bitset.h Fri Mar 20 10:27:06 2015(r280279) @@ -182,7 +182,7 @@ \ __count = 0;\ for (__i = 0; __i < __bitset_words((_s)); __i++)\ - __count += __builtin_popcountl((p)->__bits[__i]); \ + __count += __bitcountl((p)->__bits[__i]); \ __count;\ }) Modified: head/sys/sys/libkern.h == --- head/sys/sys/libkern.h Fri Mar 20 10:15:34 2015(r280278) +++ head/sys/sys/libkern.h Fri Mar 20 10:27:06 2015(r280279) @@ -98,6 +98,11 @@ int flsl(long); #ifndef HAVE_INLINE_FLSLL int flsll(long long); #endif +#define bitcount64(x) __bitcount64((uint64_t)(x)) +#define bitcount32(x) __bitcount32((uint32_t)(x)) +#define bitcount16(x) __bitcount16((uint16_t)(x)) +#define bitcountl(x) __bitcountl((u_long)(x)) +#define bitcount(x) __bitcount((u_int)(x)) int fnmatch(const char *, const char *, int); int locc(int, char *, u_int); Modified: head/sys/sys/systm.h == --- head/sys/sys/systm.h Fri Mar 20 10:15:34 2015(r280278) +++ head/sys/sys/systm.h Fri Mar 20 10:27:06 2015(r280279) @@ -428,33 +428,6 @@ int alloc_unr_specific(struct unrhdr *uh int alloc_unrl(struct unrhdr *uh); void free_unr(struct unrhdr *uh, u_int item); -/* - * Population count algorithm using SWAR approach - * - SIMD Within A Register. 
- */ -static __inline uint32_t -bitcount32(uint32_t x) -{ - - x = (x & 0x55555555) + ((x & 0xaaaaaaaa) >> 1); - x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2); - x = (x + (x >> 4)) & 0x0f0f0f0f; - x = (x + (x >> 8)); - x = (x + (x >> 16)) & 0x000000ff; - return (x); -} - -static __inline uint16_t -bitcount16(uint32_t x) -{ - - x = (x & 0x5555) + ((x & 0xaaaa) >> 1); - x = (x & 0x3333) + ((x & 0xcccc) >> 2); - x = (x + (x >> 4)) & 0x0f0f; - x = (x + (x >> 8)) & 0x00ff; - return (x); -} - void intr_prof_stack_use(struct thread *td, struct trapframe *frame); #endif /* !_SYS_SYSTM_H_ */ Modified: head/sys/sys/types.h == --- head/sys/sys/types.h Fri Mar 20 10:15:34 2015(r280278) +++ head/sys/sys/types.h Fri Mar 20 10:27:06 2015(r280279) @@ -294,6 +294,68 @@ typedef _Bool bool; #include <sys/select.h> +#ifdef __POPCNT__ +#define __bitcount64(x) __builtin_popcountll((__uint64_t)(x)) +#define __bitcount32(x) __builtin_popcount((__uint32_t)(x)) +#define __bitcount16(x) __builtin_popcount((__uint16_t)(x)) +#define __bitcountl(x) __builtin_popcountl((unsigned long)(x)) +#define __bitcount(x) __builtin_popcount((unsigned int)(x)) +#else +/* + * Population count algorithm using SWAR approach + * - SIMD Within A Register. + */ +static __inline __uint16_t +__bitcount16(__uint16_t _x) +{ + + _x = (_x & 0x5555) + ((_x & 0xaaaa) >> 1); + _x = (_x & 0x3333) + ((_x & 0xcccc) >> 2); + _x = (_x + (_x >> 4)) & 0x0f0f; + _x = (_x + (_x >> 8)) & 0x00ff; + return (_x); +} + +static __inline __uint32_t +__bitcount32(__uint32_t _x) +{ +
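For readers reconstructing the mangled diff hunks above: the SWAR fallback sums adjacent bit fields of doubling width in parallel. A standalone sketch, with masks believed to match the committed sources (renamed here to avoid clashing with any system headers):

```c
#include <stdint.h>

/*
 * SWAR (SIMD Within A Register) population count: each step adds
 * neighboring 1-, 2-, 4-, 8-, then 16-bit fields side by side, so the
 * whole count needs only shifts, masks, and adds -- no branches.
 */
static uint32_t
swar_bitcount32(uint32_t x)
{
	x = (x & 0x55555555) + ((x & 0xaaaaaaaa) >> 1);
	x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2);
	x = (x + (x >> 4)) & 0x0f0f0f0f;
	x = (x + (x >> 8));
	x = (x + (x >> 16)) & 0x000000ff;
	return (x);
}

/* 64-bit count via two 32-bit halves, as the commit does on 32-bit systems. */
static uint32_t
swar_bitcount64(uint64_t x)
{
	return (swar_bitcount32((uint32_t)x) +
	    swar_bitcount32((uint32_t)(x >> 32)));
}
```

When `__POPCNT__` is defined the committed header bypasses all of this and maps `__bitcount*()` straight onto the `__builtin_popcount*()` family instead.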
Re: svn commit: r280279 - head/sys/sys
On Sat, Mar 21, 2015 at 09:42:51AM +1100, Bruce Evans wrote: On Fri, 20 Mar 2015, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279 Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations. Are you aware of the Haswell errata HSD146? I see the described behaviour I wasn't. on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. If it only affects performance, then it is up to the compiler to fix it. It affects performance on some cpu models. It is too wrong for the compiler to issue cpuid before popcnt. Always issuing xorl before popcnt, as it is currently done by recent gcc, but not clang, is better, but still I think it is up to the code author to decide. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message. IIRC, the patch never uses asm, but intentionally uses the popcount builtin to avoid complications. - Use the existing bitcount16() and bitcount32() from sys/systm.h to implement the non-POPCNT __bitcount16() and __bitcount32() in sys/types.h. Why is it in sys/types.h? To make it easier to use, while minimizing namespace pollution and inefficiencies. Like the functions used to implement ntohl(), except the implementation is MI so it doesn't need to be in <machine>. (The functions used to implement ntohl() are in <machine/endian.h>. 
sys/types.h always includes that, so it makes little difference to pollution and inefficiency that the implementation is not more directly in <machine/_types.h>.) bitcount is simpler and not burdened by compatibility, so it doesn't need a separate header.) Still, it is weird to provide functions from the sys/types.h namespace, and even more weird to provide such a special-purpose function.
Re: svn commit: r280279 - head/sys/sys
On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279 Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations. Are you aware of the Haswell errata HSD146? I see the described behaviour on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message. - Use the existing bitcount16() and bitcount32() from sys/systm.h to implement the non-POPCNT __bitcount16() and __bitcount32() in sys/types.h. Why is it in sys/types.h? - For the non-POPCNT __bitcount64(), use a similar SWAR method on 64-bit systems. For 32-bit systems, use two __bitcount32() operations on the two halves. - Use __bitcount32() to provide a __bitcount() that operates on plain ints. - Use either __bitcount32() or __bitcount64() to provide a __bitcountl() that operates on longs. - Add public bitcount*() wrappers for __bitcount*() for use in the kernel in sys/libkern.h. - Use __bitcountl() instead of __builtin_popcountl() in BIT_COUNT(). 
Discussed with: bde diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c index 6a4077c..f6fbc33 100644 --- a/sys/amd64/amd64/pmap.c +++ b/sys/amd64/amd64/pmap.c @@ -413,6 +417,7 @@ static void free_pv_chunk(struct pv_chunk *pc); static void free_pv_entry(pmap_t pmap, pv_entry_t pv); static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp); static int popcnt_pc_map_elem(uint64_t elem); +static int popcnt_pc_map_elem_pq(uint64_t elem); static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp); static void reserve_pv_entries(pmap_t pmap, int needed, struct rwlock **lockp); @@ -2997,6 +3020,29 @@ popcnt_pc_map_elem(uint64_t elem) } /* + * The erratas for Intel processors state that POPCNT Instruction May + * Take Longer to Execute Than Expected. It is believed that the + * issue is the spurious dependency on the destination register. + * Provide a hint to the register rename logic that the destination + * value is overwritten, by clearing it, as suggested in the + * optimization manual. It should be cheap for unaffected processors + * as well. + * + * Reference numbers for erratas are + * 4th Gen Core: HSD146 + * 5th Gen Core: BDM85 + */ +static int +popcnt_pc_map_elem_pq(uint64_t elem) +{ + u_long result; + + __asm __volatile("xorl %k0,%k0;popcntq %1,%0" + : "=r" (result) : "rm" (elem)); + return (result); +} + +/* * Ensure that the number of spare PV entries in the specified pmap meets or * exceeds the given count, needed. 
* @@ -3029,9 +3075,9 @@ retry: free += popcnt_pc_map_elem(pc->pc_map[1]); free += popcnt_pc_map_elem(pc->pc_map[2]); } else { - free = popcntq(pc->pc_map[0]); - free += popcntq(pc->pc_map[1]); - free += popcntq(pc->pc_map[2]); + free = popcnt_pc_map_elem_pq(pc->pc_map[0]); + free += popcnt_pc_map_elem_pq(pc->pc_map[1]); + free += popcnt_pc_map_elem_pq(pc->pc_map[2]); } if (free == 0) break;
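The reserve_pv_entries() hunk above only changes *how* each word is counted; the accounting itself is just a sum of one bits over the chunk's three map words, where each set bit marks a free PV entry. A portable sketch of that logic, with `_NPCM` and a software bit counter standing in for the kernel's own definitions:

```c
#include <stdint.h>

#define _NPCM	3	/* words per PV chunk map, as in amd64 pmap */

/* Software stand-in for popcntq/popcnt_pc_map_elem: clear lowest set bit. */
static int
bitcount64_sw(uint64_t x)
{
	int count;

	for (count = 0; x != 0; x &= x - 1)
		count++;
	return (count);
}

/*
 * Count the free PV entries described by one chunk: each set bit in
 * pc_map[] corresponds to one free entry, so the total is the popcount
 * of all _NPCM words.
 */
static int
pc_free_entries(const uint64_t pc_map[_NPCM])
{
	int field, free;

	free = 0;
	for (field = 0; field < _NPCM; field++)
		free += bitcount64_sw(pc_map[field]);
	return (free);
}
```

This is exactly the loop that kib's follow-up patch (quoted at the top of this digest) folds into a single popcnt_pc_map() helper that picks POPCNT or bitcount64() once per call.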
Re: svn commit: r280279 - head/sys/sys
On Fri, 20 Mar 2015, Konstantin Belousov wrote: On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote: Author: jhb Date: Fri Mar 20 10:27:06 2015 New Revision: 280279 URL: https://svnweb.freebsd.org/changeset/base/280279 Log: Expand the bitcount* API to support 64-bit integers, plain ints and longs and create a hidden API that can be used in other system headers without adding namespace pollution. - If the POPCNT instruction is enabled at compile time, use __builtin_popcount*() to implement __bitcount*(), otherwise fall back to software implementations. Are you aware of the Haswell errata HSD146? I see the described behaviour I wasn't. on machines back to SandyBridge, but not on Nehalems. HSD146. POPCNT Instruction May Take Longer to Execute Than Expected Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed. If it only affects performance, then it is up to the compiler to fix it. Jilles noted that gcc head and 4.9.2 already provides a workaround by xoring the dst register. I have some patch for amd64 pmap, see the end of the message. IIRC, the patch never uses asm, but intentionally uses the popcount builtin to avoid complications. - Use the existing bitcount16() and bitcount32() from sys/systm.h to implement the non-POPCNT __bitcount16() and __bitcount32() in sys/types.h. Why is it in sys/types.h? To make it easier to use, while minimizing namespace pollution and inefficiencies. Like the functions used to implement ntohl(), except the implementation is MI so it doesn't need to be in <machine>. (The functions used to implement ntohl() are in <machine/endian.h>. sys/types.h always includes that, so it makes little difference to pollution and inefficiency that the implementation is not more directly in <machine/_types.h>.) bitcount is simpler and not burdened by compatibility, so it doesn't need a separate header.) 
Bruce