Re: svn commit: r280279 - head/sys/sys

2015-04-20 Thread Konstantin Belousov
On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:
 Please try the attached patch.
 
 Jung-uk Kim

 Index: sys/amd64/amd64/pmap.c
 ===
 --- sys/amd64/amd64/pmap.c	(revision 281496)
 +++ sys/amd64/amd64/pmap.c	(working copy)
 @@ -412,7 +412,7 @@ static caddr_t crashdumpmap;
  static void  free_pv_chunk(struct pv_chunk *pc);
  static void  free_pv_entry(pmap_t pmap, pv_entry_t pv);
  static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
 -static int   popcnt_pc_map_elem_pq(uint64_t elem);
 +static int   popcnt_pc_map(uint64_t *pc_map);
  static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
  static void  reserve_pv_entries(pmap_t pmap, int needed,
   struct rwlock **lockp);
 @@ -2979,7 +2979,7 @@ retry:
  }
  
  /*
 - * Returns the number of one bits within the given PV chunk map element.
 + * Returns the number of one bits within the given PV chunk map.
   *
  * The erratas for Intel processors state that "POPCNT Instruction May
  * Take Longer to Execute Than Expected".  It is believed that the
 @@ -2994,12 +2994,21 @@ retry:
   * 5th Gen Core: BDM85
   */
  static int
 -popcnt_pc_map_elem_pq(uint64_t elem)
 +popcnt_pc_map(uint64_t *pc_map)
  {
 - u_long result;
 + u_long count, result;
 + int field;
  
 - __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
 - : "=r" (result) : "rm" (elem));
 + result = 0;
 + if ((cpu_feature2 & CPUID2_POPCNT) != 0)
 + for (field = 0; field < _NPCM; field++) {
 + __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
 + : "=r" (count) : "m" (pc_map[field]));
 + result += count;
 + }
 + else
 + for (field = 0; field < _NPCM; field++)
 + result += bitcount64(pc_map[field]);
   return (result);
  }
  
 @@ -3031,15 +3040,7 @@ reserve_pv_entries(pmap_t pmap, int needed, struct
  retry:
   avail = 0;
 TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
 - if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
 - free = bitcount64(pc->pc_map[0]);
 - free += bitcount64(pc->pc_map[1]);
 - free += bitcount64(pc->pc_map[2]);
 - } else {
 - free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
 - free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
 - free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
 - }
 + free = popcnt_pc_map(pc->pc_map);
   if (free == 0)
   break;
   avail += free;

Yes, this worked for me the same way as for you, the argument is taken
directly from memory, without temporary spill.  Is this due to silly
inliner ?  Whatever the reason is, I think a comment should be added
noting the subtlety.

Otherwise, looks fine.


Re: svn commit: r280279 - head/sys/sys

2015-04-20 Thread Bruce Evans

On Mon, 20 Apr 2015, Konstantin Belousov wrote:


On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:

Please try the attached patch.



Index: sys/amd64/amd64/pmap.c
===
--- sys/amd64/amd64/pmap.c  (revision 281496)
+++ sys/amd64/amd64/pmap.c  (working copy)
@@ -412,7 +412,7 @@ static caddr_t crashdumpmap;
 static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int popcnt_pc_map_elem_pq(uint64_t elem);
+static int popcnt_pc_map(uint64_t *pc_map);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void	reserve_pv_entries(pmap_t pmap, int needed,
struct rwlock **lockp);
@@ -2979,7 +2979,7 @@ retry:
 }

 /*
- * Returns the number of one bits within the given PV chunk map element.
+ * Returns the number of one bits within the given PV chunk map.
  *
  * The erratas for Intel processors state that "POPCNT Instruction May
  * Take Longer to Execute Than Expected".  It is believed that the
@@ -2994,12 +2994,21 @@ retry:
  * 5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem_pq(uint64_t elem)
+popcnt_pc_map(uint64_t *pc_map)
 {
-   u_long result;
+   u_long count, result;
+   int field;

-   __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
-   : "=r" (result) : "rm" (elem));
+   result = 0;
+   if ((cpu_feature2 & CPUID2_POPCNT) != 0)
+   for (field = 0; field < _NPCM; field++) {
+   __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
+   : "=r" (count) : "m" (pc_map[field]));
+   result += count;
+   }
+   else
+   for (field = 0; field < _NPCM; field++)
+   result += bitcount64(pc_map[field]);
return (result);
 }

@@ -3031,15 +3040,7 @@ reserve_pv_entries(pmap_t pmap, int needed, struct
 retry:
avail = 0;
TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
-   if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-   free = bitcount64(pc->pc_map[0]);
-   free += bitcount64(pc->pc_map[1]);
-   free += bitcount64(pc->pc_map[2]);
-   } else {
-   free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
-   free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
-   free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
-   }
+   free = popcnt_pc_map(pc->pc_map);
if (free == 0)
break;
avail += free;


Yes, this worked for me the same way as for you, the argument is taken
directly from memory, without temporary spill.  Is this due to silly
inliner ?  Whatever the reason is, I think a comment should be added
noting the subtlety.

Otherwise, looks fine.


Erm, this looks silly.  It apparently works by making things too complicated
for the compiler to optimize (where one of the optimizations actually
gives pessimal spills).  Its main changes are:
- change the constraint from "rm" to "m".  This alone is not enough to
  get the compiler to use the natural memory operands without a trip
  through a register.
- un-unroll a loop.  This alone makes no difference in my simpler test
  environment.  The compiler un-un-unrolls.  (The compiler also inlines
  all the static functions whether asked to or not.  gcc shouldn't do
  this inlining for the 3 calls, and gcc -fno-inline-functions-called-once
  would never do it.  clang is missing support for -fno-inline-functions-
  called-once, and gcc48 vanished with a recent upgrade on freefall,
  so I couldn't test this case easily.)  -fno-inline-functions-called-once
  should be the default for kernels.
- some other changes apparently confused clang into doing the right thing.

It works better to change the constraint to "r":

X #include <sys/types.h>
X 
X static __inline u_long
X popcntq(u_long mask)
X {
X 	u_long result;
X 
X 	__asm __volatile("xorl %k0,%k0; popcntq %1,%0" :
X 	    "=r" (result) : "r" (mask));
X 	return (result);
X }
X 
X u_long x[3];
X 
X int
X test(void)
X {
X 	return (popcntq(x[0]) + popcntq(x[1]) + popcntq(x[2]));
X }

X ...
X 	.cfi_startproc
X # BB#0:                                 # %entry
X 	pushq	%rbp
X .Ltmp0:
X 	.cfi_def_cfa_offset 16
X .Ltmp1:
X 	.cfi_offset %rbp, -16
X 	movq	%rsp, %rbp
X .Ltmp2:
X 	.cfi_def_cfa_register %rbp
X 	movq	x(%rip), %rax
X 	#APP
X 	xorl	%ecx, %ecx
X 	popcntq	%rax, %rcx
X 	#NO_APP
X 	movq	x+8(%rip), %rax
X 	#APP
X 	xorl	%edx, %edx
X 	popcntq	%rax, %rdx
X 	#NO_APP
X 	addl	%ecx, %edx
X 	movq	x+16(%rip), %rcx
X 	#APP
X 	xorl	%eax, %eax
X 	popcntq	%rcx, %rax
X 	#NO_APP
X 	addl	%edx, 

Re: svn commit: r280279 - head/sys/sys

2015-04-20 Thread Bruce Evans

On Tue, 21 Apr 2015, Bruce Evans wrote:


On Mon, 20 Apr 2015, Konstantin Belousov wrote:


On Mon, Apr 13, 2015 at 04:04:45PM -0400, Jung-uk Kim wrote:

Please try the attached patch.
...
-   __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
-   : "=r" (result) : "rm" (elem));
...
+   __asm __volatile("xorl %k0, %k0; popcntq %1, %0"
+   : "=r" (count) : "m" (pc_map[field]));

 ...
Yes, this worked for me the same way as for you, the argument is taken
directly from memory, without temporary spill.  Is this due to silly
inliner ?  Whatever the reason is, I think a comment should be added
noting the subtlety.

Otherwise, looks fine.


Erm, this looks silly.  It apparently works by making things too complicated
for the compiler to optimize (where one of the optimizations actually
gives pessimal spills).  Its main changes are:
...
It works better to change the constraint to "r":


It's even sillier than that.  The problem is not limited to this function.
clang seems to prefer memory whenever you use the "rm" constraint.  The
silliest case is when you have a chain of simple asm functions.  Say the
original popcntq (without the xorl):

return (popcntq(popcntq(popcntq(popcntq(popcntq(x));

gcc compiles this to 5 sequential popcntq instructions, but clang
spills the results of the first 4.

This is an old bug.  clang does this on FreeBSD[9-11].  cc does this
on FreeBSD[10-11] (not on FreeBSD-9, since cc = gcc there).

Asms should always use "rm" if "m" works.  Ones in cpufunc.h always
do except for lidt(), lldt() and ltr().  These 3 are fixed in my version.
So cpufunc.h almost always asks for the pessimization.

Bruce


Re: svn commit: r280279 - head/sys/sys

2015-04-13 Thread Konstantin Belousov
On Mon, Apr 13, 2015 at 12:36:50PM -0500, Alan Cox wrote:
 However, in the popcnt case, we are spilling the bit map to memory in
 order to popcnt it.  That's rather silly:
 
 3570:   48 8b 48 18             mov    0x18(%rax),%rcx
 3574:   f6 04 25 00 00 00 00    testb  $0x80,0x0
 357b:   80
 357c:   74 42                   je     35c0 <pmap_demote_pde_locked+0x2f0>
 357e:   48 89 4d b8             mov    %rcx,-0x48(%rbp)
 3582:   31 c9                   xor    %ecx,%ecx
 3584:   f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
 358a:   48 8b 50 20             mov    0x20(%rax),%rdx
 358e:   48 89 55 b0             mov    %rdx,-0x50(%rbp)
 3592:   31 d2                   xor    %edx,%edx
 3594:   f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
 359a:   01 ca                   add    %ecx,%edx
 359c:   48 8b 48 28             mov    0x28(%rax),%rcx
 35a0:   48 89 4d a8             mov    %rcx,-0x58(%rbp)
 35a4:   31 c9                   xor    %ecx,%ecx
 35a6:   f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
 35ac:   01 d1                   add    %edx,%ecx
 35ae:   e9 12 01 00 00          jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
 
 Caveat: I'm still using clang 3.5.  Maybe the newer clang doesn't spill?

3.6.0 generates similar code.


Re: svn commit: r280279 - head/sys/sys

2015-04-13 Thread Adrian Chadd
Hi,

These CPUs are supposed to have loop unwinder / streaming hardware. Is
it not unwinding/streaming this loop for us?



-a


On 13 April 2015 at 10:36, Alan Cox a...@rice.edu wrote:
 On 03/30/2015 10:50, John Baldwin wrote:
 On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
 On Sat, 21 Mar 2015, John Baldwin wrote:

 On 3/21/15 12:35 PM, Konstantin Belousov wrote:
 On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
 On 3/20/15 9:02 AM, Konstantin Belousov wrote:
 On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
 Author: jhb
 Date: Fri Mar 20 10:27:06 2015
 New Revision: 280279
 URL: https://svnweb.freebsd.org/changeset/base/280279

 Log:
   Expand the bitcount* API to support 64-bit integers, plain ints and 
 longs
   and create a hidden API that can be used in other system headers 
 without
   adding namespace pollution.
   - If the POPCNT instruction is enabled at compile time, use
 __builtin_popcount*() to implement __bitcount*(), otherwise fall 
 back
 to software implementations.
 Are you aware of the Haswell errata HSD146 ?  I see the described 
 behaviour
 on machines back to SandyBridge, but not on Nehalems.
 HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
 Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
 delayed until previous non-dependent instructions have executed.

 Jilles noted that gcc head and 4.9.2 already provides a workaround by
 xoring the dst register.  I have some patch for amd64 pmap, see the end
 of the message.
 No, I was not aware, but I think it's hard to fix this anywhere but the
 compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
 uses POPCOUNT for this function from ACPI-CA:

 static UINT8
 AcpiRsCountSetBits (
 UINT16  BitField)
 {
 UINT8   BitsSet;


 ACPI_FUNCTION_ENTRY ();


 for (BitsSet = 0; BitField; BitsSet++)
 {
 /* Zero the least significant bit that is set */

 BitField &= (UINT16) (BitField - 1);
 }

 return (BitsSet);
 }

 (I ran into this accidentally because a kernel built on my system failed
 to boot in older qemu because the kernel paniced with an illegal 
 instruction
 fault in this function.)
 Does it do the same for the similar home made popcount in pmap?:
 Yes:

 ffffffff807658d4:   f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
 ffffffff807658db:   80
 ffffffff807658dc:   74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
 ffffffff807658de:   48 89 4d b8             mov    %rcx,-0x48(%rbp)
 ffffffff807658e2:   f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
 ffffffff807658e8:   48 8b 50 20             mov    0x20(%rax),%rdx
 ffffffff807658ec:   48 89 55 b0             mov    %rdx,-0x50(%rbp)
 ffffffff807658f0:   f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
 ffffffff807658f6:   01 ca                   add    %ecx,%edx
 ffffffff807658f8:   48 8b 48 28             mov    0x28(%rax),%rcx
 ffffffff807658fc:   48 89 4d a8             mov    %rcx,-0x58(%rbp)
 ffffffff80765900:   f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
 ffffffff80765906:   eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
 ffffffff80765908:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 ffffffff8076590f:   00
 ffffffff80765910:   f3 48 0f b8 c9          popcnt %rcx,%rcx
 ffffffff80765915:   f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
 ffffffff8076591b:   01 ca                   add    %ecx,%edx
 ffffffff8076591d:   f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
 ffffffff80765923:   01 d1                   add    %edx,%ecx

 It also uses popcnt for this in blist_fill() and blist_meta_fill():

 742 /* Count the number of blocks we're about to allocate */
 743 bitmap = scan->u.bmu_bitmap & mask;
 744 for (nblks = 0; bitmap != 0; nblks++)
 745 bitmap &= bitmap - 1;

 Always using new API would lose the micro-optimizations given by the runtime
 decision for default CFLAGS (used by distributions for portability).  To
 keep them, it seems best to keep the inline asm but replace
 popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
 be used to work around slowness in the software (that is actually
 hardware) case.
 I'm not sure if bitcount64() is strictly better than the loop in this case
 even though it is O(1) given the claimed nature of the values in the comment.



 I checked.  Even with zeroes being more common than ones, bitcount64()
 is faster than the simple loop.  Using bitcount64, reserve_pv_entries()
 takes on average 4265 cycles during buildworld on my test machine.  In
 contrast, with the simple loop, it takes on average 4507 cycles.  Even
 though bitcount64 is a lot larger than the simple loop, we do the 3 bit
 count operations many times in a loop, so the extra i-cache 

Re: svn commit: r280279 - head/sys/sys

2015-04-13 Thread Jung-uk Kim

On 04/13/2015 13:36, Alan Cox wrote:
 On 03/30/2015 10:50, John Baldwin wrote:
 On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
 On Sat, 21 Mar 2015, John Baldwin wrote:
 
 On 3/21/15 12:35 PM, Konstantin Belousov wrote:
 On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin
 wrote:
 On 3/20/15 9:02 AM, Konstantin Belousov wrote:
 On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin
 wrote:
 Author: jhb Date: Fri Mar 20 10:27:06 2015 New
 Revision: 280279 URL:
 https://svnweb.freebsd.org/changeset/base/280279
 
 Log: Expand the bitcount* API to support 64-bit
 integers, plain ints and longs and create a hidden
 API that can be used in other system headers without 
 adding namespace pollution. - If the POPCNT
 instruction is enabled at compile time, use 
 __builtin_popcount*() to implement __bitcount*(),
 otherwise fall back to software implementations.
 Are you aware of the Haswell errata HSD146 ?  I see the
 described behaviour on machines back to SandyBridge,
 but not on Nehalems. HSD146.   POPCNT Instruction May
 Take Longer to Execute Than Expected Problem: POPCNT
 instruction execution with a 32 or 64 bit operand may
 be delayed until previous non-dependent instructions
 have executed.
 
 Jilles noted that gcc head and 4.9.2 already provides a
 workaround by xoring the dst register.  I have some
 patch for amd64 pmap, see the end of the message.
 No, I was not aware, but I think it's hard to fix this
 anywhere but the compiler.  I set CPUTYPE in src.conf on
 my Ivy Bridge desktop and clang uses POPCOUNT for this
 function from ACPI-CA:
 
 static UINT8
 AcpiRsCountSetBits (
 UINT16  BitField)
 {
 UINT8   BitsSet;
 
 ACPI_FUNCTION_ENTRY ();
 
 for (BitsSet = 0; BitField; BitsSet++)
 {
 /* Zero the least significant bit that is set */
 
 BitField &= (UINT16) (BitField - 1);
 }
 
 return (BitsSet);
 }
 
 (I ran into this accidentally because a kernel built on
 my system failed to boot in older qemu because the kernel
 paniced with an illegal instruction fault in this
 function.)
 Does it do the same for the similar home made popcount in
 pmap?:
 Yes:
 
 ffffffff807658d4:   f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
 ffffffff807658db:   80
 ffffffff807658dc:   74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
 ffffffff807658de:   48 89 4d b8             mov    %rcx,-0x48(%rbp)
 ffffffff807658e2:   f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
 ffffffff807658e8:   48 8b 50 20             mov    0x20(%rax),%rdx
 ffffffff807658ec:   48 89 55 b0             mov    %rdx,-0x50(%rbp)
 ffffffff807658f0:   f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
 ffffffff807658f6:   01 ca                   add    %ecx,%edx
 ffffffff807658f8:   48 8b 48 28             mov    0x28(%rax),%rcx
 ffffffff807658fc:   48 89 4d a8             mov    %rcx,-0x58(%rbp)
 ffffffff80765900:   f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
 ffffffff80765906:   eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
 ffffffff80765908:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 ffffffff8076590f:   00
 ffffffff80765910:   f3 48 0f b8 c9          popcnt %rcx,%rcx
 ffffffff80765915:   f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
 ffffffff8076591b:   01 ca                   add    %ecx,%edx
 ffffffff8076591d:   f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
 ffffffff80765923:   01 d1                   add    %edx,%ecx
 
 It also uses popcnt for this in blist_fill() and
 blist_meta_fill():
 
 742 /* Count the number of blocks we're about to allocate */
 743 bitmap = scan->u.bmu_bitmap & mask;
 744 for (nblks = 0; bitmap != 0; nblks++)
 745 bitmap &= bitmap - 1;
 
 Always using new API would lose the micro-optimizations given
 by the runtime decision for default CFLAGS (used by
 distributions for portability).  To keep them, it seems best to
 keep the inline asm but replace popcnt_pc_map_elem(elem) by
 __bitcount64(elem).  -mno-popcount can then be used to work
 around slowness in the software (that is actually hardware)
 case.
 I'm not sure if bitcount64() is strictly better than the loop in
 this case even though it is O(1) given the claimed nature of the
 values in the comment.
 
 
 
 I checked.  Even with zeroes being more common than ones,
 bitcount64() is faster than the simple loop.  Using bitcount64,
 reserve_pv_entries() takes on average 4265 cycles during
 buildworld on my test machine.  In contrast, with the simple
 loop, it takes on average 4507 cycles.  Even though bitcount64 is a
 lot larger than the simple loop, we do the 3 bit count operations
 many times in a loop, so the extra i-cache misses are being made up
 for by the repeated execution of the faster code.
 
 However, in the popcnt case, we are spilling the bit map to memory
 in order to popcnt it.  That's rather silly:
 
 3570:   48 8b 48 18             mov    0x18(%rax),%rcx
 3574:   f6 04 25 00 00 00 00    testb  

Re: svn commit: r280279 - head/sys/sys

2015-04-13 Thread Alan Cox
On 03/30/2015 10:50, John Baldwin wrote:
 On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
 On Sat, 21 Mar 2015, John Baldwin wrote:

 On 3/21/15 12:35 PM, Konstantin Belousov wrote:
 On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
 On 3/20/15 9:02 AM, Konstantin Belousov wrote:
 On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
 Author: jhb
 Date: Fri Mar 20 10:27:06 2015
 New Revision: 280279
 URL: https://svnweb.freebsd.org/changeset/base/280279

 Log:
   Expand the bitcount* API to support 64-bit integers, plain ints and 
 longs
   and create a hidden API that can be used in other system headers 
 without
   adding namespace pollution.
   - If the POPCNT instruction is enabled at compile time, use
 __builtin_popcount*() to implement __bitcount*(), otherwise fall 
 back
 to software implementations.
 Are you aware of the Haswell errata HSD146 ?  I see the described 
 behaviour
 on machines back to SandyBridge, but not on Nehalems.
 HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
 Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
 delayed until previous non-dependent instructions have executed.

 Jilles noted that gcc head and 4.9.2 already provides a workaround by
 xoring the dst register.  I have some patch for amd64 pmap, see the end
 of the message.
 No, I was not aware, but I think it's hard to fix this anywhere but the
 compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
 uses POPCOUNT for this function from ACPI-CA:

 static UINT8
 AcpiRsCountSetBits (
 UINT16  BitField)
 {
 UINT8   BitsSet;


 ACPI_FUNCTION_ENTRY ();


 for (BitsSet = 0; BitField; BitsSet++)
 {
 /* Zero the least significant bit that is set */

 BitField &= (UINT16) (BitField - 1);
 }

 return (BitsSet);
 }

 (I ran into this accidentally because a kernel built on my system failed
 to boot in older qemu because the kernel paniced with an illegal 
 instruction
 fault in this function.)
 Does it do the same for the similar home made popcount in pmap?:
 Yes:

 807658d4:   f6 04 25 46 e2 d6 80testb  
 $0x80,0x80d6e246
 807658db:   80 
 807658dc:   74 32   je 80765910 
 pmap_demote_pde_locked+0x4d0
 807658de:   48 89 4d b8 mov%rcx,-0x48(%rbp)
 807658e2:   f3 48 0f b8 4d b8   popcnt -0x48(%rbp),%rcx
 807658e8:   48 8b 50 20 mov0x20(%rax),%rdx
 807658ec:   48 89 55 b0 mov%rdx,-0x50(%rbp)
 807658f0:   f3 48 0f b8 55 b0   popcnt -0x50(%rbp),%rdx
 807658f6:   01 ca   add%ecx,%edx
 807658f8:   48 8b 48 28 mov0x28(%rax),%rcx
 807658fc:   48 89 4d a8 mov%rcx,-0x58(%rbp)
 80765900:   f3 48 0f b8 4d a8   popcnt -0x58(%rbp),%rcx
 80765906:   eb 1b   jmp80765923 
 pmap_demote_pde_locked+0x4e3
 80765908:   0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
 8076590f:   00 
 80765910:   f3 48 0f b8 c9  popcnt %rcx,%rcx
 80765915:   f3 48 0f b8 50 20   popcnt 0x20(%rax),%rdx
 8076591b:   01 ca   add%ecx,%edx
 8076591d:   f3 48 0f b8 48 28   popcnt 0x28(%rax),%rcx
 80765923:   01 d1   add%edx,%ecx

 It also uses popcnt for this in blist_fill() and blist_meta_fill():

 742 /* Count the number of blocks we're about to allocate */
 743 bitmap = scan->u.bmu_bitmap & mask;
 744 for (nblks = 0; bitmap != 0; nblks++)
 745 bitmap &= bitmap - 1;

 Always using new API would lose the micro-optimizations given by the runtime
 decision for default CFLAGS (used by distributions for portability).  To
 keep them, it seems best to keep the inline asm but replace
 popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
 be used to work around slowness in the software (that is actually
 hardware) case.
 I'm not sure if bitcount64() is strictly better than the loop in this case
 even though it is O(1) given the claimed nature of the values in the comment.



I checked.  Even with zeroes being more common than ones, bitcount64()
is faster than the simple loop.  Using bitcount64, reserve_pv_entries()
takes on average 4265 cycles during buildworld on my test machine.  In
contrast, with the simple loop, it takes on average 4507 cycles.  Even
though bitcount64 is a lot larger than the simple loop, we do the 3 bit
count operations many times in a loop, so the extra i-cache misses are
being made up for by the repeated execution of the faster code.

However, in the popcnt case, we are spilling the bit map to memory in
order to popcnt it.  That's rather silly:


Re: svn commit: r280279 - head/sys/sys

2015-03-31 Thread Konstantin Belousov
On Tue, Mar 31, 2015 at 03:49:28PM +1100, Bruce Evans wrote:
 It looks a bit overengineered to me.  A bit like my function pointers
 for the bcopy() family on i386.  bcopy() is a bulk operation, so in
 theory you can do it much faster by selecting the best available
 version at runtime.  In practice, the gains were not large and are
 too machine-dependent to maintain.  It is even harder to get large
 gains and maintain them by selecting individual instructions at runtime.
 
Yes, it is similar to bcopy.  The difference in motivation is that the
IFUNCs are for features, not for speed.  The existing patch already
demonstrates this WRT self-snoop and different methods of saving FPU
state.



Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread John Baldwin
On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
 On Sat, 21 Mar 2015, John Baldwin wrote:
 
  On 3/21/15 12:35 PM, Konstantin Belousov wrote:
  On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
  On 3/20/15 9:02 AM, Konstantin Belousov wrote:
  On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
  Author: jhb
  Date: Fri Mar 20 10:27:06 2015
  New Revision: 280279
  URL: https://svnweb.freebsd.org/changeset/base/280279
 
  Log:
Expand the bitcount* API to support 64-bit integers, plain ints and 
  longs
and create a hidden API that can be used in other system headers 
  without
adding namespace pollution.
- If the POPCNT instruction is enabled at compile time, use
  __builtin_popcount*() to implement __bitcount*(), otherwise fall 
  back
  to software implementations.
  Are you aware of the Haswell errata HSD146 ?  I see the described 
  behaviour
  on machines back to SandyBridge, but not on Nehalems.
  HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
  Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
  delayed until previous non-dependent instructions have executed.
 
  Jilles noted that gcc head and 4.9.2 already provides a workaround by
  xoring the dst register.  I have some patch for amd64 pmap, see the end
  of the message.
 
  No, I was not aware, but I think it's hard to fix this anywhere but the
  compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
  uses POPCOUNT for this function from ACPI-CA:
 
  static UINT8
  AcpiRsCountSetBits (
  UINT16  BitField)
  {
  UINT8   BitsSet;
 
 
  ACPI_FUNCTION_ENTRY ();
 
 
  for (BitsSet = 0; BitField; BitsSet++)
  {
  /* Zero the least significant bit that is set */
 
 BitField &= (UINT16) (BitField - 1);
  }
 
  return (BitsSet);
  }
 
  (I ran into this accidentally because a kernel built on my system failed
  to boot in older qemu because the kernel paniced with an illegal 
  instruction
  fault in this function.)
 
 Does it do the same for the similar home made popcount in pmap?:

Yes:

ffffffff807658d4:   f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
ffffffff807658db:   80
ffffffff807658dc:   74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
ffffffff807658de:   48 89 4d b8             mov    %rcx,-0x48(%rbp)
ffffffff807658e2:   f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
ffffffff807658e8:   48 8b 50 20             mov    0x20(%rax),%rdx
ffffffff807658ec:   48 89 55 b0             mov    %rdx,-0x50(%rbp)
ffffffff807658f0:   f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
ffffffff807658f6:   01 ca                   add    %ecx,%edx
ffffffff807658f8:   48 8b 48 28             mov    0x28(%rax),%rcx
ffffffff807658fc:   48 89 4d a8             mov    %rcx,-0x58(%rbp)
ffffffff80765900:   f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
ffffffff80765906:   eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
ffffffff80765908:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
ffffffff8076590f:   00
ffffffff80765910:   f3 48 0f b8 c9          popcnt %rcx,%rcx
ffffffff80765915:   f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
ffffffff8076591b:   01 ca                   add    %ecx,%edx
ffffffff8076591d:   f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
ffffffff80765923:   01 d1                   add    %edx,%ecx

It also uses popcnt for this in blist_fill() and blist_meta_fill():

742 /* Count the number of blocks we're about to allocate */
743 bitmap = scan->u.bmu_bitmap & mask;
744 for (nblks = 0; bitmap != 0; nblks++)
745 bitmap &= bitmap - 1;
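
For readers unfamiliar with the idiom quoted above: "bitmap &= bitmap - 1"
clears the lowest set bit, so the loop body runs once per one-bit.  A
minimal standalone sketch of the same counting technique (the names are
invented for illustration; this is not the kernel code itself):

	#include <stdint.h>
	#include <stdio.h>

	/* Count set bits by repeatedly clearing the lowest one. */
	static int
	clear_lowest_count(uint64_t bitmap)
	{
		int nblks;

		for (nblks = 0; bitmap != 0; nblks++)
			bitmap &= bitmap - 1;	/* clear lowest set bit */
		return (nblks);
	}

	int
	main(void)
	{

		printf("%d\n", clear_lowest_count(0xf0f0));	/* prints 8 */
		return (0);
	}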

 Always using new API would lose the micro-optimizations given by the runtime
 decision for default CFLAGS (used by distributions for portability).  To
 keep them, it seems best to keep the inline asm but replace
 popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
 be used to work around slowness in the software (that is actually
 hardware) case.

I'm not sure if bitcount64() is strictly better than the loop in this case
even though it is O(1) given the claimed nature of the values in the comment.

-- 
John Baldwin


Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread John Baldwin
On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:
 On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
  Always using new API would lose the micro-optimizations given by the runtime
  decision for default CFLAGS (used by distributions for portability).  To
  keep them, it seems best to keep the inline asm but replace
  popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
  be used to work around slowness in the software (that is actually
  hardware) case.
 
 So anybody has to compile his own kernel to get popcnt optimization ?
 We do care about trivial things that improve time.

That is not what Bruce said.  He suggested using bitcount64() for the fallback
if the cpuid check fails.  He did not say to remove the runtime check to use
popcnt if it is available:

Always using [bitcount64] would lose the micro-optimization... [to] keep
[it], it seems best to keep the inline asm but replace popcnt_pc_map_elem(elem)
by [bitcount64(elem)].

 BTW, I have the following WIP change, which popcnt xorl is a piece of.
 It emulates the ifuncs with some preprocessing mess.  It is much better
 than runtime patching, and is a prerequisite to properly support more
 things, like SMAP.  I did not published it earlier, since I wanted to
 convert TLB flush code to this.

This looks fine to me.  It seems to be manually converting certain symbols
to use a dynamic lookup that must be explicitly resolved before first
use?  I agree that this seems cleaner than runtime patching.

-- 
John Baldwin


Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread Konstantin Belousov
On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote:
 On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:
  On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
   Always using new API would lose the micro-optimizations given by the 
   runtime
   decision for default CFLAGS (used by distributions for portability).  To
   keep them, it seems best to keep the inline asm but replace
   popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
   be used to work around slowness in the software (that is actually
   hardware) case.
  
  So anybody has to compile his own kernel to get popcnt optimization ?
  We do care about trivial things that improve time.
 
 That is not what Bruce said.  He suggested using bitcount64() for the fallback
 if the cpuid check fails.  He did not say to remove the runtime check to use
 popcnt if it is available:
 
 Always using [bitcount64] would lose the micro-optimization... [to] keep
 [it], it seems best to keep the inline asm but replace 
 popcnt_pc_map_elem(elem)
 by [bitcount64(elem)].
Ok, thank you for the clarification.

I updated the pmap patch, see the end of the message.
 
  BTW, I have the following WIP change, which popcnt xorl is a piece of.
  It emulates the ifuncs with some preprocessing mess.  It is much better
  than runtime patching, and is a prerequisite to properly support more
  things, like SMAP.  I did not published it earlier, since I wanted to
  convert TLB flush code to this.
 
 This looks fine to me.  It seems to be manually converting certain symbols
 to use a dynamic lookup that must be explicitly resolved before first
 use?
I am not sure what you mean by "dynamic lookup", but possibly it
was mentioned. I can emulate the ifuncs more sincerely, by requiring
a resolver function, which is called on the first real function
invocation. I did not see it as very useful, but it is definitely
doable.
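
A minimal userland sketch of that resolver-on-first-call scheme (all
names are illustrative; cpu_feature2 and CPUID2_POPCNT stand in for the
kernel's feature bits, and this is not the actual patch):

	#include <sys/types.h>
	#include <stdint.h>

	extern u_int cpu_feature2;		/* kernel feature word */
	#define	CPUID2_POPCNT	0x00800000	/* CPUID.1:ECX bit 23 */

	static int	popcnt_resolver(uint64_t);

	/* Calls go through this pointer, which starts at the resolver. */
	static int	(*popcnt_impl)(uint64_t) = popcnt_resolver;

	static int
	popcnt_hw(uint64_t x)
	{
		u_long result;

		__asm __volatile("xorl %k0,%k0; popcntq %1,%0"
		    : "=r" (result) : "rm" (x));
		return (result);
	}

	static int
	popcnt_sw(uint64_t x)
	{
		int count;

		for (count = 0; x != 0; x &= x - 1)
			count++;
		return (count);
	}

	/* Picks an implementation on the first call, then steps aside. */
	static int
	popcnt_resolver(uint64_t x)
	{

		popcnt_impl = (cpu_feature2 & CPUID2_POPCNT) != 0 ?
		    popcnt_hw : popcnt_sw;
		return (popcnt_impl(x));
	}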

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..fcfba56 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -412,7 +416,7 @@ static caddr_t crashdumpmap;
 static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int popcnt_pc_map_elem(uint64_t elem);
+static int popcnt_pc_map_elem_pq(uint64_t elem);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void	reserve_pv_entries(pmap_t pmap, int needed,
struct rwlock **lockp);
@@ -2980,20 +3002,27 @@ retry:
 
 /*
  * Returns the number of one bits within the given PV chunk map element.
+ *
+ * The erratas for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for erratas are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
  */
 static int
-popcnt_pc_map_elem(uint64_t elem)
+popcnt_pc_map_elem_pq(uint64_t elem)
 {
-   int count;
+   u_long result;
 
-   /*
-* This simple method of counting the one bits performs well because
-* the given element typically contains more zero bits than one bits.
-*/
-   count = 0;
-   for (; elem != 0; elem &= elem - 1)
-   count++;
-   return (count);
+   __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+   : "=r" (result) : "rm" (elem));
+   return (result);
 }
 
 /*
@@ -3025,13 +3054,13 @@ retry:
avail = 0;
TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
-   free = popcnt_pc_map_elem(pc->pc_map[0]);
-   free += popcnt_pc_map_elem(pc->pc_map[1]);
-   free += popcnt_pc_map_elem(pc->pc_map[2]);
+   free = bitcount64(pc->pc_map[0]);
+   free += bitcount64(pc->pc_map[1]);
+   free += bitcount64(pc->pc_map[2]);
} else {
-   free = popcntq(pc->pc_map[0]);
-   free += popcntq(pc->pc_map[1]);
-   free += popcntq(pc->pc_map[2]);
+   free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+   free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+   free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
}
if (free == 0)
break;


Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread John Baldwin
On Monday, March 30, 2015 08:24:34 PM Konstantin Belousov wrote:
  That is not what Bruce said.  He suggested using bitcount64() for the 
  fallback
  if the cpuid check fails.  He did not say to remove the runtime check to use
  popcnt if it is available:
  
  Always using [bitcount64] would lose the micro-optimization... [to] keep
  [it], it seems best to keep the inline asm but replace 
  popcnt_pc_map_elem(elem)
  by [bitcount64(elem)].
 Ok, thank you for the clarification.
 
 I updated the pmap patch, see the end of the message.

I think the pmap change looks fine.  If we know which compilers include a
workaround we might also consider specifying -mno-popcount for everything
except known-ok compilers in at least kern.mk.

  This looks fine to me.  It seems to be manually converting certain symbols
  to use a dynamic lookup that must be explicitly resolved before first
  use?
 I am not sure what do you mean by dynamic lookup, but possibly it
 was mentioned. I can emulate the ifuncs more sincerely, by requiring
 a resolver function, which is called on the first real function
 invocation. I did not see it as very useful, but it is definitely
 doable.

I just mean that the effect at runtime is similar to that of dynamic
symbols once they are resolved (a call into a PLT entry (or is it GOT?
I keep getting those confused) that does a jump to the resolved symbol).

-- 
John Baldwin


Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread Konstantin Belousov
On Mon, Mar 30, 2015 at 04:37:10PM -0400, John Baldwin wrote:
 On Monday, March 30, 2015 08:24:34 PM Konstantin Belousov wrote:
   That is not what Bruce said.  He suggested using bitcount64() for the 
   fallback
   if the cpuid check fails.  He did not say to remove the runtime check to 
   use
   popcnt if it is available:
   
   Always using [bitcount64] would lose the micro-optimization... [to] keep
   [it], it seems best to keep the inline asm but replace 
   popcnt_pc_map_elem(elem)
   by [bitcount64(elem)].
  Ok, thank you for the clarification.
  
  I updated the pmap patch, see the end of the message.
 
 I think the pmap change looks fine.  If we know which compilers include a
 workaround we might also consider specifying -mno-popcount for everything
 except known-ok compilers in at least kern.mk.
Right now the compilers which implement the workaround are gcc 4.9.2 and
gcc trunk, to be released as gcc 5.0.  In-tree clang 3.6.0 does not
try to eliminate the false dependency.

 
   This looks fine to me.  It seems to be manually converting certain symbols
   to use a dynamic lookup that must be explicitly resolved before first
   use?
  I am not sure what do you mean by dynamic lookup, but possibly it
  was mentioned. I can emulate the ifuncs more sincerely, by requiring
  a resolver function, which is called on the first real function
  invocation. I did not see it as very useful, but it is definitely
  doable.
 
 I just mean that the effect at runtime is similar to that of dynamic
 symbols once they are resolved (a call into a PLT entry (or is it GOT?
 I keep getting those confused) that does a jump to the resolved symbol).
PLT is slightly more expensive, since after the resolution it gives
	call	symbol@plt
	jmp	*symbol@gotpcrel
while this code results in
	call	*symbol_selector
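
In other words, the emulation amounts to an indirect call through a
per-symbol function pointer.  A sketch of the shape of a call site (the
symbol name here is invented for illustration):

	/* Selector pointer, filled in during early boot. */
	extern void	(*pmap_flush_page_sel)(void *page);

	static inline void
	pmap_flush_page(void *page)
	{

		/* Typically compiles to: call *pmap_flush_page_sel(%rip) */
		(*pmap_flush_page_sel)(page);
	}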


Re: svn commit: r280279 - head/sys/sys

2015-03-30 Thread Bruce Evans

On Mon, 30 Mar 2015, Konstantin Belousov wrote:


On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote:

On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:

...
So anybody has to compile his own kernel to get popcnt optimization ?
We do care about trivial things that improve time.


I'm not sure we do.  The microoptimizations give speed improvements
in the 1-10% range, while general bloat gives speed unimprovements in
the 10-500% range.


I updated the pmap patch, see the end of the message.


Seems OK, except for verbose names.


BTW, I have the following WIP change, which popcnt xorl is a piece of.
It emulates the ifuncs with some preprocessing mess.  It is much better
than runtime patching, and is a prerequisite to properly support more
things, like SMAP.  I did not published it earlier, since I wanted to
convert TLB flush code to this.


This looks fine to me.  It seems to be manually converting certain symbols
to use a dynamic lookup that must be explicitly resolved before first
use?


It looks a bit overengineered to me.  A bit like my function pointers
for the bcopy() family on i386.  bcopy() is a bulk operation, so in
theory you can do it much faster by selecting the best available
version at runtime.  In practice, the gains were not large and are
too machine-dependent to maintain.  It is even harder to get large
gains and maintain them by selecting individual instructions at runtime.


diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..fcfba56 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -412,7 +416,7 @@ static caddr_t crashdumpmap;
static void free_pv_chunk(struct pv_chunk *pc);
static void free_pv_entry(pmap_t pmap, pv_entry_t pv);
static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
-static int popcnt_pc_map_elem(uint64_t elem);
+static int popcnt_pc_map_elem_pq(uint64_t elem);


popcnt_pc_map_elem_pq() is the most verbose name used here.  It is not
necessary to encrypt a function's man page in its name.


@@ -2980,20 +3002,27 @@ retry:

/*
 * Returns the number of one bits within the given PV chunk map element.
+ *
+ * The erratas for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for erratas are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
 */
static int
-popcnt_pc_map_elem(uint64_t elem)
+popcnt_pc_map_elem_pq(uint64_t elem)


Here the function has a specialized use, but internally it is just the
popcntq() cpufunc, except that doesn't have the xorl workaround or the
above documentation for the workaround.


{
-   int count;
+   u_long result;

-   /*
-* This simple method of counting the one bits performs well because
-* the given element typically contains more zero bits than one bits.
-*/
-   count = 0;
-   for (; elem != 0; elem &= elem - 1)
-   count++;
-   return (count);
+   __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+   : "=r" (result) : "rm" (elem));
+   return (result);
}


Style:
- there should be a space after ";" in the asm.
- loading 0 should be in the operands.  Then you don't need the "k"
  modification.  It should be able to load 0 using xorl iff that is
  best for the current arch.  Perhaps a register is already 0, and it
  is best to do a null load so as to use that register.  (A sketch of
  this follows below.)
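
A sketch of that suggestion, moving the zero load into the operands via
a GCC-style matching constraint (untested, shown only to illustrate the
point):

	static __inline u_long
	popcntq(u_long mask)
	{
		u_long result;

		result = 0;	/* the compiler materializes the zero */
		/* "0" ties the zero-valued input to output operand 0,
		   so the asm body shrinks to the single instruction. */
		__asm __volatile("popcntq %1,%0"
		    : "=r" (result) : "rm" (mask), "0" (result));
		return (result);
	}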

The patch is missing removal of the popcntq() cpufunc.  This function was
only used here, and is now unused.

cpufunc.h is supposed to contain only wrappers for single instructions,
to be combined if necessary to create larger functions.  popcntq()
follows this rule (some other functions are broken).  popcntq() with
the load of 0 wouldn't satisify it strictly, but there is no other way
to fix it since register allocation is unavailable across inline asms.
Actually, if you move the load to the operands, it would follow the 
rule strictly enough -- many of the asms need more than a single
instruction to load things.

Bruce


Re: svn commit: r280279 - head/sys/sys

2015-03-22 Thread Konstantin Belousov
On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
 On Sat, 21 Mar 2015, John Baldwin wrote:
 
  On 3/21/15 12:35 PM, Konstantin Belousov wrote:
 X static int
 X popcnt_pc_map_elem(uint64_t elem)
 X {
 X int count;
 X 
 X /*
 X  * This simple method of counting the one bits performs well because
 X  * the given element typically contains more zero bits than one bits.
 X  */
 X count = 0;
X for (; elem != 0; elem &= elem - 1)
 X count++;
 X return (count);
 X }
 
...
 
 Perhaps this appeared to perform well not for the reason stated in its
 comment, but because it was tested with excessive CFLAGS, and the compiler
 actually reduced it to popcntq, and the test machine didn't have the
 bug.
No, it probably did not, since popcntq does not exist on the generic
amd64 machine.

 
 In a game program that I wrote, efficiency of popcount() was actually
 important since it was used in inner loops.  I used lookup tables for
 popcount() and a couple of application-specific functions and even for
fls().  This seemed to be better than the existing bsf instruction,
 so it can't be bad for popcount().  The lookup was limited to bitsets
 of length 13 in the low bits of an int, so the table sizes were only
 2**13 = 4096.  bitsets of length 64 would need multiple steps.  However,
 if the above performs well without compiler optimizations, then it must
 be because usually only the lowest few bits are set.  Then it would
 perform even better with even smaller (8-bit) tables combined with an
 early exit:
 
  count = 0;
  for (; elem != 0; elem >>= 8)
  count += lookup[elem & 0xff];
  return (count);
 
 Many variations on this are possible.  E.g., use fls() to decide where to
 start; or avoid all branches, and add up the results in parallel.  For
 bitcount32, the latter is:
 
  bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32);
  bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16);
  bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8);
  bitcount8(x) := lookup[x];
 
 Compilers won't be able to optimize the lookup methods.  They might be
 able to convert the current bitcount*() to popcnt*.  Last time I looked,
 clang but not gcc converted very complicated expressions for bswap*()
 to bswap* instructions.
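
A sketch of the 8-bit-table variant described above (illustrative only,
not the game code; init_lookup() must be called once before use):

	#include <stdint.h>

	static unsigned char lookup[256];

	/* lookup[i] = number of one bits in i. */
	static void
	init_lookup(void)
	{
		int i;

		for (i = 1; i < 256; i++)
			lookup[i] = (i & 1) + lookup[i >> 1];
	}

	/* Byte-at-a-time popcount with an early exit on zero. */
	static int
	popcount64_lookup(uint64_t elem)
	{
		int count;

		for (count = 0; elem != 0; elem >>= 8)
			count += lookup[elem & 0xff];
		return (count);
	}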
 
  There's no way we can preemptively locate every bit of C that clang might
  decide to replace with popcount and fix them to employ a workaround.  I 
  think
  the only viable solution is to use -mno-popcount on all compilers that 
  aren't
  known to be fixed.
 
 Why bother?  It is only a short-lived and small(?) efficiency bug.
 
  Both you and Bruce and Jilles state this, but I am unable to understand the
  argument about the compiler.  Can somebody explain this to me, please ?
 
 It is also to avoid complications for short-lived optimizations.  The
 3 uses of popcntq() in amd64 pmap cost:
 - 9 lines of inline asm.
 - 19 lines for the C version
 - 9 lines instead of 3 for the 3 uses
 When popcntq is not available, the result is a pessimization since the
 optimizations are not complicated enough to be useful in that case.
popcnt is not available for machines older than Nehalems.
Nobody cares about the last bits of the speed for such CPUs.

 They give the runtime overhead of a branch to call the C version that
 is presumably worse than using the existing bitcount32() (possibly
 twice) or the compiler builtin.
(jhb's cleanups didn't bother to optimize to use the builtin in
all cases, since it is hard to tell if the builtin is any good if
it is in software.  If you remove the 19 lines for
popcnt_pc_map_elem() and replace calls to it by __builtin_popcountl(),
then the results would be:
- compile-time failure for old unsupported compilers that don't have this
  builtin
- usually, a link-time failure for gcc-4.2 through gcc-4.8.  gcc
  generates a call to __popcountdi2 unless CFLAGS enables the hardware
  instruction.  __popcountdi2 is in libgcc but not in libkern.
- usually, 3 inline copies of the same code that would be produced if
  FreeBSD's bitcount64() were used for clang.  clang unrolls things
  excessively, giving enormous code.  Unrolling allows it to load the
  large constants in __bitcount64() only once, but large code is still
  needed to use these constants.
- runtime failures in misconfigured cases where CFLAGS doesn't match
  the hardware.  Both gcc-4.8 and clang produce popcntq.  The runtime
  check doesn't help since the compilers produce popcntq for the C
  case.  Using __builtin_popcountl() asks for this directly.  Using
  __bitcount64() asks for it indirectly, and gets it for clang.
  Using popcnt_pc_map_elem() may avoid getting it for clang, but
  apparentely still gets it.)
 When popcntq is available, the test to decide whether to use it is a
 small pessimization.  It might take longer than a 

Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread John Baldwin
On 3/21/15 12:35 PM, Konstantin Belousov wrote:
 On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
 On 3/20/15 9:02 AM, Konstantin Belousov wrote:
 On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
 Author: jhb
 Date: Fri Mar 20 10:27:06 2015
 New Revision: 280279
 URL: https://svnweb.freebsd.org/changeset/base/280279

 Log:
   Expand the bitcount* API to support 64-bit integers, plain ints and longs
   and create a hidden API that can be used in other system headers 
 without
   adding namespace pollution.
   - If the POPCNT instruction is enabled at compile time, use
 __builtin_popcount*() to implement __bitcount*(), otherwise fall back
 to software implementations.
 Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
 on machines back to SandyBridge, but not on Nehalems.
 HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
 Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
 delayed until previous non-dependent instructions have executed.

 Jilles noted that gcc head and 4.9.2 already provides a workaround by
 xoring the dst register.  I have some patch for amd64 pmap, see the end
 of the message.

 No, I was not aware, but I think it's hard to fix this anywhere but the
 compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
 uses POPCOUNT for this function from ACPI-CA:

 static UINT8
 AcpiRsCountSetBits (
 UINT16  BitField)
 {
 UINT8   BitsSet;


 ACPI_FUNCTION_ENTRY ();


 for (BitsSet = 0; BitField; BitsSet++)
 {
 /* Zero the least significant bit that is set */

 BitField &= (UINT16) (BitField - 1);
 }

 return (BitsSet);
 }

 (I ran into this accidentally because a kernel built on my system failed
 to boot in older qemu because the kernel paniced with an illegal instruction
 fault in this function.)

 There's no way we can preemptively locate every bit of C that clang might
 decide to replace with popcount and fix them to employ a workaround.  I think
 the only viable solution is to use -mno-popcount on all compilers that 
 aren't
 known to be fixed.
 Both you and Bruce and Jilles state this, but I am unable to understand the
 argument about the compiler.  Can somebody explain this to me, please ?

If you compile with -mpopcount (or a march that includes it like -march=corei7)
then the compiler will assume that popcount is available and will use it without
doing any runtime checks.  In the case of my sandybridge desktop, I have
CPUTYPE set to corei7-avx in /etc/make.conf which adds -march=corei7-avx
to CFLAGS, so the compiler assumes it can use POPCOUNT (as well as newer
SSE and AVX).  In this case the compiler recognized the pattern in the C
function above as doing a population count and replaced the entire function with
a POPCOUNT instruction even though the C code doesn't explicitly try to use it.

However, if we know that a compiler doesn't employ the workaround for this
errata, we can employ -mno-popcount so that using a custom CPUTYPE does not
result in use of POPCOUNT except for known-ok compilers.  Either that or
we use a more elaborate set of #if checks in sys/types.h that only uses
__builtin_popcount() for known-ok compilers (and possibly uses the
clear-destination workaround for other compilers if -mpopcount is enabled).
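
A sketch of what such a gate could look like (the compiler-version
cutoff below is only illustrative; per this thread, gcc 4.9.2 and later
carry the workaround while the in-tree clang does not, and
__bitcount64_sw() is a hypothetical name for the software fallback):

	/*
	 * Use the builtin only when POPCNT is enabled at compile time
	 * and the compiler is known to emit the xor workaround;
	 * otherwise fall back to a software (SWAR) implementation.
	 */
	#if defined(__POPCNT__) && defined(__GNUC__) && !defined(__clang__) && \
	    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
	#define	__bitcount64(x)	__builtin_popcountll(x)
	#else
	#define	__bitcount64(x)	__bitcount64_sw(x)
	#endif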

 Compiler cannot generate popcntq instructions unless it is directed to
 generate code not for amd64, but for some specific microarchitecture.
 Any use of bitcount in generic kernel or userspace is going to be a
 SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of
 the FreeBSD compilation environment.

Correct.

 And even then, compiler must issue cpuid to fetch the feature mask, which
 arguably compiler cannot do due to the serialization instruction cost.
 Compiler could generate the call to instruction in the init code, but
 this is impossible for freestanding environment, since compiler does not
 know where to put init code.

No, the compiler does not have to check cpuid.  If you've told it to compile
for a given CPU type, it has tables to hardcode the features known to be
present on those CPUs and just uses them.  If you use the wrong CPU type
you get illegal instruction faults. :)  (As I did in my qemu test case.)

 In regards to whether or not this should be a dynamic test instead, it seems 
 a
 bit much to require a dynamic check for code that has to work in both 
 userland
 and the kernel, and I'm not sure if something like CPU_COUNT() would actually
 benefit from that versus just always using the software solution instead.
 
 I agree.

Hence why I opted to only use POPCOUNT if someone explicitly enables it via
-mpopcount or -march flags.  If you've told the compiler it's ok to use it
without any checks, then it will use POPCOUNT, otherwise it will use the
SWAR method.
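
For concreteness, the SWAR method referred to here is, in outline, the
classic parallel bit count (a sketch consistent with the bitcount64()
discussed in this thread, not a verbatim copy of sys/types.h):

	#include <stdint.h>

	/* Sum adjacent bit pairs, then nibbles, then all eight bytes. */
	static inline int
	swar_bitcount64(uint64_t x)
	{

		x -= (x >> 1) & 0x5555555555555555ULL;
		x = (x & 0x3333333333333333ULL) +
		    ((x >> 2) & 0x3333333333333333ULL);
		x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
		return ((x * 0x0101010101010101ULL) >> 56);
	}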

-- 
John Baldwin

Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread John Baldwin
On 3/20/15 9:02 AM, Konstantin Belousov wrote:
 On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
 Author: jhb
 Date: Fri Mar 20 10:27:06 2015
 New Revision: 280279
 URL: https://svnweb.freebsd.org/changeset/base/280279

 Log:
   Expand the bitcount* API to support 64-bit integers, plain ints and longs
   and create a hidden API that can be used in other system headers without
   adding namespace pollution.
   - If the POPCNT instruction is enabled at compile time, use
 __builtin_popcount*() to implement __bitcount*(), otherwise fall back
 to software implementations.
 Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
 on machines back to SandyBridge, but not on Nehalems.
 HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
 Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
 delayed until previous non-dependent instructions have executed.
 
 Jilles noted that gcc head and 4.9.2 already provides a workaround by
 xoring the dst register.  I have some patch for amd64 pmap, see the end
 of the message.

No, I was not aware, but I think it's hard to fix this anywhere but the
compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
UINT16  BitField)
{
UINT8   BitsSet;


ACPI_FUNCTION_ENTRY ();


for (BitsSet = 0; BitField; BitsSet++)
{
/* Zero the least significant bit that is set */

BitField &= (UINT16) (BitField - 1);
}

return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system failed
to boot in older qemu because the kernel paniced with an illegal instruction
fault in this function.)

There's no way we can preemptively locate every bit of C that clang might
decide to replace with popcount and fix them to employ a workaround.  I think
the only viable solution is to use -mno-popcount on all compilers that aren't
known to be fixed.

In regards to whether or not this should be a dynamic test instead, it seems a
bit much to require a dynamic check for code that has to work in both userland
and the kernel, and I'm not sure if something like CPU_COUNT() would actually
benefit from that versus just always using the software solution instead.

-- 
John Baldwin


Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread Konstantin Belousov
On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
 On 3/20/15 9:02 AM, Konstantin Belousov wrote:
  On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
  Author: jhb
  Date: Fri Mar 20 10:27:06 2015
  New Revision: 280279
  URL: https://svnweb.freebsd.org/changeset/base/280279
 
  Log:
Expand the bitcount* API to support 64-bit integers, plain ints and longs
and create a hidden API that can be used in other system headers 
  without
adding namespace pollution.
- If the POPCNT instruction is enabled at compile time, use
  __builtin_popcount*() to implement __bitcount*(), otherwise fall back
  to software implementations.
  Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
  on machines back to SandyBridge, but not on Nehalems.
  HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
  Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
  delayed until previous non-dependent instructions have executed.
  
  Jilles noted that gcc head and 4.9.2 already provides a workaround by
  xoring the dst register.  I have some patch for amd64 pmap, see the end
  of the message.
 
 No, I was not aware, but I think it's hard to fix this anywhere but the
 compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
 uses POPCOUNT for this function from ACPI-CA:
 
 static UINT8
 AcpiRsCountSetBits (
 UINT16  BitField)
 {
 UINT8   BitsSet;
 
 
 ACPI_FUNCTION_ENTRY ();
 
 
 for (BitsSet = 0; BitField; BitsSet++)
 {
 /* Zero the least significant bit that is set */
 
        BitField &= (UINT16) (BitField - 1);
 }
 
 return (BitsSet);
 }
 
 (I ran into this accidentally because a kernel built on my system failed
 to boot in older qemu because the kernel paniced with an illegal instruction
 fault in this function.)
 
 There's no way we can preemptively locate every bit of C that clang might
 decide to replace with popcount and fix them to employ a workaround.  I think
 the only viable solution is to use -mno-popcount on all compilers that aren't
 known to be fixed.
Both you and Bruce and Jilles state this, but I am unable to understand the
argument about the compiler.  Can somebody explain this to me, please ?

Compiler cannot generate popcntq instructions unless it is directed to
generate code not for amd64, but for some specific microarchitecture.
Any use of bitcount in generic kernel or userspace is going to be a
SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of
the FreeBSD compilation environment.

And even then, compiler must issue cpuid to fetch the feature mask, which
arguably compiler cannot do due to the serialization instruction cost.
Compiler could generate the cpuid call in the init code, but
this is impossible for freestanding environment, since compiler does not
know where to put init code.

Only explicit uses of popcnt in the asm, with the CPU feature check around,
are going to actually use the instruction in the generic code.  And there,
we may apply the workaround like I did for pmap.
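
As a concrete illustration of that dispatch, a hypothetical userland
variant would have to cache the cpuid probe itself.  A sketch, assuming
the gcc/clang <cpuid.h> helpers (__get_cpuid() and bit_POPCNT); nothing
here is from the patch:

#include <cpuid.h>	/* __get_cpuid(), bit_POPCNT (gcc/clang) */
#include <stdint.h>

static int have_popcnt = -1;	/* -1: not probed yet; not MP-safe */

static int
popcount64_dyn(uint64_t x)
{
	unsigned int eax, ebx, ecx, edx;
	uint64_t result;

	if (have_popcnt == -1) {
		/*
		 * The one-time init a freestanding compiler has nowhere
		 * to emit; in the kernel, cpu_feature2 plays this role.
		 */
		__get_cpuid(1, &eax, &ebx, &ecx, &edx);
		have_popcnt = (ecx & bit_POPCNT) != 0;
	}
	if (have_popcnt) {
		/* xorl first: break the false dependency (HSD146/BDM85). */
		__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
		    : "=r" (result) : "rm" (x));
		return (result);
	}
	return (__builtin_popcountll(x));	/* SWAR unless -mpopcnt */
}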

 
 In regards to whether or not this should be a dynamic test instead, it seems a
 bit much to require a dynamic check for code that has to work in both userland
 and the kernel, and I'm not sure if something like CPU_COUNT() would actually
 benefit from that versus just always using the software solution instead.

I agree.



Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread Konstantin Belousov
On Sat, Mar 21, 2015 at 09:33:22PM +1100, Bruce Evans wrote:
 On Sat, 21 Mar 2015, Konstantin Belousov wrote:
 How does the unconditional use of popcntq (in inline asm that is called
 from amd64 pmap) even work?
It is not unconditional.  The pmap code does the following:

	if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
		free = popcnt_pc_map_elem(pc->pc_map[0]);
		...
	} else {
		free = popcntq(pc->pc_map[0]);
		...
	}

 
 jhb's cleanups apparently missed this inline asm, while previous work
 should not have added it.  This inline asm should never have existed,
 since compilers can generate the same code with less-incorrect
 configuration.  Correct configuration would be messy.  jhb's cleanups
 depend on a compiler predefine (__POPCNT__) to determine if the
 instruction can be used.  But this definition is static and is usually
 wrong, since compilers default to plain amd64 which doesn't have the
 instruction.  Dynamic configuration would have to use cpuid just to
 determine if the instruction exists.
Exactly, this is what supposed to be done, and is done in the pmap.
It is fine for somebody targeting specific machine to change -march,
but it cannot be used to optimize the generic kernel which is shipped.

 
  Jilles noted that gcc head and 4.9.2 already provides a workaround by
  xoring the dst register.  I have some patch for amd64 pmap, see the end
  of the message.
 
 I thought that that couldn't work since it uses the instruction
 unconditionally.  But the old code already does that.
 
  IIRC, the patch never uses asm, but intentionally uses the popcount
  builtin to avoid complications.
 
 popcntq() in amd64/include/cpufunc.h does use asm, but is missing the
 complications needed for when the instruction is not available.
 All callers (just 3 in pmap) seem to be broken, since they are also
 missing the complications.
See above.

  Still, it is weird to provide functions from the sys/types.h namespace,
  and even more weird to provide such special-purpose function.
 
 Not in the sys/types.h, but in the implementation namespace :-).  Since
 no one should depend on the implementation details, we can move the
 details to a better place when one is known.  Better places would be
 something like libkern for userland to hold a large collection of
 functions, and a Standard place for single functions.  I expect the
 Standard place will be broken as designed for compatibility.  Probably
 strings.h alongside ffs().
strings.h sounds less surprising than sys/types.h

 
 The special (inline/asm) functions are in sys/types.h on amd64 are
 currently limited to just 3 bswap functions and some new popcnt functions.


Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread Bruce Evans

On Sat, 21 Mar 2015, Konstantin Belousov wrote:


On Sat, Mar 21, 2015 at 09:42:51AM +1100, Bruce Evans wrote:

On Fri, 20 Mar 2015, Konstantin Belousov wrote:


On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:

Author: jhb
Date: Fri Mar 20 10:27:06 2015
New Revision: 280279
URL: https://svnweb.freebsd.org/changeset/base/280279

Log:
  Expand the bitcount* API to support 64-bit integers, plain ints and longs
  and create a hidden API that can be used in other system headers without
  adding namespace pollution.
  - If the POPCNT instruction is enabled at compile time, use
__builtin_popcount*() to implement __bitcount*(), otherwise fall back
to software implementations.



Are you aware of the Haswell errata HSD146 ?  I see the described behaviour


I wasn't.


on machines back to SandyBridge, but not on Nehalems.
HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non-dependent instructions have executed.


If it only affects performance, then it is up to the compiler to fix it.

It affects performance on some cpu models.  It is too wrong for compiler
to issue cpuid before popcnt.  Always issuing xorl before popcnt, as
it is currently done by recent gcc, but not clang, is better, but still
I think it is up to the code author to decide.


How does the unconditional use of popcntq (in inline asm that is called
from amd64 pmap) even work?

jhb's cleanups apparently missed this inline asm, while previous work
should not have added it.  This inline asm should never have existed,
since compilers can generate the same code with less-incorrect
configuration.  Correct configuration would be messy.  jhb's cleanups
depend on a compiler predefine (__POPCNT__) to determine if the
instruction can be used.  But this definition is static and is usually
wrong, since compilers default to plain amd64 which doesn't have the
instruction.  Dynamic configuration would have to use cpuid just to
determine if the instruction exists.


Jilles noted that gcc head and 4.9.2 already provides a workaround by
xoring the dst register.  I have some patch for amd64 pmap, see the end
of the message.


I thought that that couldn't work since it uses the instruction
unconditionally.  But the old code already does that.


IIRC, the patch never uses asm, but intentionally uses the popcount
builtin to avoid complications.


popcntq() in amd64/include/cpufunc.h does use asm, but is missing the
complications needed for when the instruction is not available.
All callers (just 3 in pmap) seem to be broken, since they are also
missing the complications.

There is also a naming problem.  The asm API is named popcntq like the
builtins, but the APIs in the cleanup are named *bitcount*.  We couldn't
decide whether the cleanup should rename them to match the builtins.


  - Use the existing bitcount16() and bitcount32() from sys/systm.h to
implement the non-POPCNT __bitcount16() and __bitcount32() in
sys/types.h.

Why is it in sys/types.h ?


To make it easier to use, while minimizing namespace pollution and
inefficiencies.  Like the functions used to implement ntohl(), except
the implementation is MI so it doesn't need to be in machine.
(The functions used to implement ntohl() are in machine/endian.h.
sys/types.h always includes that, so it makes little difference to
pollution and inefficiency that the implementation is not more directly
in machine/_types.h.)  bitcount is simpler and not burdened by
compatibility, so it doesn't need a separate header.


Still, it is weird to provide functions from the sys/types.h namespace,
and even more weird to provide such special-purpose function.


Not in the sys/types.h, but in the implementation namespace :-).  Since
no one should depend on the implementation details, we can move the
details to a better place when one is known.  Better places would be
something like libkern for userland to hold a large collection of
functions, and a Standard place for single functions.  I expect the
Standard place will be broken as designed for compatibility.  Probably
strings.h alongside ffs().

The special (inline/asm) functions are in sys/types.h on amd64 are
currently limited to just 3 bswap functions and some new popcnt functions.

Bruce


Re: svn commit: r280279 - head/sys/sys

2015-03-21 Thread Bruce Evans

On Sat, 21 Mar 2015, John Baldwin wrote:


On 3/21/15 12:35 PM, Konstantin Belousov wrote:

On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:

On 3/20/15 9:02 AM, Konstantin Belousov wrote:

On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:

Author: jhb
Date: Fri Mar 20 10:27:06 2015
New Revision: 280279
URL: https://svnweb.freebsd.org/changeset/base/280279

Log:
  Expand the bitcount* API to support 64-bit integers, plain ints and longs
  and create a hidden API that can be used in other system headers without
  adding namespace pollution.
  - If the POPCNT instruction is enabled at compile time, use
__builtin_popcount*() to implement __bitcount*(), otherwise fall back
to software implementations.

Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
on machines back to SandyBridge, but not on Nehalems.
HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non-dependent instructions have executed.

Jilles noted that gcc head and 4.9.2 already provides a workaround by
xoring the dst register.  I have some patch for amd64 pmap, see the end
of the message.


No, I was not aware, but I think it's hard to fix this anywhere but the
compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
uses POPCOUNT for this function from ACPI-CA:

static UINT8
AcpiRsCountSetBits (
UINT16  BitField)
{
UINT8   BitsSet;


ACPI_FUNCTION_ENTRY ();


for (BitsSet = 0; BitField; BitsSet++)
{
/* Zero the least significant bit that is set */

        BitField &= (UINT16) (BitField - 1);
}

return (BitsSet);
}

(I ran into this accidentally because a kernel built on my system failed
to boot in older qemu because the kernel paniced with an illegal instruction
fault in this function.)


Does it do the same for the similar home made popcount in pmap?:

X /*
X  * Returns the number of one bits within the given PV chunk map element.
X  */
X static int
X popcnt_pc_map_elem(uint64_t elem)
X {
X 	int count;
X 
X 	/*
X 	 * This simple method of counting the one bits performs well because
X 	 * the given element typically contains more zero bits than one bits.
X 	 */
X 	count = 0;
X 	for (; elem != 0; elem &= elem - 1)
X 		count++;
X 	return (count);
X }

This is actually the same method as in ACPI.  It is just missing mounds
of style bugs (mostly Windows spellings; also a bogus cast to UINT16;
the cast has no effect).

Perhaps this appeared to perform well not for the reason stated in its
comment, but because it was tested with excessive CFLAGS, and the compiler
actually reduced it to popcntq, and the test machine didn't have the
bug.

In a game program that I wrote, efficiency of popcount() was actually
important since it was used in inner loops.  I used lookup tables for
popcount() and a couple of application-specific functions and even for
fls().  This seemed to be better than the existing bsf instruction,
so it can't be bad for popcount().  The lookup was limited to bitsets
of length 13 in the low bits of an int, so the table sizes were only
2**13 = 8192.  Bitsets of length 64 would need multiple steps.  However,
if the above performs well without compiler optimizations, then it must
be because usually only the lowest few bits are set.  Then it would
perform even better with even smaller (8-bit) tables combined with an
early exit:

	count = 0;
	for (; elem != 0; elem >>= 8)
		count += lookup[elem & 0xff];
	return (count);
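
Fleshed out, with the table filled at startup (a sketch; popcnt8 and
popcnt8_init are made-up names, and a static initializer would do
equally well):

#include <stdint.h>

static uint8_t popcnt8[256];

static void
popcnt8_init(void)
{
	int i;

	/* popcnt8[i] = popcnt8[i / 2] + (low bit of i); popcnt8[0] is 0. */
	for (i = 1; i < 256; i++)
		popcnt8[i] = popcnt8[i >> 1] + (i & 1);
}

static int
popcount64_table(uint64_t elem)
{
	int count;

	count = 0;
	for (; elem != 0; elem >>= 8)	/* early exit once high bits clear */
		count += popcnt8[elem & 0xff];
	return (count);
}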

Many variations on this are possible.  E.g., use fls() to decide where to
start; or avoid all branches, and add up the results in parallel.  For
bitcount32, the latter is:

bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32);
bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16);
bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8);
bitcount8(x) := lookup[x];
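
Spelled out in C (illustrative; reuses the popcnt8[] table from the
sketch above, with tbl_ prefixes to avoid the real bitcount*() names):

static __inline int
tbl_bitcount8(uint8_t x)
{
	return (popcnt8[x]);
}

static __inline int
tbl_bitcount16(uint16_t x)
{
	return (tbl_bitcount8(x & 0xff) + tbl_bitcount8(x >> 8));
}

static __inline int
tbl_bitcount32(uint32_t x)
{
	return (tbl_bitcount16(x & 0xffff) + tbl_bitcount16(x >> 16));
}

static __inline int
tbl_bitcount64(uint64_t x)
{
	return (tbl_bitcount32(x & 0xffffffff) + tbl_bitcount32(x >> 32));
}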

Compilers won't be able to optimize the lookup methods.  They might be
able to convert the current bitcount*() to popcnt*.  Last time I looked,
clang but not gcc converted very complicated expressions for bswap*()
to bswap* instructions.


There's no way we can preemptively locate every bit of C that clang might
decide to replace with popcount and fix them to employ a workaround.  I think
the only viable solution is to use -mno-popcount on all compilers that aren't
known to be fixed.


Why bother?  It is only a short-lived and small(?) efficiency bug.


Both you and Bruce and Jilles state this, but I am unable to understand the
argument about the compiler.  Can somebody explain this to me, please ?


It is also to avoid complications for short-lived optimizations.  The
3 uses of popcntq() in amd64 pmap cost:
- 9 lines of inline asm.
- 19 lines for the C version
- 9 lines instead of 3 for 

svn commit: r280279 - head/sys/sys

2015-03-20 Thread John Baldwin
Author: jhb
Date: Fri Mar 20 10:27:06 2015
New Revision: 280279
URL: https://svnweb.freebsd.org/changeset/base/280279

Log:
  Expand the bitcount* API to support 64-bit integers, plain ints and longs
  and create a hidden API that can be used in other system headers without
  adding namespace pollution.
  - If the POPCNT instruction is enabled at compile time, use
__builtin_popcount*() to implement __bitcount*(), otherwise fall back
to software implementations.
  - Use the existing bitcount16() and bitcount32() from sys/systm.h to
implement the non-POPCNT __bitcount16() and __bitcount32() in
sys/types.h.
  - For the non-POPCNT __bitcount64(), use a similar SWAR method on 64-bit
systems.  For 32-bit systems, use two __bitcount32() operations on the
two halves.
  - Use __bitcount32() to provide a __bitcount() that operates on plain ints.
  - Use either __bitcount32() or __bitcount64() to provide a
__bitcountl() that operates on longs.
  - Add public bitcount*() wrappers for __bitcount*() for use in the kernel
in sys/libkern.h.
  - Use __bitcountl() instead of __builtin_popcountl() in BIT_COUNT().
  
  Discussed with:   bde

Modified:
  head/sys/sys/bitset.h
  head/sys/sys/libkern.h
  head/sys/sys/systm.h
  head/sys/sys/types.h

Modified: head/sys/sys/bitset.h
===================================================================
--- head/sys/sys/bitset.h   Fri Mar 20 10:15:34 2015(r280278)
+++ head/sys/sys/bitset.h   Fri Mar 20 10:27:06 2015(r280279)
@@ -182,7 +182,7 @@
 									\
 	__count = 0;							\
 	for (__i = 0; __i < __bitset_words((_s)); __i++)		\
-		__count += __builtin_popcountl((p)->__bits[__i]);	\
+		__count += __bitcountl((p)->__bits[__i]);		\
 	__count;							\
 })


Modified: head/sys/sys/libkern.h
===================================================================
--- head/sys/sys/libkern.h  Fri Mar 20 10:15:34 2015(r280278)
+++ head/sys/sys/libkern.h  Fri Mar 20 10:27:06 2015(r280279)
@@ -98,6 +98,11 @@ int	 flsl(long);
 #ifndef	HAVE_INLINE_FLSLL
 int	 flsll(long long);
 #endif
+#define	bitcount64(x)	__bitcount64((uint64_t)(x))
+#define	bitcount32(x)	__bitcount32((uint32_t)(x))
+#define	bitcount16(x)	__bitcount16((uint16_t)(x))
+#define	bitcountl(x)	__bitcountl((u_long)(x))
+#define	bitcount(x)	__bitcount((u_int)(x))
 
 int fnmatch(const char *, const char *, int);
 int locc(int, char *, u_int);

Modified: head/sys/sys/systm.h
===================================================================
--- head/sys/sys/systm.h	Fri Mar 20 10:15:34 2015	(r280278)
+++ head/sys/sys/systm.h	Fri Mar 20 10:27:06 2015	(r280279)
@@ -428,33 +428,6 @@ int alloc_unr_specific(struct unrhdr *uh
 int alloc_unrl(struct unrhdr *uh);
 void free_unr(struct unrhdr *uh, u_int item);
 
-/*
- * Population count algorithm using SWAR approach
- * - SIMD Within A Register.
- */
-static __inline uint32_t
-bitcount32(uint32_t x)
-{
-
-	x = (x & 0x55555555) + ((x & 0xaaaaaaaa) >> 1);
-	x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2);
-	x = (x + (x >> 4)) & 0x0f0f0f0f;
-	x = (x + (x >> 8));
-	x = (x + (x >> 16)) & 0x000000ff;
-	return (x);
-}
-
-static __inline uint16_t
-bitcount16(uint32_t x)
-{
-
-	x = (x & 0x5555) + ((x & 0xaaaa) >> 1);
-	x = (x & 0x3333) + ((x & 0xcccc) >> 2);
-	x = (x + (x >> 4)) & 0x0f0f;
-	x = (x + (x >> 8)) & 0x00ff;
-	return (x);
-}
-
 void   intr_prof_stack_use(struct thread *td, struct trapframe *frame);
 
 #endif /* !_SYS_SYSTM_H_ */

Modified: head/sys/sys/types.h
===================================================================
--- head/sys/sys/types.h	Fri Mar 20 10:15:34 2015	(r280278)
+++ head/sys/sys/types.h	Fri Mar 20 10:27:06 2015	(r280279)
@@ -294,6 +294,68 @@ typedef	_Bool	bool;
 
 #include <sys/select.h>
 
+#ifdef __POPCNT__
+#define	__bitcount64(x)	__builtin_popcountll((__uint64_t)(x))
+#define	__bitcount32(x)	__builtin_popcount((__uint32_t)(x))
+#define	__bitcount16(x)	__builtin_popcount((__uint16_t)(x))
+#define	__bitcountl(x)	__builtin_popcountl((unsigned long)(x))
+#define	__bitcount(x)	__builtin_popcount((unsigned int)(x))
+#else
+/*
+ * Population count algorithm using SWAR approach
+ * - SIMD Within A Register.
+ */
+static __inline __uint16_t
+__bitcount16(__uint16_t _x)
+{
+
+	_x = (_x & 0x5555) + ((_x & 0xaaaa) >> 1);
+	_x = (_x & 0x3333) + ((_x & 0xcccc) >> 2);
+	_x = (_x + (_x >> 4)) & 0x0f0f;
+	_x = (_x + (_x >> 8)) & 0x00ff;
+	return (_x);
+}
+
+static __inline __uint32_t
+__bitcount32(__uint32_t _x)
+{
+

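(The hunk is truncated here in the archive.  Per the log message, the
non-POPCNT __bitcount32() finishes with the same SWAR steps as the old
bitcount32(), and on 32-bit systems __bitcount64() reduces to two
32-bit counts.  A sketch modeled on that description, not the committed
text:)

static __inline __uint64_t
__bitcount64(__uint64_t _x)
{

	/* 32-bit systems: count each half with __bitcount32(). */
	return (__bitcount32(_x) + __bitcount32(_x >> 32));
}
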
Re: svn commit: r280279 - head/sys/sys

2015-03-20 Thread Konstantin Belousov
On Sat, Mar 21, 2015 at 09:42:51AM +1100, Bruce Evans wrote:
 On Fri, 20 Mar 2015, Konstantin Belousov wrote:
 
  On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
  Author: jhb
  Date: Fri Mar 20 10:27:06 2015
  New Revision: 280279
  URL: https://svnweb.freebsd.org/changeset/base/280279
 
  Log:
Expand the bitcount* API to support 64-bit integers, plain ints and longs
and create a hidden API that can be used in other system headers without
adding namespace pollution.
- If the POPCNT instruction is enabled at compile time, use
  __builtin_popcount*() to implement __bitcount*(), otherwise fall back
  to software implementations.
 
  Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
 
 I wasn't.
 
  on machines back to SandyBridge, but not on Nehalems.
  HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
  Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
  delayed until previous non-dependent instructions have executed.
 
 If it only affects performance, then it is up to the compiler to fix it.
It affects performance on some cpu models.  It is too wrong for compiler
to issue cpuid before popcnt.  Always issuing xorl before popcnt, as
it is currently done by recent gcc, but not clang, is better, but still
I think it is up to the code author to decide.

 
  Jilles noted that gcc head and 4.9.2 already provides a workaround by
  xoring the dst register.  I have some patch for amd64 pmap, see the end
  of the message.
 
 IIRC, the patch never uses asm, but intentionally uses the popcount
 builtin to avoid complications.
 
- Use the existing bitcount16() and bitcount32() from sys/systm.h to
  implement the non-POPCNT __bitcount16() and __bitcount32() in
  sys/types.h.
  Why is it in sys/types.h ?
 
 To make it easier to use, while minimizing namespace pollution and
 inefficiencies.  Like the functions used to implement ntohl(), except
 the implementation is MI so it doesn't need to be in machine.
 (The functions used to implement ntohl() are in machine/endian.h.
 sys/types.h always includes that, so it makes little difference to
 pollution and inefficiency that the implementation is not more directly
 in machine/_types.h.)  bitcount is simpler and not burdened by
 compatibility, so it doesn't need a separate header.

Still, it is weird to provide functions from the sys/types.h namespace,
and even more weird to provide such special-purpose function.


Re: svn commit: r280279 - head/sys/sys

2015-03-20 Thread Konstantin Belousov
On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:
 Author: jhb
 Date: Fri Mar 20 10:27:06 2015
 New Revision: 280279
 URL: https://svnweb.freebsd.org/changeset/base/280279
 
 Log:
   Expand the bitcount* API to support 64-bit integers, plain ints and longs
   and create a hidden API that can be used in other system headers without
   adding namespace pollution.
   - If the POPCNT instruction is enabled at compile time, use
 __builtin_popcount*() to implement __bitcount*(), otherwise fall back
 to software implementations.
Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
on machines back to SandyBridge, but not on Nehalems.
HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non-dependent instructions have executed.

Jilles noted that gcc head and 4.9.2 already provides a workaround by
xoring the dst register.  I have some patch for amd64 pmap, see the end
of the message.

   - Use the existing bitcount16() and bitcount32() from sys/systm.h to
 implement the non-POPCNT __bitcount16() and __bitcount32() in
 sys/types.h.
Why is it in sys/types.h ?

   - For the non-POPCNT __bitcount64(), use a similar SWAR method on 64-bit
 systems.  For 32-bit systems, use two __bitcount32() operations on the
 two halves.
   - Use __bitcount32() to provide a __bitcount() that operates on plain ints.
   - Use either __bitcount32() or __bitcount64() to provide a
 __bitcountl() that operates on longs.
   - Add public bitcount*() wrappers for __bitcount*() for use in the kernel
 in sys/libkern.h.
  - Use __bitcountl() instead of __builtin_popcountl() in BIT_COUNT().
   
   Discussed with: bde

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..f6fbc33 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -413,6 +417,7 @@ static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
 static int	popcnt_pc_map_elem(uint64_t elem);
+static int	popcnt_pc_map_elem_pq(uint64_t elem);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void	reserve_pv_entries(pmap_t pmap, int needed,
 		    struct rwlock **lockp);
@@ -2997,6 +3020,29 @@ popcnt_pc_map_elem(uint64_t elem)
 }
 
 /*
+ * The erratas for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for erratas are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
+ */
+static int
+popcnt_pc_map_elem_pq(uint64_t elem)
+{
+	u_long result;
+
+	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+	    : "=r" (result) : "rm" (elem));
+	return (result);
+}
+
+/*
  * Ensure that the number of spare PV entries in the specified pmap meets or
  * exceeds the given count, needed.
  *
@@ -3029,9 +3075,9 @@ retry:
 			free += popcnt_pc_map_elem(pc->pc_map[1]);
 			free += popcnt_pc_map_elem(pc->pc_map[2]);
 		} else {
-			free = popcntq(pc->pc_map[0]);
-			free += popcntq(pc->pc_map[1]);
-			free += popcntq(pc->pc_map[2]);
+			free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+			free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
 		}
 		if (free == 0)
 			break;


Re: svn commit: r280279 - head/sys/sys

2015-03-20 Thread Bruce Evans

On Fri, 20 Mar 2015, Konstantin Belousov wrote:


On Fri, Mar 20, 2015 at 10:27:06AM +, John Baldwin wrote:

Author: jhb
Date: Fri Mar 20 10:27:06 2015
New Revision: 280279
URL: https://svnweb.freebsd.org/changeset/base/280279

Log:
  Expand the bitcount* API to support 64-bit integers, plain ints and longs
  and create a hidden API that can be used in other system headers without
  adding namespace pollution.
  - If the POPCNT instruction is enabled at compile time, use
__builtin_popcount*() to implement __bitcount*(), otherwise fall back
to software implementations.



Are you aware of the Haswell errata HSD146 ?  I see the described behaviour


I wasn't.


on machines back to SandyBridge, but not on Nehalems.
HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non-dependent instructions have executed.


If it only affects performance, then it is up to the compiler to fix it.


Jilles noted that gcc head and 4.9.2 already provides a workaround by
xoring the dst register.  I have some patch for amd64 pmap, see the end
of the message.


IIRC, the patch never uses asm, but intentionally uses the popcount
builtin to avoid complications.


  - Use the existing bitcount16() and bitcount32() from sys/systm.h to
implement the non-POPCNT __bitcount16() and __bitcount32() in
sys/types.h.

Why is it in sys/types.h ?


To make it easier to use, while minimizing namespace pollution and
inefficiencies.  Like the functions used to implement ntohl(), except
the implementation is MI so it doesn't need to be in machine.
(The functions used to implement ntohl() are in machine/endian.h.
sys/types.h always includes that, so it makes little difference to
pollution and inefficiency that the implementation is not more directly
in machine/_types.h.)  bitcount is simpler and not burdened by
compatibility, so it doesn't need a separate header.

Bruce