Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
 On Fri, 7 Feb 2014, Gabriel Paubert wrote:
 
  Hi Stephen,
  
  On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
   Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM:
   
From: Gabriel Paubert paub...@iram.es
To: Stephen N Chivers schiv...@csc.com.au
Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au
Date: 02/06/2014 07:26 PM
Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
 
 
 mask = 0;
 if (FM & (1 << 0))
 	mask |= 0x0000000f;
 if (FM & (1 << 1))
 	mask |= 0x000000f0;
 if (FM & (1 << 2))
 	mask |= 0x00000f00;
 if (FM & (1 << 3))
 	mask |= 0x0000f000;
 if (FM & (1 << 4))
 	mask |= 0x000f0000;
 if (FM & (1 << 5))
 	mask |= 0x00f00000;
 if (FM & (1 << 6))
 	mask |= 0x0f000000;
 if (FM & (1 << 7))
 	mask |= 0x90000000;
 
 With the above mask computation I get consistent results for 
 both the MPC8548 and MPC7410 boards.
 
 Am I missing something subtle?

No, I think you are correct. This said, this code can probably be optimized
to eliminate a lot of the conditional branches. I think that:
 
 
 If the compiler is allowed to generate isel instructions, it will not
 use conditional branches for this code. (Ignore the andi operand values;
 this is from an old compile.)
 
 c0037c2c <mtfsf>:
 c0037c2c:   2c 03 00 00 cmpwi   r3,0
 c0037c30:   41 82 01 1c beq-    c0037d4c <mtfsf+0x120>
 c0037c34:   2f 83 00 ff cmpwi   cr7,r3,255
 c0037c38:   41 9e 01 28 beq-    cr7,c0037d60 <mtfsf+0x134>
 c0037c3c:   70 66 00 01 andi.   r6,r3,1
 c0037c40:   3d 00 90 00 lis     r8,-28672
 c0037c44:   7d 20 40 9e iseleq  r9,r0,r8
 c0037c48:   70 6a 00 02 andi.   r10,r3,2
 c0037c4c:   65 28 0f 00 oris    r8,r9,3840
 c0037c50:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c54:   70 66 00 04 andi.   r6,r3,4
 c0037c58:   65 28 00 f0 oris    r8,r9,240
 c0037c5c:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c60:   70 6a 00 08 andi.   r10,r3,8
 c0037c64:   65 28 00 0f oris    r8,r9,15
 c0037c68:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c6c:   70 66 00 10 andi.   r6,r3,16
 c0037c70:   61 28 f0 00 ori     r8,r9,61440
 c0037c74:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c78:   70 6a 00 20 andi.   r10,r3,32
 c0037c7c:   61 28 0f 00 ori     r8,r9,3840
 c0037c80:   54 6a cf fe rlwinm  r10,r3,25,31,31
 c0037c84:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c88:   2f 8a 00 00 cmpwi   cr7,r10,0
 c0037c8c:   70 66 00 40 andi.   r6,r3,64
 c0037c90:   61 28 00 f0 ori     r8,r9,240
 c0037c94:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c98:   41 9e 00 08 beq-    cr7,c0037ca0 <mtfsf+0x74>
 c0037c9c:   61 29 00 0f ori     r9,r9,15
   ...
  
 However, your other solutions are better.
 
 

mask = (FM & 1);
mask |= (FM << 3) & 0x10;
mask |= (FM << 6) & 0x100;
mask |= (FM << 9) & 0x1000;
mask |= (FM << 12) & 0x10000;
mask |= (FM << 15) & 0x100000;
mask |= (FM << 18) & 0x1000000;
mask |= (FM << 21) & 0x10000000;
mask *= 15;

should do the job, in less code space and without a single branch.

Each one of the mask |= lines should be translated into an
rlwinm instruction followed by an or. Actually it should be possible
to transform each of these lines into a single rlwimi instruction
but I don't know how to coerce gcc to reach this level of optimization.

Another way of optimizing this could be:

mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
mask *= 15;
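
A quick way to double-check the bit spreading: both methods must expand
each FM bit i into a full nibble (0xf << 4*i). Here is a user-space
sketch that verifies this exhaustively (the masking of the FEX/VX bits
is a separate step, see below in the thread):

	#include <stdint.h>
	#include <stdio.h>

	static uint32_t method1(uint32_t FM)
	{
		uint32_t mask = (FM & 1);
		mask |= (FM << 3) & 0x10;
		mask |= (FM << 6) & 0x100;
		mask |= (FM << 9) & 0x1000;
		mask |= (FM << 12) & 0x10000;
		mask |= (FM << 15) & 0x100000;
		mask |= (FM << 18) & 0x1000000;
		mask |= (FM << 21) & 0x10000000;
		return mask * 15;	/* *15 turns each set bit into 0xf */
	}

	static uint32_t method2(uint32_t FM)
	{
		uint32_t mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
		mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
		mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
		return mask * 15;
	}

	int main(void)
	{
		for (uint32_t FM = 0; FM < 256; FM++) {
			uint32_t ref = 0;
			for (int i = 0; i < 8; i++)	/* reference: bit i -> nibble i */
				if (FM & (1u << i))
					ref |= 0xfu << (4 * i);
			if (method1(FM) != ref || method2(FM) != ref) {
				printf("mismatch at FM=%02x\n", FM);
				return 1;
			}
		}
		printf("both methods match for all 256 FM values\n");
		return 0;
	}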

 
 
  Ok, I finally edited my sources and test compiled the suggestions
  I gave. I'd say that method2 is the best overall indeed. You can
  actually save one more instruction by setting mask to all ones in 
  the case FM=0xff, but that's about all in this area.
 
 
 My measurements show method1 to be smaller and faster than method2 due 
 to the number of instructions needed to generate the constant masks in 
 method2, but it may depend upon your compiler and hardware.  Both are 
 faster than the original with isel.

Ok, if you have measured that method1 is faster than method2, let us go for it.
I believe method2 would be faster if you had a large out-of-order execution
window, because more parallelism can be extracted from it, but this is probably
only true for high end cores, which do not need FPU emulation in the first place.

RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread David Laight
  However, your other solutions are better.
 
 

 mask = (FM & 1);
 mask |= (FM << 3) & 0x10;
 mask |= (FM << 6) & 0x100;
 mask |= (FM << 9) & 0x1000;
 mask |= (FM << 12) & 0x10000;
 mask |= (FM << 15) & 0x100000;
 mask |= (FM << 18) & 0x1000000;
 mask |= (FM << 21) & 0x10000000;
 mask *= 15;

 should do the job, in less code space and without a single branch.
...
 Another way of optimizing this could be:

 mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
 mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
 mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
 mask *= 15;
...
 Ok, if you have measured that method1 is faster than method2, let us go for it.
 I believe method2 would be faster if you had a large out-of-order execution
 window, because more parallelism can be extracted from it, but this is probably
 only true for high end cores, which do not need FPU emulation in the first place.

FWIW the second has a long dependency chain on 'mask', whereas the first can
execute the shift/and operations in any order and then merge the results.
So on most superscalar CPUs, or one with result delays for arithmetic, the
first is likely to be faster.

David





Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Mon, Feb 10, 2014 at 11:17:38AM +, David Laight wrote:
   However, your other solutions are better.
  
  
 
  mask = (FM & 1);
  mask |= (FM << 3) & 0x10;
  mask |= (FM << 6) & 0x100;
  mask |= (FM << 9) & 0x1000;
  mask |= (FM << 12) & 0x10000;
  mask |= (FM << 15) & 0x100000;
  mask |= (FM << 18) & 0x1000000;
  mask |= (FM << 21) & 0x10000000;
  mask *= 15;
 
  should do the job, in less code space and without a single branch.
 ...
  Another way of optimizing this could be:
 
  mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
  mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
  mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
  mask *= 15;
 ...
  Ok, if you have measured that method1 is faster than method2, let us go for it.
  I believe method2 would be faster if you had a large out-of-order execution
  window, because more parallelism can be extracted from it, but this is probably
  only true for high end cores, which do not need FPU emulation in the first place.
 
 FWIW the second has a long dependency chain on 'mask', whereas the first can
 execute the shift/and operations in any order and then merge the results.
 So on most superscalar CPUs, or one with result delays for arithmetic, the
 first is likely to be faster.

I disagree, perhaps mostly because the compiler is not clever enough, but right
now the code for solution 1 is as follows (actually I have rewritten the code
and it reads:

	mask = (FM & 1)
	     | ((FM << 3) & 0x10)
	     | ((FM << 6) & 0x100)
	     | ((FM << 9) & 0x1000)
	     | ((FM << 12) & 0x10000)
	     | ((FM << 15) & 0x100000)
	     | ((FM << 18) & 0x1000000)
	     | ((FM << 21) & 0x10000000);
to avoid sequence points in case they hamper the compiler)

and the output is:

rlwinm 10,3,3,27,27  # D.11621, FM,,
rlwinm 9,3,6,23,23   # D.11621, FM,,
or 9,10,9#, D.11621, D.11621, D.11621
rlwinm 10,3,0,31,31  # D.11621, FM,
or 9,9,10#, D.11621, D.11621, D.11621
rlwinm 10,3,9,19,19  # D.11621, FM,,
or 9,9,10#, D.11621, D.11621, D.11621
rlwinm 10,3,12,15,15 # D.11621, FM,,
or 9,9,10#, D.11621, D.11621, D.11621
rlwinm 10,3,15,11,11 # D.11621, FM,,
or 9,9,10#, D.11621, D.11621, D.11621
rlwinm 10,3,18,7,7   # D.11621, FM,,
or 9,9,10#, D.11621, D.11621, D.11621
rlwinm 3,3,21,3,3# D.11621, FM,,
or 9,9,3 #, mask, D.11621, D.11621
mulli 9,9,15 # mask, mask,

See that r9 is used 7 times as both input and output operand, plus
once as the output of an rlwinm. This gives a dependency length of at least 8.

In the other case (I've deleted the code) the dependency length
was significantly shorter. In any case that one is fewer instructions, 
which is good for occasional use. 

Gabriel


RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread David Laight
 I disagree, perhaps mostly because the compiler is not clever enough, but right
 now the code for solution 1 is as follows (actually I have rewritten the code
 and it reads:
 
 	mask = (FM & 1)
 	     | ((FM << 3) & 0x10)
 	     | ((FM << 6) & 0x100)
 	     | ((FM << 9) & 0x1000)
 	     | ((FM << 12) & 0x10000)
 	     | ((FM << 15) & 0x100000)
 	     | ((FM << 18) & 0x1000000)
 	     | ((FM << 21) & 0x10000000);
 to avoid sequence points in case they hamper the compiler)
 
 and the output is:
 
 rlwinm 10,3,3,27,27  # D.11621, FM,,
 rlwinm 9,3,6,23,23   # D.11621, FM,,
 or 9,10,9#, D.11621, D.11621, D.11621
 rlwinm 10,3,0,31,31  # D.11621, FM,
 or 9,9,10#, D.11621, D.11621, D.11621
 rlwinm 10,3,9,19,19  # D.11621, FM,,
 or 9,9,10#, D.11621, D.11621, D.11621
 rlwinm 10,3,12,15,15 # D.11621, FM,,
 or 9,9,10#, D.11621, D.11621, D.11621
 rlwinm 10,3,15,11,11 # D.11621, FM,,
 or 9,9,10#, D.11621, D.11621, D.11621
 rlwinm 10,3,18,7,7   # D.11621, FM,,
 or 9,9,10#, D.11621, D.11621, D.11621
 rlwinm 3,3,21,3,3# D.11621, FM,,
 or 9,9,3 #, mask, D.11621, D.11621
 mulli 9,9,15 # mask, mask,
 
 See that r9 is used 7 times as both input and output operand, plus
 once as the output of an rlwinm. This gives a dependency length of at least 8.
 
 In the other case (I've deleted the code) the dependency length
 was significantly shorter. In any case that one is fewer instructions,
 which is good for occasional use.

Hmmm... I hand-counted a dependency length of 8 for the other version.
Maybe there are some ppc instructions that reduce it.

Stupid compiler :-)
Trouble is, I bet that even if you code it as:
	mask1 = (FM & 1) | ((FM << 3) & 0x10);
	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |= mask2;
	mask3 |= mask4;
	mask = mask1 | mask3;
the compiler will 'optimise' it to the above before code generation.
If it doesn't, adding () to pair the | operations might be enough.
Then a new version of the compiler will change the behaviour again.

David





Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Mon, Feb 10, 2014 at 12:32:18PM +, David Laight wrote:
  I disagree, perhaps mostly because the compiler is not clever enough, but right
  now the code for solution 1 is as follows (actually I have rewritten the code
  and it reads:
  
  	mask = (FM & 1)
  	     | ((FM << 3) & 0x10)
  	     | ((FM << 6) & 0x100)
  	     | ((FM << 9) & 0x1000)
  	     | ((FM << 12) & 0x10000)
  	     | ((FM << 15) & 0x100000)
  	     | ((FM << 18) & 0x1000000)
  	     | ((FM << 21) & 0x10000000);
  to avoid sequence points in case they hamper the compiler)
  
  and the output is:
  
  rlwinm 10,3,3,27,27  # D.11621, FM,,
  rlwinm 9,3,6,23,23   # D.11621, FM,,
  or 9,10,9#, D.11621, D.11621, D.11621
  rlwinm 10,3,0,31,31  # D.11621, FM,
  or 9,9,10#, D.11621, D.11621, D.11621
  rlwinm 10,3,9,19,19  # D.11621, FM,,
  or 9,9,10#, D.11621, D.11621, D.11621
  rlwinm 10,3,12,15,15 # D.11621, FM,,
  or 9,9,10#, D.11621, D.11621, D.11621
  rlwinm 10,3,15,11,11 # D.11621, FM,,
  or 9,9,10#, D.11621, D.11621, D.11621
  rlwinm 10,3,18,7,7   # D.11621, FM,,
  or 9,9,10#, D.11621, D.11621, D.11621
  rlwinm 3,3,21,3,3# D.11621, FM,,
  or 9,9,3 #, mask, D.11621, D.11621
  mulli 9,9,15 # mask, mask,
  
  See that r9 is used 7 times as both input and output operand, plus
  once as the output of an rlwinm. This gives a dependency length of at least 8.
  
  In the other case (I've deleted the code) the dependency length
  was significantly shorter. In any case that one is fewer instructions,
  which is good for occasional use.
 
 Hmmm... I hand-counted a dependency length of 8 for the other version.
 Maybe there are some ppc instructions that reduce it.

Either I misread the generated code or I got somewhat less.

What helps for method1 is the rotate-and-mask instructions of PPC. Each
left-shift-and-mask becomes a single rlwinm.
 
 Stupid compiler :-)

Indeed. I've been trying to coerce it into generating rlwimi instructions
(in which case the whole building of the mask reduces to 8 assembly
instructions) and failed. It seems that the compiler lacks some patterns
that would directly map to rlwimi.
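
By hand, the 8-instruction version would look something like this (an
untested sketch, in the same operand style as the gcc listing above;
FM in r3, mask accumulated in r9):

	rlwinm 9,3,0,31,31	# mask  =  FM        & 1
	rlwimi 9,3,3,27,27	# mask |= (FM << 3)  & 0x10
	rlwimi 9,3,6,23,23	# mask |= (FM << 6)  & 0x100
	rlwimi 9,3,9,19,19	# mask |= (FM << 9)  & 0x1000
	rlwimi 9,3,12,15,15	# mask |= (FM << 12) & 0x10000
	rlwimi 9,3,15,11,11	# mask |= (FM << 15) & 0x100000
	rlwimi 9,3,18,7,7	# mask |= (FM << 18) & 0x1000000
	rlwimi 9,3,21,3,3	# mask |= (FM << 21) & 0x10000000
	mulli 9,9,15		# expand each bit into a nibble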

 Trouble is, I bet that even if you code it as:
 	mask1 = (FM & 1) | ((FM << 3) & 0x10);
 	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
 	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
 	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
 	mask1 |= mask2;
 	mask3 |= mask4;
 	mask = mask1 | mask3;
 the compiler will 'optimise' it to the above before code generation.

Indeed, that's what it does :-(

I believe that the current suggestion is good enough.

Gabriel


Re: [PATCH v2] powerpc ticket locks

2014-02-10 Thread Torsten Duwe
On Mon, Feb 10, 2014 at 02:10:23PM +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
   typedef struct {
  -   volatile unsigned int slock;
  -} arch_spinlock_t;
  +   union {
  +   __ticketpair_t head_tail;
  +   struct __raw_tickets {
  +#ifdef __BIG_ENDIAN__  /* The tail part should be in the MSBs */
  +   __ticket_t tail, head;
  +#else
  +   __ticket_t head, tail;
  +#endif
  +   } tickets;
  +   };
  +#if defined(CONFIG_PPC_SPLPAR)
  +   u32 holder;
  +#endif
  +} arch_spinlock_t __aligned(4);
 
 That's still broken with lockref (which we just merged).
 
 We must have the arch_spinlock_t and the ref in the same 64-bit word
 otherwise it will break.

Well, as far as I can see you'll just not be able to
USE_CMPXCHG_LOCKREF -- with the corresponding performance hit --
the code just falls back to the spinlock-based lockref path on pSeries.
What again was the intention of directed yield in the first place...?

 We can make it work in theory since the holder doesn't have to be
 accessed atomically, but the practicals are a complete mess ...
 lockref would essentially have to re-implement the holder handling
 of the spinlocks and use lower level ticket stuff.
 
 Unless you can find a sneaky trick ... :-(

What if I squeeze the bits a little?
4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
At the cost that unlock may become an ll/sc operation again.
I could think about a trick against that.
But alas, hw_cpu_id is 16 bit, which makes a lookup table necessary :-/

Doing another round of yields for lockrefs now doesn't
sound so bad any more.

Opinions, anyone?

Torsten



Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread James Yang
On Sun, 9 Feb 2014, Stephen N Chivers wrote:

 James Yang james.y...@freescale.com wrote on 02/08/2014 07:49:40 AM:
 
  From: James Yang james.y...@freescale.com
  To: Gabriel Paubert paub...@iram.es
  Cc: Stephen N Chivers schiv...@csc.com.au, Chris Proctor 
  cproc...@csc.com.au, linuxppc-dev@lists.ozlabs.org
  Date: 02/08/2014 07:49 AM
  Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
  
  On Fri, 7 Feb 2014, Gabriel Paubert wrote:
  
  Hi Stephen,
   
   On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM:

 From: Gabriel Paubert paub...@iram.es
 To: Stephen N Chivers schiv...@csc.com.au
 Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor 
 cproc...@csc.com.au
 Date: 02/06/2014 07:26 PM
 Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
 
 On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
  With the above mask computation I get consistent results for 
  both the MPC8548 and MPC7410 boards.
  
  Am I missing something subtle?
 
 No, I think you are correct. This said, this code can probably be optimized
 to eliminate a lot of the conditional branches. I think that:
  
  
  If the compiler is allowed to generate isel instructions, it will not
  use conditional branches for this code. (Ignore the andi operand values;
  this is from an old compile.)
  
 From limited research, the 440GP is a processor
 that doesn't implement the isel instruction and it does
 not implement floating point.
 
 The kernel emulates isel and so using that instruction
 for the 440GP would have a double trap penalty.

Are you writing about something outside the scope of this thread?  OP 
was using MPC8548 not a 440GP.  The compiler should not be using or 
targeting 8548 for a 440GP so having to emulate isel shouldn't be an 
issue because there wouldn't be any.  (The assembly listing I posted 
was generated by gcc targeting 8548.)  Anyway, I had measured the 
non-isel routines to be faster and that works without illop traps.

 
 Correct me if I am wrong, the isel instruction first appears
 in PowerPC ISA v2.04 around mid 2007. 


isel appeared in 2003 in the e500 (v1) core that is in the MPC8540.
The instruction is in Power ISA 2.03 (9/2006).



Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread James Yang
On Mon, 10 Feb 2014, Gabriel Paubert wrote:

 On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
  On Fri, 7 Feb 2014, Gabriel Paubert wrote:
  
 Hi Stephen,
   
   On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM:

 From: Gabriel Paubert paub...@iram.es
 To: Stephen N Chivers schiv...@csc.com.au
 Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au
 Date: 02/06/2014 07:26 PM
 Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
 
 On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:


  
  
 mask = 0;
 if (FM & (1 << 0))
 	mask |= 0x0000000f;
 if (FM & (1 << 1))
 	mask |= 0x000000f0;
 if (FM & (1 << 2))
 	mask |= 0x00000f00;
 if (FM & (1 << 3))
 	mask |= 0x0000f000;
 if (FM & (1 << 4))
 	mask |= 0x000f0000;
 if (FM & (1 << 5))
 	mask |= 0x00f00000;
 if (FM & (1 << 6))
 	mask |= 0x0f000000;
 if (FM & (1 << 7))
 	mask |= 0x90000000;
  
  With the above mask computation I get consistent results for 
  both the MPC8548 and MPC7410 boards.

 Ok, if you have measured that method1 is faster than method2, let us go for it.
 I believe method2 would be faster if you had a large out-of-order execution
 window, because more parallelism can be extracted from it, but this is probably
 only true for high end cores, which do not need FPU emulation in the first place.

Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
compiler generated.   

 
 The other place where we can optimize is the generation of FEX. Here is 
 my current patch:
 
 
 diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
 index dbce92e..b57b3fa8 100644
 --- a/arch/powerpc/math-emu/mtfsf.c
 +++ b/arch/powerpc/math-emu/mtfsf.c
 @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;
 
 -	if (FM == 0)
 +	if (likely(FM == 0xff))
 +		mask = 0xffffffff;
 +	else if (unlikely(FM == 0))
 		return 0;
 -
 -	if (FM == 0xff)
 -		mask = 0x9fffffff;
 	else {
 -		mask = 0;
 -		if (FM & (1 << 0))
 -			mask |= 0x90000000;
 -		if (FM & (1 << 1))
 -			mask |= 0x0f000000;
 -		if (FM & (1 << 2))
 -			mask |= 0x00f00000;
 -		if (FM & (1 << 3))
 -			mask |= 0x000f0000;
 -		if (FM & (1 << 4))
 -			mask |= 0x0000f000;
 -		if (FM & (1 << 5))
 -			mask |= 0x00000f00;
 -		if (FM & (1 << 6))
 -			mask |= 0x000000f0;
 -		if (FM & (1 << 7))
 -			mask |= 0x0000000f;
 +		mask = (FM & 1);
 +		mask |= (FM << 3) & 0x10;
 +		mask |= (FM << 6) & 0x100;
 +		mask |= (FM << 9) & 0x1000;
 +		mask |= (FM << 12) & 0x10000;
 +		mask |= (FM << 15) & 0x100000;
 +		mask |= (FM << 18) & 0x1000000;
 +		mask |= (FM << 21) & 0x10000000;
 +		mask *= 15;


Needs to also mask out bits 1 and 2; they aren't to be set from frB:

	mask &= 0x9FFFFFFF;




   }
  
 -	__FPU_FPSCR &= ~(mask);
 -	__FPU_FPSCR |= (frB[1] & mask);
 +	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
 +		~(FPSCR_VX | FPSCR_FEX);
 
 -	__FPU_FPSCR &= ~(FPSCR_VX);
 -	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 +	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
 -		__FPU_FPSCR |= FPSCR_VX;
 -
 -	fpscr = __FPU_FPSCR;
 -	fpscr &= ~(FPSCR_FEX);
 -	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
 -	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
 -	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
 -	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
 -	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
 -		fpscr |= FPSCR_FEX;
 +		fpscr |= FPSCR_VX;
 +
 +	/* The bit order of exception enables and exception status
 +	 * is the same. Simply shift and mask to check for enabled
 +	 * exceptions.
 +	 */
 +	if (fpscr & (fpscr >> 22) & 0xf8) fpscr |= FPSCR_FEX;
 	__FPU_FPSCR = fpscr;
 
  #ifdef DEBUG
  mtfsf.c |   57 ++---
  1 file changed, 22 insertions(+), 35 deletions(-)
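
A quick way to convince yourself the shift-and-mask FEX test is right:
the VX..XX status bits sit exactly 22 bits above the VE..XE enable bits
in the FPSCR, so fpscr & (fpscr >> 22) & 0xf8 is non-zero precisely when
some exception occurred with its enable set. A small user-space sketch
(assuming the architected FPSCR_* bit values below) checks this
exhaustively against the chain of tests being replaced:

	#include <stdint.h>
	#include <stdio.h>

	#define FPSCR_VX 0x20000000u	/* exception status bits */
	#define FPSCR_OX 0x10000000u
	#define FPSCR_UX 0x08000000u
	#define FPSCR_ZX 0x04000000u
	#define FPSCR_XX 0x02000000u
	#define FPSCR_VE 0x00000080u	/* exception enable bits */
	#define FPSCR_OE 0x00000040u
	#define FPSCR_UE 0x00000020u
	#define FPSCR_ZE 0x00000010u
	#define FPSCR_XE 0x00000008u

	static int fex_old(uint32_t f)	/* the replaced chain of tests */
	{
		return ((f & FPSCR_VX) && (f & FPSCR_VE)) ||
		       ((f & FPSCR_OX) && (f & FPSCR_OE)) ||
		       ((f & FPSCR_UX) && (f & FPSCR_UE)) ||
		       ((f & FPSCR_ZX) && (f & FPSCR_ZE)) ||
		       ((f & FPSCR_XX) && (f & FPSCR_XE));
	}

	int main(void)
	{
		/* exhaustive over the 5 status and 5 enable bits */
		for (uint32_t i = 0; i < 1024; i++) {
			uint32_t f = ((i & 0x1f) << 25) | ((i >> 5) << 3);
			if (fex_old(f) != !!(f & (f >> 22) & 0xf8)) {
				printf("mismatch at %08x\n", f);
				return 1;
			}
		}
		printf("shift-and-mask FEX test matches\n");
		return 0;
	}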
 
 
 Notes: 
 
 1) I'm really unsure on whether 0xff is frequent or not. So the 

Re: [PATCH v2] powerpc ticket locks

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 04:52:17PM +0100, Torsten Duwe wrote:
 Opinions, anyone?

Since the holder thing is a performance thing, not a correctness thing;
one thing you could do is something like:

#define OWNER_HASH_SIZE	(CONFIG_NR_CPUS * 4)
#define OWNER_HASH_BITS	ilog2(OWNER_HASH_SIZE)

u16 lock_owner_array[OWNER_HASH_SIZE] = { 0, };

void set_owner(arch_spinlock_t *lock, int owner)
{
int hash = hash_ptr(lock, OWNER_HASH_BITS);
lock_owner_array[hash] = owner;
}

void yield_to_owner(arch_spinlock_t *lock)
{
int hash = hash_ptr(lock, OWNER_HASH_BITS);
int owner = lock_owner_array[hash];
yield_to_cpu(owner);
}

And call set_owner() after the ticket lock is acquired, and don't bother
clearing it again; a new acquire will overwrite, a collision we have to
live with.

It should on average get you the right yield and does away with having
to track the owner field in place.

It does however get you an extra cacheline miss on acquire :/

One could consider patching it out when you know your kernel is not
running on an overloaded partition.
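
For illustration, the call sites might look roughly like this (a sketch
only; ticket_take() is a stand-in for a fetch-and-increment of the tail,
and the field names follow the earlier arch_spinlock_t patch):

	static inline void arch_spin_lock(arch_spinlock_t *lock)
	{
		struct __raw_tickets t = ticket_take(lock);

		while (t.head != t.tail) {
			/* confer our time slices to the probable holder */
			yield_to_owner(lock);
			t.head = ACCESS_ONCE(lock->tickets.head);
		}
		/* overwrite the hash slot; it is never cleared on unlock */
		set_owner(lock, raw_smp_processor_id());
	}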


Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed

2014-02-10 Thread Brian Norris
Hi Hou,

On Thu, Jan 23, 2014 at 03:29:41AM +, b48...@freescale.com wrote:
  -Original Message-
  From: Brian Norris [mailto:computersforpe...@gmail.com]
  Sent: Thursday, January 23, 2014 10:12 AM
  To: Hou Zhiqiang-B48286
  Cc: linux-...@lists.infradead.org; linuxppc-...@ozlabs.org; Wood Scott-
  B07421; Hu Mingkai-B21284; Ezequiel Garcia
  Subject: Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
  
  On Mon, Jan 06, 2014 at 02:34:29PM +0800, Hou Zhiqiang wrote:
   --- a/drivers/mtd/devices/m25p80.c
   +++ b/drivers/mtd/devices/m25p80.c
   @@ -1012,7 +1012,8 @@ static int m25p_probe(struct spi_device *spi)
   	if (data && data->name)
   		flash->mtd.name = data->name;
   	else
   -		flash->mtd.name = dev_name(&spi->dev);
   +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
   +				id->name, spi->chip_select);
  
  Changing the mtd.name may have far-reaching consequences for users who
  already have mtdparts= command lines. But your concern is probably valid
  for dynamically-determined bus numbers. Perhaps you can edit this patch
  to only change the name when the busnum is dynamically-allocated?
 
 It's a good idea, but when mtd_info's name is dynamically allocated,
 using mtdparts=... on the command line is obviously not workable.

I agree that users should never have relied on the dynamically-allocated
name. But changing the name for non-dynamic schemes (e.g., where the SPI
busnum is a fixed value) is not pleasant for users.

 Would you tell me what side effects would be brought by changing
 mtd_info's name?

I can only think of the mtdparts= command line. Otherwise, I don't think
the name is very important.

Brian


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-10 Thread Nishanth Aravamudan
Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
 Here is a draft of a patch to make this work with memoryless nodes.
 
 The first thing is that we modify node_match to also match if we hit an
 empty node. In that case we simply take the current slab if it's there.
 
 If there is no current slab then a regular allocation occurs with the
 memoryless node. The page allocator will fall back to a possible node and
 that will become the current slab. The next alloc from a memoryless node
 will then use that slab.
 
 For that we also add some tracking of allocations on nodes that were not
 satisfied, using the empty_node[] array. A successful alloc on a node
 clears that flag.
 
 I would rather avoid the empty_node[] array since it's global and there may
 be thread specific allocation restrictions, but it would be expensive to do
 an allocation attempt via the page allocator to make sure that there is
 really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:8264704 kB
MemFree: 5924608 kB
...
Slab:1402496 kB
SReclaimable: 102848 kB
SUnreclaim:  1299648 kB

And Anton's slabusage reports:

slab                    mem     objs    slabs
                       used   active   active

kmalloc-16384   207 MB   98.60%  100.00%
task_struct 134 MB   97.82%  100.00%
kmalloc-8192117 MB  100.00%  100.00%
pgtable-2^12111 MB  100.00%  100.00%
pgtable-2^10104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:8264704 kB
MemFree: 4150464 kB
...
Slab:1590336 kB
SReclaimable: 208768 kB
SUnreclaim:  1381568 kB

slabusage:

slab                    mem     objs    slabs
                       used   active   active

kmalloc-16384   227 MB   98.63%  100.00%
kmalloc-8192130 MB  100.00%  100.00%
task_struct 129 MB   97.73%  100.00%
pgtable-2^12112 MB  100.00%  100.00%
pgtable-2^10106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
Tested-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

I was thinking about your concerns about empty_node[]. Would it make
sense to use helper functions, rather than direct access to
empty_node[], such as:

bool is_node_empty(int nid)

void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?
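
Concretely, something like this (a sketch; the flag array moves behind
the helpers):

	#ifdef CONFIG_HAVE_MEMORYLESS_NODES
	static int empty_node[MAX_NUMNODES];

	static inline bool is_node_empty(int nid)
	{
		return empty_node[nid];
	}

	static inline void set_node_empty(int nid, bool empty)
	{
		empty_node[nid] = empty;
	}
	#else
	static inline bool is_node_empty(int nid) { return false; }
	static inline void set_node_empty(int nid, bool empty) { }
	#endif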

Thanks,
Nish

 Index: linux/mm/slub.c
 ===
 --- linux.orig/mm/slub.c  2014-02-03 13:19:22.896853227 -0600
 +++ linux/mm/slub.c   2014-02-07 12:44:49.311494806 -0600
 @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
  #endif
  }
 
 +static int empty_node[MAX_NUMNODES];
 +
  /*
   * Issues still to be resolved:
   *
 @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
 +	int alloc_node;
 
 	BUG_ON(flags & GFP_SLAB_BUG_MASK);
 
 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
 -	if (!page)
 +	if (!page) {
 +		if (node != NUMA_NO_NODE)
 +			empty_node[node] = 1;
 		goto out;
 +	}
 
 	order = compound_order(page);
 -	inc_slabs_node(s, page_to_nid(page), page->objects);
 +	alloc_node = page_to_nid(page);
 +	empty_node[alloc_node] = 0;
 +	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
 @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
  {
 	void *object;
 -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
  static inline int node_match(struct page *page, int node)
  {
  #ifdef CONFIG_NUMA
 -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
 +	int page_node;
 +
 +	/* No data means no match */
 +	if (!page)
  

Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed

2014-02-10 Thread Brian Norris
On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote:
 To give the SPI flash layout using mtdparts=... on the command line, we must
 give mtd_info a fixed name, because the cmdlinepart parser matches the name
 given on the command line against the mtd_info name.
 
 Now, if an OF node is used, mtd_info's name will be the spi device name.
 That name contains spi_master->bus_num, and the bus_num may be dynamically
 assigned.
 So, give the mtd_info a new fixed name "name.cs", where "name" is the name
 from the spi_device_id and "cs" is the chip select of the spi device.
 
 Signed-off-by: Hou Zhiqiang b48...@freescale.com
 ---
 v2:
  - add check for return value of function kasprintf.
  - whether spi_master->bus_num is dynamically assigned is determined by the
    SPI controller driver, and it can't be checked in this driver. So we
    cannot initialize the mtd_info name differently depending on whether the
    spi_master bus_num is dynamically allocated or not.

How about spi->master->bus_num < 0 ?

  drivers/mtd/devices/m25p80.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
 index eb558e8..1f494d2 100644
 --- a/drivers/mtd/devices/m25p80.c
 +++ b/drivers/mtd/devices/m25p80.c
 @@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
  
  	if (data && data->name)
  		flash->mtd.name = data->name;
  -	else
  -		flash->mtd.name = dev_name(&spi->dev);
  +	else {
  +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
  +				id->name, spi->chip_select);

I don't think this name is specific enough. What if there are more than
one SPI controller? Then there could be one chip with the same
chip-select. You probably still need to incorporate the SPI master
somehow, even if it's not by using the bus number directly (because it's
dynamic).

  +		if (!flash->mtd.name)
  +			return -ENOMEM;
  +	}
  
  	flash->mtd.type = MTD_NORFLASH;
  	flash->mtd.writesize = 1;

Brian


[RFT][PATCH 04/12] drivers/macintosh/adb: change platform power management to use dev_pm_ops

2014-02-10 Thread Shuah Khan
Change adb platform driver to register pm ops using dev_pm_ops instead of
legacy pm_ops. .pm hooks call existing legacy suspend and resume interfaces
by passing in the right pm state.

Signed-off-by: Shuah Khan shuah...@samsung.com
---
 drivers/macintosh/adb.c |   41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 04a5049..df6b201 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -263,7 +263,7 @@ adb_reset_bus(void)
 /*
  * notify clients before sleep
  */
-static int adb_suspend(struct platform_device *dev, pm_message_t state)
+static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
adb_got_sleep = 1;
/* We need to get a lock on the probe thread */
@@ -276,10 +276,25 @@ static int adb_suspend(struct platform_device *dev, pm_message_t state)
return 0;
 }
 
+static int adb_suspend(struct device *dev)
+{
+   return __adb_suspend(to_platform_device(dev), PMSG_SUSPEND);
+}
+
+static int adb_freeze(struct device *dev)
+{
+   return __adb_suspend(to_platform_device(dev), PMSG_FREEZE);
+}
+
+static int adb_poweroff(struct device *dev)
+{
+   return __adb_suspend(to_platform_device(dev), PMSG_HIBERNATE);
+}
+
 /*
  * reset bus after sleep
  */
-static int adb_resume(struct platform_device *dev)
+static int __adb_resume(struct platform_device *dev)
 {
adb_got_sleep = 0;
	up(&adb_probe_mutex);
@@ -287,6 +302,11 @@ static int adb_resume(struct platform_device *dev)
 
return 0;
 }
+
+static int adb_resume(struct device *dev)
+{
+   return __adb_resume(to_platform_device(dev));
+}
 #endif /* CONFIG_PM */
 
 static int __init adb_init(void)
@@ -831,14 +851,25 @@ static const struct file_operations adb_fops = {
.release= adb_release,
 };
 
+#ifdef CONFIG_PM
+static const struct dev_pm_ops adb_dev_pm_ops = {
+   .suspend = adb_suspend,
+   .resume = adb_resume,
+   /* Hibernate hooks */
+   .freeze = adb_freeze,
+   .thaw = adb_resume,
+   .poweroff = adb_poweroff,
+   .restore = adb_resume,
+};
+#endif
+
 static struct platform_driver adb_pfdrv = {
.driver = {
		.name = "adb",
-   },
 #ifdef CONFIG_PM
-   .suspend = adb_suspend,
-   .resume = adb_resume,
+   .pm = adb_dev_pm_ops,
 #endif
+   },
 };
 
 static struct platform_device adb_pfdev = {
-- 
1.7.10.4



[RFT][PATCH 00/12] change drivers power management to use dev_pm_ops

2014-02-10 Thread Shuah Khan
Change drivers to register pm ops using dev_pm_ops instead of legacy pm_ops.
.pm hooks call existing legacy suspend and resume interfaces by passing in
the right pm state. Bus drivers suspend and resume routines call .pm driver
hooks if found.

Shuah Khan (12):
  arm: change locomo platform and bus power management to use
dev_pm_ops
  arm: change sa platform and bus power management to use
dev_pm_ops
  arm: change scoop platform power management to use dev_pm_ops
  drivers/macintosh/adb: change platform power management to use
dev_pm_ops
  mmc: change au1xmmc platform power management to use dev_pm_ops
  mmc: change bfin_sdh platform power management to use dev_pm_ops
  isa: change isa bus power management to use dev_pm_ops
  mmc: change cb710-mmc platform power management to use dev_pm_ops
  mmc: change msm_sdcc platform power management to use dev_pm_ops
  mmc: change tmio_mmc platform power management to use dev_pm_ops
  drivers/pcmcia: change ds driver power management to use dev_pm_ops
  drivers/s390/crypto: change ap_bus driver power management to use
dev_pm_ops

 arch/arm/common/locomo.c |   93 +++---
 arch/arm/common/sa.c |   88 +++
 arch/arm/common/scoop.c  |   44 
 drivers/base/isa.c   |   30 --
 drivers/macintosh/adb.c  |   41 ---
 drivers/mmc/host/au1xmmc.c   |   43 +++
 drivers/mmc/host/bfin_sdh.c  |   40 +++---
 drivers/mmc/host/cb710-mmc.c |   37 +++--
 drivers/mmc/host/msm_sdcc.c  |   42 +++
 drivers/mmc/host/tmio_mmc.c  |   42 +++
 drivers/pcmcia/ds.c  |   36 +---
 drivers/s390/crypto/ap_bus.c |   30 --
 12 files changed, 481 insertions(+), 85 deletions(-)

-- 
1.7.10.4



Re: [PATCH] Handle vmalloc addresses

2014-02-10 Thread Benjamin Herrenschmidt
Hi Nathan !

Please do a better submission :-)

Your subject is, to be honest, crap. Something like

[PATCH] crypto/nx/nx-842: Fix handling of vmalloc addresses

Would have been much more informative.

Additionally, this is a patch for drivers/crypto, and while that driver
is powerpc-specific, meaning I *could* take that patch, it should at
least be CCed to the crypto list/maintainer since that would
be the normal path for such a patch to be applied.

I'm taking it this time around but I know you can do better !

Cheers,
Ben.

On Wed, 2014-01-29 at 10:34 -0600, Nathan Fontenot wrote:
 The nx-842 compression driver does not currently handle getting
 a physical address for vmalloc addresses. The current driver
 uses __pa() for all addresses which does not properly handle
 vmalloc addresses and thus causes a failure since we do not pass
 a proper physical address to phyp.
 
 This patch adds a routine to convert an address to a physical
 address by checking for vmalloc addresses and handling them properly.
 
 Signed-off-by: Nathan Fontenot nf...@linux.vnet.ibm.com
  ---
  drivers/crypto/nx/nx-842.c |   29 +++--
  1 file changed, 19 insertions(+), 10 deletions(-)
 
 Index: linux/drivers/crypto/nx/nx-842.c
 ===
 --- linux.orig/drivers/crypto/nx/nx-842.c	2014-01-22 08:52:55.000000000 -0600
 +++ linux/drivers/crypto/nx/nx-842.c	2014-01-29 08:25:33.000000000 -0600
 @@ -158,6 +158,15 @@
 	return sl->entry_nr * sizeof(struct nx842_slentry);
  }
  
 +static inline unsigned long nx842_get_pa(void *addr)
 +{
 +	if (is_vmalloc_addr(addr))
 +		return page_to_phys(vmalloc_to_page(addr))
 +			+ offset_in_page(addr);
 + else
 + return __pa(addr);
 +}
 +
  static int nx842_build_scatterlist(unsigned long buf, int len,
   struct nx842_scatterlist *sl)
  {
 @@ -168,7 +177,7 @@
  
 	entry = sl->entries;
 	while (len) {
 -		entry->ptr = __pa(buf);
 +		entry->ptr = nx842_get_pa((void *)buf);
 		nextpage = ALIGN(buf + 1, NX842_HW_PAGE_SIZE);
 		if (nextpage < buf + len) {
   /* we aren't at the end yet */
 @@ -370,8 +379,8 @@
   op.flags = NX842_OP_COMPRESS;
 	csbcpb = &workmem->csbcpb;
   memset(csbcpb, 0, sizeof(*csbcpb));
 - op.csbcpb = __pa(csbcpb);
 - op.out = __pa(slout.entries);
 + op.csbcpb = nx842_get_pa(csbcpb);
 + op.out = nx842_get_pa(slout.entries);
  
 	for (i = 0; i < hdr->blocks_nr; i++) {
   /*
 @@ -401,13 +410,13 @@
*/
   if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
   /* Create direct DDE */
 - op.in = __pa(inbuf);
 + op.in = nx842_get_pa((void *)inbuf);
   op.inlen = max_sync_size;
  
   } else {
   /* Create indirect DDE (scatterlist) */
 		nx842_build_scatterlist(inbuf, max_sync_size, &slin);
 - op.in = __pa(slin.entries);
 + op.in = nx842_get_pa(slin.entries);
 		op.inlen = -nx842_get_scatterlist_size(&slin);
   }
  
 @@ -565,7 +574,7 @@
   op.flags = NX842_OP_DECOMPRESS;
 	csbcpb = &workmem->csbcpb;
   memset(csbcpb, 0, sizeof(*csbcpb));
 - op.csbcpb = __pa(csbcpb);
 + op.csbcpb = nx842_get_pa(csbcpb);
  
   /*
* max_sync_size may have changed since compression,
 @@ -597,12 +606,12 @@
 	if (likely((inbuf & NX842_HW_PAGE_MASK) ==
 	    ((inbuf + hdr->sizes[i] - 1) & NX842_HW_PAGE_MASK))) {
   /* Create direct DDE */
 - op.in = __pa(inbuf);
 + op.in = nx842_get_pa((void *)inbuf);
 		op.inlen = hdr->sizes[i];
   } else {
   /* Create indirect DDE (scatterlist) */
 		nx842_build_scatterlist(inbuf, hdr->sizes[i], &slin);
 - op.in = __pa(slin.entries);
 + op.in = nx842_get_pa(slin.entries);
 		op.inlen = -nx842_get_scatterlist_size(&slin);
   }
  
 @@ -613,12 +622,12 @@
*/
   if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
   /* Create direct DDE */
 - op.out = __pa(outbuf);
 + op.out = nx842_get_pa((void *)outbuf);
   op.outlen = max_sync_size;
   } else {
   /* Create indirect DDE (scatterlist) */
 		nx842_build_scatterlist(outbuf, max_sync_size, &slout);
 - op.out = __pa(slout.entries);
 + op.out = nx842_get_pa(slout.entries);
 		op.outlen = -nx842_get_scatterlist_size(&slout);
   }
  
 
 

[PATCH] powerpc/powernv: Add iommu DMA bypass support for IODA2

2014-02-10 Thread Benjamin Herrenschmidt
This patch adds support for creating a direct iommu bypass window on
IODA2 bridges (such as Power8), allowing 64-bit DMA capable devices to
bypass iommu page translation completely and thus significantly
improving DMA performance.

Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---

This is pretty much identical to the old version I posted a while ago,
except that it does have the iommu calls to enable/disable which I had
forgotten to git add before posting the previous one.

 arch/powerpc/include/asm/dma-mapping.h|  1 +
 arch/powerpc/include/asm/iommu.h  |  1 +
 arch/powerpc/kernel/dma.c | 10 ++--
 arch/powerpc/kernel/iommu.c   | 12 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 +++
 arch/powerpc/platforms/powernv/pci.c  | 10 
 arch/powerpc/platforms/powernv/pci.h  |  6 ++-
 arch/powerpc/platforms/powernv/powernv.h  |  4 ++
 arch/powerpc/platforms/powernv/setup.c|  9 
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index e27e9ad..150866b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -134,6 +134,7 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 }
 
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
+extern int __dma_set_mask(struct device *dev, u64 dma_mask);
 
 #define dma_alloc_coherent(d,s,h,f)	dma_alloc_attrs(d,s,h,f,NULL)
 
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f7a8036..42632c7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,6 +77,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
+   void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 8032b97..ee78f6e 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -191,12 +191,10 @@ EXPORT_SYMBOL(dma_direct_ops);
 
 #define PREALLOC_DMA_DEBUG_ENTRIES (1  16)
 
-int dma_set_mask(struct device *dev, u64 dma_mask)
+int __dma_set_mask(struct device *dev, u64 dma_mask)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);
 
-   if (ppc_md.dma_set_mask)
-   return ppc_md.dma_set_mask(dev, dma_mask);
	if ((dma_ops != NULL) && (dma_ops->set_dma_mask != NULL))
		return dma_ops->set_dma_mask(dev, dma_mask);
	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -204,6 +202,12 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
	*dev->dma_mask = dma_mask;
return 0;
 }
+int dma_set_mask(struct device *dev, u64 dma_mask)
+{
+   if (ppc_md.dma_set_mask)
+   return ppc_md.dma_set_mask(dev, dma_mask);
+   return __dma_set_mask(dev, dma_mask);
+}
 EXPORT_SYMBOL(dma_set_mask);
 
 u64 dma_get_required_mask(struct device *dev)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..88e3ec6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,14 @@ int iommu_take_ownership(struct iommu_table *tbl)
	memset(tbl->it_map, 0xff, sz);
	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
+   /*
+* Disable iommu bypass, otherwise the user can DMA to all of
+* our physical memory via the bypass window instead of just
+* the pages that has been explicitly mapped into the iommu
+*/
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, false);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1102,6 +1110,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
/* Restore bit#0 set by iommu_init_table() */
	if (tbl->it_offset == 0)
		set_bit(0, tbl->it_map);
+
+   /* The kernel owns the device now, we can restore the iommu bypass */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7d6dcc6..3b2b4fb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/memblock.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -460,9 +461,39 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev)
return;
 
 

Re: [PATCH v2] powerpc ticket locks

2014-02-10 Thread Benjamin Herrenschmidt
(Linus, Al, a question for you down there about lockref ref size)

On Mon, 2014-02-10 at 16:52 +0100, Torsten Duwe wrote:

 What if I squeeze the bits a little?
 4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
 At the cost that unlock may become an ll/sc operation again.
 I could think about a trick against that.
 But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/
 
 Doing another round of yields for lockrefs now doesn't
 sound so bad any more.

So, the ticketpair has to be 16:16 so we can avoid the atomic on unlock.

That leaves us with 32 bits to put the ref and the owner. The question
is how big the ref really has to be and can we have a reasonable failure
mode if it overflows ?

If we limit ourselves to, for example, 16-bit for the ref in lockref,
then we can have the second 32-bit split between the owner and the ref.

If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...

So the question is, is it reasonable to have the ref smaller than
32-bit...

Cheers,
Ben.




Re: [PATCH v2] powerpc ticket locks

2014-02-10 Thread Al Viro
On Tue, Feb 11, 2014 at 01:44:20PM +1100, Benjamin Herrenschmidt wrote:
 That leaves us with 32 bits to put the ref and the owner. The question
 is how big the ref really has to be and can we have a reasonable failure
 mode if it overflows ?
 
 If we limit ourselves to, for example, 16-bit for the ref in lockref,
 then we can have the second 32-bit split between the owner and the ref.
 
 If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...
 
 So the question is, is it reasonable to have the ref smaller than
 32-bit...

Every time you open a file, you bump dentry refcount.  Something like
libc or ld.so will be opened on just about every execve(), so I'd say
that 16 bits is far too low.  If nothing else, 32 bits might be too
low on 64bit boxen...


Re: [PATCH v2] powerpc ticket locks

2014-02-10 Thread Benjamin Herrenschmidt
On Tue, 2014-02-11 at 02:56 +, Al Viro wrote:
  So the question is, is it reasonable to have the ref smaller than
  32-bit...
 
 Every time you open a file, you bump dentry refcount.  Something like
 libc or ld.so will be opened on just about every execve(), so I'd say
 that 16 bits is far too low.  If nothing else, 32 bits might be too
 low on 64bit boxen...

So back to square 1 ... we can't implement lockref, ticket locks, and
our lock-confer mechanism together within 64 bits.
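
To make the arithmetic concrete, this is the layout we would need (a
sketch; the holder width follows the 4k-vCPU idea discussed earlier,
and there is simply no room for it in the same word):

	struct ppc_lockref {		/* must fit in one 64-bit word */
		union {
			u64 lock_count;
			struct {
				u16 head;	/* ticket being served */
				u16 tail;	/* next ticket to hand out */
				u32 count;	/* lockref count: can't shrink, per Al */
			};
		};
		u32 holder;	/* confer target -- doesn't fit above */
	};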

I see two options at this stage. Both require a custom implementation
of lockref for powerpc, so some ifdef's such that we can replace the
generic implementation completely.

 - We can use a small ref, and when it's too big, overflow into a larger
one, falling back to the old style lock + ref (an overflow bit or a
compare with )

 - We can have lockref build its own lock out of the ticketpair and
ref, keeping the owner in a separate word. The owner doesn't strictly
need to be atomic.

Both are gross though :(

Anybody has a better idea ?

Ben.




[PATCH] powerpc/spufs: Remove MAX_USER_PRIO define

2014-02-10 Thread Jeremy Kerr
Current ppc64_defconfig fails with:

 arch/powerpc/platforms/cell/spufs/sched.c:86:0: error: MAX_USER_PRIO redefined [-Werror]
 cc1: all warnings being treated as errors

6b6350f1 introduced a generic MAX_USER_PRIO macro in sched/prio.h, which
is causing the conflict. Use that one instead of our own.

Signed-off-by: Jeremy Kerr j...@ozlabs.org

---
Ingo: 6b6350f1 is currently in tip; this fixes a build breakage for spufs

---
 arch/powerpc/platforms/cell/spufs/sched.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4931838..4a0a64f 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -83,7 +83,6 @@ static struct timer_list spuloadavg_timer;
 #define MIN_SPU_TIMESLICE  max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
 #define DEF_SPU_TIMESLICE  (100 * HZ / (1000 * SPUSCHED_TICK))
 
-#define MAX_USER_PRIO  (MAX_PRIO - MAX_RT_PRIO)
 #define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
 


[PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform

2014-02-10 Thread Vaidyanathan Srinivasan
Hi,

The following patch series implements the platform driver to support
dynamic CPU frequency scaling on IBM POWERNV platforms.

This patch series is based on Linux kernel 3.14-rc2 and tested on
OPAL v3 based IBM POWERNV platform and IBM POWER8 processor.

--Vaidy

---

Srivatsa S. Bhat (1):
  powernv, cpufreq: Add per-core locking to serialize frequency transitions

Vaidyanathan Srinivasan (1):
  powernv: cpufreq driver for powernv platform


 arch/powerpc/include/asm/reg.h|4 +
 drivers/cpufreq/Kconfig.powerpc   |9 +
 drivers/cpufreq/Makefile  |1 
 drivers/cpufreq/powernv-cpufreq.c |  286 +
 4 files changed, 300 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

-- 



[PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions

2014-02-10 Thread Vaidyanathan Srinivasan
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

On POWER systems, the CPU frequency is controlled at the core level, and
hence we need to serialize so that only one of the threads in the core
switches the core's frequency at a time.

Using a global mutex lock would needlessly serialize _all_ frequency
transitions in the system (across all cores). So introduce per-core
locking to enable finer-grained synchronization and thereby enhance
the speed and responsiveness of the cpufreq driver to varying workload
demands.

The design of per-core locking is very simple and straightforward: we
define a per-CPU lock and use the one that belongs to the first
thread sibling of the core.

The cpu_first_thread_sibling() macro is used to find the *common* lock
for all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 drivers/cpufreq/powernv-cpufreq.c |   21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include <linux/of.h>
 #include <asm/cputhreads.h>
 
-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)	\
+	mutex_lock(&per_cpu(freq_switch_lock,\
+		cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)	\
+	mutex_unlock(&per_cpu(freq_switch_lock,\
+		cpu_first_thread_sibling(cpu)));
 
 #define POWERNV_MAX_PSTATES256
 
@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	freqs.new = powernv_freqs[new_index].frequency;
 	freqs.cpu = policy->cpu;
 
-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
 
 	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	rc = powernv_set_freq(policy->cpus, new_index);
 
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);
 
 	return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int __init powernv_cpufreq_init(void)
 {
-   int rc = 0;
+   int cpu, rc = 0;
 
/* Discover pstates from device tree and init */
 
@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
 		pr_info("powernv-cpufreq disabled\n");
 		return rc;
 	}
+	/* Init per-core mutex */
+	for_each_possible_cpu(cpu) {
+		mutex_init(&per_cpu(freq_switch_lock, cpu));
+	}
 
 	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
return rc;



[PATCH v1 1/2] powernv: cpufreq driver for powernv platform

2014-02-10 Thread Vaidyanathan Srinivasan
Backend driver to dynamically set voltage and frequency on
IBM POWER non-virtualized platforms.  Power management SPRs
are used to set the required PState.

This driver works in conjunction with cpufreq governors
like 'ondemand' to provide a demand based frequency and
voltage setting on IBM POWER non-virtualized platforms.

PState table is obtained from OPAL v3 firmware through device
tree.

The powernv_cpufreq back-end driver parses the relevant device-tree
nodes and initialises the cpufreq subsystem on the powernv platform.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Anton Blanchard an...@samba.org
---
 arch/powerpc/include/asm/reg.h|4 +
 drivers/cpufreq/Kconfig.powerpc   |9 +
 drivers/cpufreq/Makefile  |1 
 drivers/cpufreq/powernv-cpufreq.c |  275 +
 4 files changed, 289 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1 0x13B   /* Hypervisor Save/Restore 1 */
 #define SPRN_IC0x350   /* Virtual Instruction Count */
 #define SPRN_VTB   0x351   /* Virtual Time Base */
+#define SPRN_PMICR 0x354   /* Power Management Idle Control Reg */
+#define SPRN_PMSR  0x355   /* Power Management Status Reg */
+#define SPRN_PMCR  0x374   /* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG8   /* Enable Target Address Register */
 #define FSCR_EBB_LG7   /* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..4a91ab1 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ
help
  This adds the support for frequency switching on PA Semi
  PWRficient processors.
+
+config POWERNV_CPUFREQ
+	tristate "CPU frequency scaling for IBM POWERNV platform"
+   depends on PPC_POWERNV
+   select CPU_FREQ_TABLE
+   default y
+   help
+This adds support for CPU frequency switching on IBM POWERNV
+platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)   += ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)  += pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)   += pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)  += powernv-cpufreq.o
 
 
##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 000..ea3b630
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,275 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
+
+#include <linux/module.h>
+#include <linux/cpufreq.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+
+/* FIXME: Make this per-core */
+static DEFINE_MUTEX(freq_switch_mutex);
+
+#define POWERNV_MAX_PSTATES256
+
+static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+/*
+ * Initialize the freq table based on data obtained
+ * from the firmware passed via device-tree
+ */
+
+static int init_powernv_pstates(void)
+{
+   struct device_node *power_mgt;
+   int nr_pstates = 0;
+   int pstate_min, pstate_max, pstate_nominal;
+   const __be32 *pstate_ids, *pstate_freqs;
+   int i;
+   u32 len_ids, len_freqs;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("power-mgt node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
+		pr_warn("ibm,pstate-min node not found\n");
+		return -ENODEV;
+	}
+
+   if (of_property_read_u32(power_mgt, 
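
The hunk is cut off here in the archive. For reference only, a sketch of
how this parsing might plausibly continue, reading the pstate id/frequency
arrays and filling the cpufreq table; the property names ("ibm,pstate-ids",
"ibm,pstate-frequencies") and the MHz-to-kHz scaling are assumptions of the
sketch, not the actual remainder of the patch:

	/* Sketch only, continuing init_powernv_pstates() with the
	 * variables already declared above. */
	pstate_ids = of_get_property(power_mgt, "ibm,pstate-ids", &len_ids);
	pstate_freqs = of_get_property(power_mgt, "ibm,pstate-frequencies",
				       &len_freqs);
	if (!pstate_ids || !pstate_freqs)
		return -ENODEV;

	nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
	if (nr_pstates > POWERNV_MAX_PSTATES)
		nr_pstates = POWERNV_MAX_PSTATES;

	for (i = 0; i < nr_pstates; i++) {
		powernv_freqs[i].driver_data = be32_to_cpu(pstate_ids[i]);
		/* assumed: firmware reports MHz, cpufreq wants kHz */
		powernv_freqs[i].frequency =
			be32_to_cpu(pstate_freqs[i]) * 1000;
	}
	powernv_freqs[i].frequency = CPUFREQ_TABLE_END;

	return 0;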

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-10 Thread Joonsoo Kim
On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
 Hi Christoph,
 
 On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
  Here is a draft of a patch to make this work with memoryless nodes.
  
  The first thing is that we modify node_match to also match if we hit an
  empty node. In that case we simply take the current slab if it's there.
  
  If there is no current slab then a regular allocation occurs with the
  memoryless node. The page allocator will fall back to a possible node and
  that will become the current slab. Next alloc from a memoryless node
  will then use that slab.
  
  For that we also add some tracking of allocations on nodes that were not
  satisfied using the empty_node[] array. A successful alloc on a node
  clears that flag.
  
  I would rather avoid the empty_node[] array since it's global and there may
  be thread specific allocation restrictions but it would be expensive to do
  an allocation attempt via the page allocator to make sure that there is
  really no page available from the page allocator.
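
For illustration, the node_match change being described could look like this
sketch (the idea only, not the actual draft patch; node_match lives in
mm/slub.c):

	static inline int node_match(struct page *page, int node)
	{
		if (!page)
			return 0;
		/* Sketch: treat a memoryless node like NUMA_NO_NODE so
		 * the current cpu slab is reused instead of being
		 * deactivated on every allocation from that node. */
		if (node != NUMA_NO_NODE && node_present_pages(node) &&
		    page_to_nid(page) != node)
			return 0;
		return 1;
	}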
 
 With this patch on our test system (I pulled out the numa_mem_id()
 change, since you Acked Joonsoo's already), on top of 3.13.0 + my
 kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
 patch 1):
 
 MemTotal:8264704 kB
 MemFree: 5924608 kB
 ...
 Slab:1402496 kB
 SReclaimable: 102848 kB
 SUnreclaim:  1299648 kB
 
 And Anton's slabusage reports:
 
 slab              mem     objs    slabs
                  used   active   active
 
 kmalloc-16384   207 MB   98.60%  100.00%
 task_struct 134 MB   97.82%  100.00%
 kmalloc-8192117 MB  100.00%  100.00%
 pgtable-2^12111 MB  100.00%  100.00%
 pgtable-2^10104 MB  100.00%  100.00%
 
 For comparison, Anton's patch applied at the same point in the series:
 
 meminfo:
 
 MemTotal:8264704 kB
 MemFree: 4150464 kB
 ...
 Slab:1590336 kB
 SReclaimable: 208768 kB
 SUnreclaim:  1381568 kB
 
 slabusage:
 
 slab              mem     objs    slabs
                  used   active   active
 
 kmalloc-16384   227 MB   98.63%  100.00%
 kmalloc-8192130 MB  100.00%  100.00%
 task_struct 129 MB   97.73%  100.00%
 pgtable-2^12112 MB  100.00%  100.00%
 pgtable-2^10106 MB  100.00%  100.00%
 
 
 Consider this patch:
 
 Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 Tested-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:8264640 kB
  MemFree: 7119680 kB
  Slab: 207232 kB
  SReclaimable:  32896 kB
  SUnreclaim:   174336 kB

The numbers for CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
There is a large difference in slab usage.

And I should note that the number of active objects in slabinfo can be wrong
in some situations, since it doesn't consider the cpu slab (and cpu partial slabs).

I recommend confirming page_to_nid() and the other things I mentioned earlier.
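
One quick way to do that check, as a throwaway debug helper (a sketch, not
part of any posted patch):

	#include <linux/slab.h>
	#include <linux/mm.h>		/* virt_to_head_page(), page_to_nid() */
	#include <linux/topology.h>	/* numa_mem_id() */

	/* Allocate from a given node and report where the backing page
	 * actually came from. */
	static void check_alloc_node(int node)
	{
		void *obj = kmalloc_node(64, GFP_KERNEL, node);

		if (!obj)
			return;
		pr_info("wanted node %d, page node %d, numa_mem_id() %d\n",
			node, page_to_nid(virt_to_head_page(obj)),
			numa_mem_id());
		kfree(obj);
	}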

Thanks.


Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
Hi James,

On Mon, Feb 10, 2014 at 11:03:07AM -0600, James Yang wrote:

[snipped]
  Ok, if you have measured that method1 is faster than method2, let us go for 
  it.
  I believe method2 would be faster if you had a large out-of-order execution
  window, because more parallelism can be extracted from it, but this is 
  probably
  only true for high end cores, which do not need FPU emulation in the first 
  place.
 
 Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
 compiler generated.   
 

Then it is method1.

  
  The other place where we can optimize is the generation of FEX. Here is 
  my current patch:
  
  
  diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
  index dbce92e..b57b3fa8 100644
  --- a/arch/powerpc/math-emu/mtfsf.c
  +++ b/arch/powerpc/math-emu/mtfsf.c
  @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
  u32 mask;
  u32 fpscr;
   
  -   if (FM == 0)
  +   if (likely(FM == 0xff))
+   mask = 0xffffffff;
  +   else if (unlikely(FM == 0))
  return 0;
  -
  -   if (FM == 0xff)
-   mask = 0x9fffffff;
  else {
  -   mask = 0;
-   if (FM & (1 << 0))
-       mask |= 0x90000000;
-   if (FM & (1 << 1))
-       mask |= 0x0f000000;
-   if (FM & (1 << 2))
-       mask |= 0x00f00000;
-   if (FM & (1 << 3))
-       mask |= 0x000f0000;
-   if (FM & (1 << 4))
-       mask |= 0x0000f000;
-   if (FM & (1 << 5))
-       mask |= 0x00000f00;
-   if (FM & (1 << 6))
-       mask |= 0x000000f0;
-   if (FM & (1 << 7))
-       mask |= 0x0000000f;
+   mask = (FM & 1);
+   mask |= (FM << 3) & 0x10;
+   mask |= (FM << 6) & 0x100;
+   mask |= (FM << 9) & 0x1000;
+   mask |= (FM << 12) & 0x10000;
+   mask |= (FM << 15) & 0x100000;
+   mask |= (FM << 18) & 0x1000000;
+   mask |= (FM << 21) & 0x10000000;
+   mask *= 15;
 
 
 Needs to also mask out bits 1 and 2, they aren't to be set from frB.
 
  mask &= 0x9FFFFFFF;
 
 

Look at the following lines:

 
 
  }
   
-   __FPU_FPSCR &= ~(mask);
-   __FPU_FPSCR |= (frB[1] & mask);
+   fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+       ~(FPSCR_VX | FPSCR_FEX);
 
It's here (masking FPSCR_VX and FPSCR_FEX).

Actually the previous code was redundant: it cleared FEX and VX in the
mask computation and then again when recomputing them. Clearing them
once should be enough.

   
-   __FPU_FPSCR &= ~(FPSCR_VX);
-   if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+   if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
    FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
    FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-   __FPU_FPSCR |= FPSCR_VX;
-
-   fpscr = __FPU_FPSCR;
-   fpscr &= ~(FPSCR_FEX);
-   if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-       ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-       ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-       ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-       ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-       fpscr |= FPSCR_FEX;
+   fpscr |= FPSCR_VX;
  +
  +   /* The bit order of exception enables and exception status
  +* is the same. Simply shift and mask to check for enabled
  +* exceptions.
  +*/
+   if (fpscr & (fpscr >> 22) & 0xf8) fpscr |= FPSCR_FEX;
  __FPU_FPSCR = fpscr;
   
   #ifdef DEBUG
 mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
   1 file changed, 22 insertions(+), 35 deletions(-)
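
The shift-and-mask FEX computation is easy to sanity-check exhaustively in
user space. A quick sketch (the FPSCR_* values below follow the usual
PowerPC bit layout):

	#include <stdint.h>
	#include <stdio.h>

	#define FPSCR_VX 0x20000000u	/* exception status bits... */
	#define FPSCR_OX 0x10000000u
	#define FPSCR_UX 0x08000000u
	#define FPSCR_ZX 0x04000000u
	#define FPSCR_XX 0x02000000u
	#define FPSCR_VE 0x00000080u	/* ...and the matching enables */
	#define FPSCR_OE 0x00000040u
	#define FPSCR_UE 0x00000020u
	#define FPSCR_ZE 0x00000010u
	#define FPSCR_XE 0x00000008u

	static int fex_old(uint32_t f)	/* the removed chain of tests */
	{
		return ((f & FPSCR_VX) && (f & FPSCR_VE)) ||
		       ((f & FPSCR_OX) && (f & FPSCR_OE)) ||
		       ((f & FPSCR_UX) && (f & FPSCR_UE)) ||
		       ((f & FPSCR_ZX) && (f & FPSCR_ZE)) ||
		       ((f & FPSCR_XX) && (f & FPSCR_XE));
	}

	static int fex_new(uint32_t f)	/* status bits sit 22 above enables */
	{
		return (f & (f >> 22) & 0xf8) != 0;
	}

	int main(void)
	{
		uint32_t s, e;

		/* all 1024 combinations of 5 status and 5 enable bits */
		for (s = 0; s < 32; s++)
			for (e = 0; e < 32; e++) {
				uint32_t f = (s << 25) | (e << 3);
				if (fex_old(f) != fex_new(f)) {
					printf("mismatch at %08x\n", f);
					return 1;
				}
			}
		printf("shift-and-mask agrees with the old tests\n");
		return 0;
	}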
  
  
  Notes: 
  
  1) I'm really unsure whether 0xff is frequent or not, so the likely()
  at the beginning may be wrong. Actually, if it is not very likely,
  it might be better to remove the special case for FM==0xff. A look at
  the GCC sources shows that it never generates a mask of 0xff. In the
  glibc sources, the vast majority of cases use 0x1; only isnan() uses 0xff.
 
 Can't handle all cases here.  

That's why I would go for the simplest possible code. Conditionals are
expensive, and minimizing cache footprint is often the most effective
optimization for infrequently used code. With this in mind, I would get
rid of all the tests for special FM values and rely on the optimized
generic case.
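
Spelled out, the generic path reduces to something like this sketch (VX
and FEX still get cleared afterwards, exactly as in the patch above):

	static inline u32 fm_to_mask(u32 FM)
	{
		u32 mask;

		/* spread the 8 FM bits to bit positions 0, 4, ..., 28 */
		mask  = (FM & 1);
		mask |= (FM << 3)  & 0x10;
		mask |= (FM << 6)  & 0x100;
		mask |= (FM << 9)  & 0x1000;
		mask |= (FM << 12) & 0x10000;
		mask |= (FM << 15) & 0x100000;
		mask |= (FM << 18) & 0x1000000;
		mask |= (FM << 21) & 0x10000000;

		/* widen each selected bit into a full nibble: 0x1 -> 0xf */
		return mask * 15;
	}

With FM == 0 the sketch yields mask == 0 and FPSCR is written back
unchanged, so the early return is not needed for correctness either.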


  
  2) it may be better to remove the check for FM==0; after all, the
  instruction effectively becomes a nop, and generating the instruction
  in the first place would be too stupid for words.
 
 Hmm, a heavy no-op.  I wonder if it is heavier than a sync.

In theory not. It contains the equivalent of several isync (taking an exception
and returning from it), but not any synchronization wrt the memory accesses.

Gabriel