Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote: On Fri, 7 Feb 2014, Gabriel Paubert wrote: Hi Stephen, On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote: Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM: From: Gabriel Paubert paub...@iram.es To: Stephen N Chivers schiv...@csc.com.au Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au Date: 02/06/2014 07:26 PM Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask? On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:

	mask = 0;
	if (FM & (1 << 0))
		mask |= 0x0000000f;
	if (FM & (1 << 1))
		mask |= 0x000000f0;
	if (FM & (1 << 2))
		mask |= 0x00000f00;
	if (FM & (1 << 3))
		mask |= 0x0000f000;
	if (FM & (1 << 4))
		mask |= 0x000f0000;
	if (FM & (1 << 5))
		mask |= 0x00f00000;
	if (FM & (1 << 6))
		mask |= 0x0f000000;
	if (FM & (1 << 7))
		mask |= 0x90000000;

With the above mask computation I get consistent results for both the MPC8548 and MPC7410 boards. Am I missing something subtle? No, I think you are correct. This said, this code can probably be optimized to eliminate a lot of the conditional branches. I think that: If the compiler is allowed to generate isel instructions, it will not use conditional branches for this code. (Ignore the andi. values, this is an old compile.)

c0037c2c <mtfsf>:
c0037c2c:	2c 03 00 00	cmpwi   r3,0
c0037c30:	41 82 01 1c	beq-    c0037d4c <mtfsf+0x120>
c0037c34:	2f 83 00 ff	cmpwi   cr7,r3,255
c0037c38:	41 9e 01 28	beq-    cr7,c0037d60 <mtfsf+0x134>
c0037c3c:	70 66 00 01	andi.   r6,r3,1
c0037c40:	3d 00 90 00	lis     r8,-28672
c0037c44:	7d 20 40 9e	iseleq  r9,r0,r8
c0037c48:	70 6a 00 02	andi.   r10,r3,2
c0037c4c:	65 28 0f 00	oris    r8,r9,3840
c0037c50:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c54:	70 66 00 04	andi.   r6,r3,4
c0037c58:	65 28 00 f0	oris    r8,r9,240
c0037c5c:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c60:	70 6a 00 08	andi.   r10,r3,8
c0037c64:	65 28 00 0f	oris    r8,r9,15
c0037c68:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c6c:	70 66 00 10	andi.   r6,r3,16
c0037c70:	61 28 f0 00	ori     r8,r9,61440
c0037c74:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c78:	70 6a 00 20	andi.
r10,r3,32
c0037c7c:	61 28 0f 00	ori     r8,r9,3840
c0037c80:	54 6a cf fe	rlwinm  r10,r3,25,31,31
c0037c84:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c88:	2f 8a 00 00	cmpwi   cr7,r10,0
c0037c8c:	70 66 00 40	andi.   r6,r3,64
c0037c90:	61 28 00 f0	ori     r8,r9,240
c0037c94:	7d 29 40 9e	iseleq  r9,r9,r8
c0037c98:	41 9e 00 08	beq-    cr7,c0037ca0 <mtfsf+0x74>
c0037c9c:	61 29 00 0f	ori     r9,r9,15
...

However, your other solutions are better.

	mask = (FM & 1);
	mask |= (FM << 3) & 0x10;
	mask |= (FM << 6) & 0x100;
	mask |= (FM << 9) & 0x1000;
	mask |= (FM << 12) & 0x10000;
	mask |= (FM << 15) & 0x100000;
	mask |= (FM << 18) & 0x1000000;
	mask |= (FM << 21) & 0x10000000;
	mask *= 15;

should do the job, in less code space and without a single branch. Each one of the mask |= lines should be translated into an rlwinm instruction followed by an or. Actually it should be possible to transform each of these lines into a single rlwimi instruction, but I don't know how to coerce gcc to reach this level of optimization. Another way of optimizing this could be:

	mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	mask *= 15;

Ok, I finally edited my sources and test compiled the suggestions I gave. I'd say that method2 is the best overall indeed. You can actually save one more instruction by setting mask to all ones in the case FM=0xff, but that's about all in this area. My measurements show method1 to be smaller and faster than method2 due to the number of instructions needed to generate the constant masks in method2, but it may depend upon your compiler and hardware. Both are faster than the original with isel. Ok, if you have measured that method1 is faster than method2, let us go for it. I believe method2 would be faster if you had a large out-of-order execution window, because more parallelism can be extracted from it, but
RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
However, your other solutions are better.

	mask = (FM & 1);
	mask |= (FM << 3) & 0x10;
	mask |= (FM << 6) & 0x100;
	mask |= (FM << 9) & 0x1000;
	mask |= (FM << 12) & 0x10000;
	mask |= (FM << 15) & 0x100000;
	mask |= (FM << 18) & 0x1000000;
	mask |= (FM << 21) & 0x10000000;
	mask *= 15;

should do the job, in less code space and without a single branch. ... Another way of optimizing this could be:

	mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	mask *= 15;

... Ok, if you have measured that method1 is faster than method2, let us go for it. I believe method2 would be faster if you had a large out-of-order execution window, because more parallelism can be extracted from it, but this is probably only true for high end cores, which do not need FPU emulation in the first place. FWIW the second has a long dependency chain on 'mask', whereas the first can execute the shifts/ands in any order and then merge the results. So on most superscalar cpus, or ones with result delays for arithmetic, the first is likely to be faster. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
On Mon, Feb 10, 2014 at 11:17:38AM +0000, David Laight wrote: However, your other solutions are better.

	mask = (FM & 1);
	mask |= (FM << 3) & 0x10;
	mask |= (FM << 6) & 0x100;
	mask |= (FM << 9) & 0x1000;
	mask |= (FM << 12) & 0x10000;
	mask |= (FM << 15) & 0x100000;
	mask |= (FM << 18) & 0x1000000;
	mask |= (FM << 21) & 0x10000000;
	mask *= 15;

should do the job, in less code space and without a single branch. ... Another way of optimizing this could be:

	mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	mask *= 15;

... Ok, if you have measured that method1 is faster than method2, let us go for it. I believe method2 would be faster if you had a large out-of-order execution window, because more parallelism can be extracted from it, but this is probably only true for high end cores, which do not need FPU emulation in the first place. FWIW the second has a long dependency chain on 'mask', whereas the first can execute the shifts/ands in any order and then merge the results. So on most superscalar cpus, or ones with result delays for arithmetic, the first is likely to be faster.
I disagree, perhaps mostly because the compiler is not clever enough, but right now the code for solution 1 is (actually I have rewritten the code and it reads:

	mask = (FM & 1) | ((FM << 3) & 0x10) |
	       ((FM << 6) & 0x100) | ((FM << 9) & 0x1000) |
	       ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000) |
	       ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);

to avoid sequence points in case they hamper the compiler) and the output is:

	rlwinm 10,3,3,27,27	 # D.11621, FM,,
	rlwinm 9,3,6,23,23	 # D.11621, FM,,
	or 9,10,9	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,0,31,31	 # D.11621, FM,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,9,19,19	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,12,15,15	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,15,11,11	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,18,7,7	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 3,3,21,3,3	 # D.11621, FM,,
	or 9,9,3	 #, mask, D.11621, D.11621
	mulli 9,9,15	 # mask, mask,

See that r9 is used 7 times as both input and output operand, plus once for rlwinm. This gives a dependency length of 8 at least. In the other case (I've deleted the code) the dependency length was significantly shorter. In any case that one is fewer instructions, which is good for occasional use. Gabriel
RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
I disagree, perhaps mostly because the compiler is not clever enough, but right now the code for solution 1 is (actually I have rewritten the code and it reads:

	mask = (FM & 1) | ((FM << 3) & 0x10) |
	       ((FM << 6) & 0x100) | ((FM << 9) & 0x1000) |
	       ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000) |
	       ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);

to avoid sequence points in case they hamper the compiler) and the output is:

	rlwinm 10,3,3,27,27	 # D.11621, FM,,
	rlwinm 9,3,6,23,23	 # D.11621, FM,,
	or 9,10,9	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,0,31,31	 # D.11621, FM,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,9,19,19	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,12,15,15	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,15,11,11	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,18,7,7	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 3,3,21,3,3	 # D.11621, FM,,
	or 9,9,3	 #, mask, D.11621, D.11621
	mulli 9,9,15	 # mask, mask,

See that r9 is used 7 times as both input and output operand, plus once for rlwinm. This gives a dependency length of 8 at least. In the other case (I've deleted the code) the dependency length was significantly shorter. In any case that one is fewer instructions, which is good for occasional use. Hmmm... I hand-counted a dependency length of 8 for the other version. Maybe there are some ppc instructions that reduce it. Stupid compiler :-) Trouble is, I bet that even if you code it as:

	mask1 = (FM & 1) | ((FM << 3) & 0x10);
	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |= mask2;
	mask3 |= mask4;
	mask = mask1 | mask3;

the compiler will 'optimise' it to the above before code generation. If it doesn't, adding () to pair the |'s might be enough. Then a new version of the compiler will change the behaviour again. David
Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
On Mon, Feb 10, 2014 at 12:32:18PM +0000, David Laight wrote: I disagree, perhaps mostly because the compiler is not clever enough, but right now the code for solution 1 is (actually I have rewritten the code and it reads:

	mask = (FM & 1) | ((FM << 3) & 0x10) |
	       ((FM << 6) & 0x100) | ((FM << 9) & 0x1000) |
	       ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000) |
	       ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);

to avoid sequence points in case they hamper the compiler) and the output is:

	rlwinm 10,3,3,27,27	 # D.11621, FM,,
	rlwinm 9,3,6,23,23	 # D.11621, FM,,
	or 9,10,9	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,0,31,31	 # D.11621, FM,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,9,19,19	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,12,15,15	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,15,11,11	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 10,3,18,7,7	 # D.11621, FM,,
	or 9,9,10	 #, D.11621, D.11621, D.11621
	rlwinm 3,3,21,3,3	 # D.11621, FM,,
	or 9,9,3	 #, mask, D.11621, D.11621
	mulli 9,9,15	 # mask, mask,

See that r9 is used 7 times as both input and output operand, plus once for rlwinm. This gives a dependency length of 8 at least. In the other case (I've deleted the code) the dependency length was significantly shorter. In any case that one is fewer instructions, which is good for occasional use. Hmmm... I hand-counted a dependency length of 8 for the other version. Maybe there are some ppc instructions that reduce it. Either I misread the generated code or I got somewhat less. What helps for method1 is the rotate-and-mask instructions of PPC: each left shift and mask becomes a single rlwinm. Stupid compiler :-) Indeed. I've tried to coerce it into generating rlwimi instructions (in which case the whole building of the mask reduces to 8 assembly instructions) and failed. It seems that the compiler lacks some patterns that would directly map to rlwimi.
Trouble is, I bet that even if you code it as:

	mask1 = (FM & 1) | ((FM << 3) & 0x10);
	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |= mask2;
	mask3 |= mask4;
	mask = mask1 | mask3;

the compiler will 'optimise' it to the above before code generation. Indeed, it's what it does :-( I believe that the current suggestion is good enough. Gabriel
Re: [PATCH v2] powerpc ticket locks
On Mon, Feb 10, 2014 at 02:10:23PM +1100, Benjamin Herrenschmidt wrote: On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:

 typedef struct {
-	volatile unsigned int slock;
-} arch_spinlock_t;
+	union {
+		__ticketpair_t head_tail;
+		struct __raw_tickets {
+#ifdef __BIG_ENDIAN__	/* The tail part should be in the MSBs */
+			__ticket_t tail, head;
+#else
+			__ticket_t head, tail;
+#endif
+		} tickets;
+	};
+#if defined(CONFIG_PPC_SPLPAR)
+	u32 holder;
+#endif
+} arch_spinlock_t __aligned(4);

That's still broken with lockref (which we just merged). We must have the arch_spinlock_t and the ref in the same 64-bit word, otherwise it will break. Well, as far as I can see you'll just not be able to USE_CMPXCHG_LOCKREF -- with the appropriate performance hit -- the code just falls back into lockref on pSeries. What again was the intention of directed yield in the first place...? We can make it work in theory since the holder doesn't have to be accessed atomically, but the practicals are a complete mess ... lockref would essentially have to re-implement the holder handling of the spinlocks and use lower level ticket stuff. Unless you can find a sneaky trick ... :-( What if I squeeze the bits a little? 4k vCPUs, and 256 physical, as a limit to stay within 32 bits? At the cost that unlock may become an ll/sc operation again. I could think about a trick against that. But alas, hw_cpu_id is 16 bit, which makes a lookup table necessary :-/ Doing another round of yields for lockrefs now doesn't sound so bad any more. Opinions, anyone? Torsten
Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
On Sun, 9 Feb 2014, Stephen N Chivers wrote: James Yang james.y...@freescale.com wrote on 02/08/2014 07:49:40 AM: From: James Yang james.y...@freescale.com To: Gabriel Paubert paub...@iram.es Cc: Stephen N Chivers schiv...@csc.com.au, Chris Proctor cproc...@csc.com.au, linuxppc-dev@lists.ozlabs.org Date: 02/08/2014 07:49 AM Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask? On Fri, 7 Feb 2014, Gabriel Paubert wrote: Hi Stephen, On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote: Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM: From: Gabriel Paubert paub...@iram.es To: Stephen N Chivers schiv...@csc.com.au Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au Date: 02/06/2014 07:26 PM Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask? On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote: With the above mask computation I get consistent results for both the MPC8548 and MPC7410 boards. Am I missing something subtle? No, I think you are correct. This said, this code may probably be optimized to eliminate a lot of the conditional branches. I think that: If the compiler is allowed to generate isel instructions, it would not use a conditional branch for this code. (Ignore the andi. values, this is an old compile.) From limited research, the 440GP is a processor that doesn't implement the isel instruction, and it does not implement floating point. The kernel emulates isel, and so using that instruction for the 440GP would have a double trap penalty. Are you writing about something outside the scope of this thread? The OP was using an MPC8548, not a 440GP. The compiler should not be using or targeting 8548 for a 440GP, so having to emulate isel shouldn't be an issue because there wouldn't be any. (The assembly listing I posted was generated by gcc targeting 8548.) Anyway, I had measured the non-isel routines to be faster, and they work without illop traps.
Correct me if I am wrong, the isel instruction first appears in PowerPC ISA v2.04 around mid 2007. isel appeared in 2003 in the e500 (v1) core that is in the MPC8540. The instruction is in Power ISA 2.03 (9/2006).
Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
On Mon, 10 Feb 2014, Gabriel Paubert wrote: On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote: On Fri, 7 Feb 2014, Gabriel Paubert wrote: Hi Stephen, On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote: Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM: From: Gabriel Paubert paub...@iram.es To: Stephen N Chivers schiv...@csc.com.au Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au Date: 02/06/2014 07:26 PM Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask? On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:

	mask = 0;
	if (FM & (1 << 0))
		mask |= 0x0000000f;
	if (FM & (1 << 1))
		mask |= 0x000000f0;
	if (FM & (1 << 2))
		mask |= 0x00000f00;
	if (FM & (1 << 3))
		mask |= 0x0000f000;
	if (FM & (1 << 4))
		mask |= 0x000f0000;
	if (FM & (1 << 5))
		mask |= 0x00f00000;
	if (FM & (1 << 6))
		mask |= 0x0f000000;
	if (FM & (1 << 7))
		mask |= 0x90000000;

With the above mask computation I get consistent results for both the MPC8548 and MPC7410 boards. Ok, if you have measured that method1 is faster than method2, let us go for it. I believe method2 would be faster if you had a large out-of-order execution window, because more parallelism can be extracted from it, but this is probably only true for high end cores, which do not need FPU emulation in the first place. Yeah, 8548 can issue 2 SFX instructions per cycle, which is what the compiler generated. The other place where we can optimize is the generation of FEX.
Here is my current patch:

diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
index dbce92e..b57b3fa8 100644
--- a/arch/powerpc/math-emu/mtfsf.c
+++ b/arch/powerpc/math-emu/mtfsf.c
@@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;
 
-	if (FM == 0)
+	if (likely(FM == 0xff))
+		mask = 0xffffffff;
+	else if (unlikely(FM == 0))
 		return 0;
-
-	if (FM == 0xff)
-		mask = 0x9fffffff;
 	else {
-		mask = 0;
-		if (FM & (1 << 0))
-			mask |= 0x90000000;
-		if (FM & (1 << 1))
-			mask |= 0x0f000000;
-		if (FM & (1 << 2))
-			mask |= 0x00f00000;
-		if (FM & (1 << 3))
-			mask |= 0x000f0000;
-		if (FM & (1 << 4))
-			mask |= 0x0000f000;
-		if (FM & (1 << 5))
-			mask |= 0x00000f00;
-		if (FM & (1 << 6))
-			mask |= 0x000000f0;
-		if (FM & (1 << 7))
-			mask |= 0x0000000f;
+		mask = (FM & 1);
+		mask |= (FM << 3) & 0x10;
+		mask |= (FM << 6) & 0x100;
+		mask |= (FM << 9) & 0x1000;
+		mask |= (FM << 12) & 0x10000;
+		mask |= (FM << 15) & 0x100000;
+		mask |= (FM << 18) & 0x1000000;
+		mask |= (FM << 21) & 0x10000000;
+		mask *= 15;

Needs to also mask out bits 1 and 2, they aren't to be set from frB:

		mask &= 0x9FFFFFFF;

 	}
 
-	__FPU_FPSCR &= ~(mask);
-	__FPU_FPSCR |= (frB[1] & mask);
+	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+		~(FPSCR_VX | FPSCR_FEX);
 
-	__FPU_FPSCR &= ~(FPSCR_VX);
-	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-		__FPU_FPSCR |= FPSCR_VX;
-
-	fpscr = __FPU_FPSCR;
-	fpscr &= ~(FPSCR_FEX);
-	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-		fpscr |= FPSCR_FEX;
+		fpscr |= FPSCR_VX;
+
+	/* The bit order of exception enables and exception status
+	 * is the same. Simply shift and mask to check for enabled
+	 * exceptions.
+	 */
+	if (fpscr & (fpscr >> 22) & 0xf8)
+		fpscr |= FPSCR_FEX;
 	__FPU_FPSCR = fpscr;
 
 #ifdef DEBUG

 mtfsf.c | 57 ++++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 35 deletions(-)

Notes: 1) I'm really unsure on whether 0xff is frequent or not. So the
Re: [PATCH v2] powerpc ticket locks
On Mon, Feb 10, 2014 at 04:52:17PM +0100, Torsten Duwe wrote: Opinions, anyone? Since the holder thing is a performance thing, not a correctness thing; one thing you could do is something like:

static const int OWNER_HASH_SIZE = CONFIG_NR_CPUS * 4;
static const int OWNER_HASH_BITS = ilog2(OWNER_HASH_SIZE);

u16 lock_owner_array[OWNER_HASH_SIZE] = { 0, };

void set_owner(struct arch_spinlock_t *lock, int owner)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	lock_owner_array[hash] = owner;
}

void yield_to_owner(struct arch_spinlock_t *lock)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	int owner = lock_owner_array[hash];
	yield_to_cpu(owner);
}

And call set_owner() after the ticket lock is acquired, and don't bother clearing it again; a new acquire will overwrite, a collision we have to live with. It should on average get you the right yield and does away with having to track the owner field in place. It does however get you an extra cacheline miss on acquire :/ One could consider patching it out when you know your kernel is not running on an overloaded partition.
Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
Hi Hou, On Thu, Jan 23, 2014 at 03:29:41AM +0000, b48...@freescale.com wrote: -----Original Message----- From: Brian Norris [mailto:computersforpe...@gmail.com] Sent: Thursday, January 23, 2014 10:12 AM To: Hou Zhiqiang-B48286 Cc: linux-...@lists.infradead.org; linuxppc-...@ozlabs.org; Wood Scott-B07421; Hu Mingkai-B21284; Ezequiel Garcia Subject: Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed On Mon, Jan 06, 2014 at 02:34:29PM +0800, Hou Zhiqiang wrote:

--- a/drivers/mtd/devices/m25p80.c
+++ b/drivers/mtd/devices/m25p80.c
@@ -1012,7 +1012,8 @@ static int m25p_probe(struct spi_device *spi)
 	if (data && data->name)
 		flash->mtd.name = data->name;
 	else
-		flash->mtd.name = dev_name(&spi->dev);
+		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
+				id->name, spi->chip_select);

Changing the mtd.name may have far-reaching consequences for users who already have mtdparts= command lines. But your concern is probably valid for dynamically-determined bus numbers. Perhaps you can edit this patch to only change the name when the busnum is dynamically-allocated? It's a good idea, but in the case of an mtd_info name that is dynamically-allocated, using mtdparts=... on the command line is obviously unworkable. I agree that users should never have relied on the dynamically-allocated name. But changing the name for non-dynamic schemes (e.g., where the SPI busnum is a fixed value) is not pleasant for users. Would you tell me what side-effect will be brought by the change of mtd_info's name? I can only think of the mtdparts= command line. Otherwise, I don't think the name is very important. Brian
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
Hi Christoph, On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote: Here is a draft of a patch to make this work with memoryless nodes. The first thing is that we modify node_match to also match if we hit an empty node. In that case we simply take the current slab if it's there. If there is no current slab then a regular allocation occurs with the memoryless node. The page allocator will fall back to a possible node and that will become the current slab. The next alloc from a memoryless node will then use that slab. For that we also add some tracking of allocations on nodes that were not satisfied, using the empty_node[] array. A successful alloc on a node clears that flag. I would rather avoid the empty_node[] array since it's global and there may be thread specific allocation restrictions, but it would be expensive to do an allocation attempt via the page allocator to make sure that there is really no page available from the page allocator. With this patch on our test system (I pulled out the numa_mem_id() change, since you Acked Joonsoo's already), on top of 3.13.0 + my kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                          mem     objs    slabs
                                    active   active
kmalloc-16384              207 MB   98.60%  100.00%
task_struct                134 MB   97.82%  100.00%
kmalloc-8192               117 MB  100.00%  100.00%
pgtable-2^12               111 MB  100.00%  100.00%
pgtable-2^10               104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series: meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                          mem     objs    slabs
                                    active   active
kmalloc-16384              227 MB   98.63%  100.00%
kmalloc-8192               130 MB  100.00%  100.00%
task_struct                129 MB   97.73%  100.00%
pgtable-2^12               112 MB  100.00%  100.00%
pgtable-2^10               106 MB  100.00%  100.00%

Consider this patch: Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Tested-by: Nishanth Aravamudan n...@linux.vnet.ibm.com I was thinking about your concerns about empty_node[]. Would it make sense to use a helper function, rather than direct access to empty_node, such as:

	bool is_node_empty(int nid)
	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop respectively? That way only architectures that have memoryless nodes pay the penalty of the array allocation? Thanks, Nish

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }
 
+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;
 
 	BUG_ON(flags & GFP_SLAB_BUG_MASK);
 
 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}
 
 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 	struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ?
numa_mem_id() : node;
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed
On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote: To give a spi flash layout using mtdparts=... on the cmdline, we must give mtd_info a fixed name, because the cmdlinepart parser will match the name given on the cmdline with the mtd_info. Now, if an OF node is used, mtd_info's name will be the spi device name. It consists of the spi_master->bus_num, and the spi_master->bus_num may be dynamically fetched. So, give the mtd_info a new fixed name "name.cs": name is the name of the spi_device_id and cs is the chip-select in the spi_dev. Signed-off-by: Hou Zhiqiang b48...@freescale.com --- v2: - add check for return value of function kasprintf. - whether the spi_master->bus_num is dynamical is determined by the spi controller driver, and it can't be checked in this driver. So, we can not initialize the mtd_info's name by distinguishing whether the spi_master bus_num is dynamically-allocated or not. How about spi->master->bus_num < 0 ?

 drivers/mtd/devices/m25p80.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
index eb558e8..1f494d2 100644
--- a/drivers/mtd/devices/m25p80.c
+++ b/drivers/mtd/devices/m25p80.c
@@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
 
 	if (data && data->name)
 		flash->mtd.name = data->name;
-	else
-		flash->mtd.name = dev_name(&spi->dev);
+	else {
+		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
+				id->name, spi->chip_select);

I don't think this name is specific enough. What if there is more than one SPI controller? Then there could be another chip with the same chip-select. You probably still need to incorporate the SPI master somehow, even if it's not by using the bus number directly (because it's dynamic).

+		if (!flash->mtd.name)
+			return -ENOMEM;
+	}
 
 	flash->mtd.type = MTD_NORFLASH;
 	flash->mtd.writesize = 1;

Brian
[RFT][PATCH 04/12] drivers/macintosh/adb: change platform power management to use dev_pm_ops
Change the adb platform driver to register pm ops using dev_pm_ops instead of legacy pm_ops. The .pm hooks call the existing legacy suspend and resume interfaces by passing in the right pm state. Signed-off-by: Shuah Khan shuah...@samsung.com ---

 drivers/macintosh/adb.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 04a5049..df6b201 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -263,7 +263,7 @@ adb_reset_bus(void)
 /*
  * notify clients before sleep
  */
-static int adb_suspend(struct platform_device *dev, pm_message_t state)
+static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
 	adb_got_sleep = 1;
 	/* We need to get a lock on the probe thread */
@@ -276,10 +276,25 @@ static int adb_suspend(struct platform_device *dev, pm_message_t state)
 	return 0;
 }
 
+static int adb_suspend(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_SUSPEND);
+}
+
+static int adb_freeze(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_FREEZE);
+}
+
+static int adb_poweroff(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_HIBERNATE);
+}
+
 /*
  * reset bus after sleep
  */
-static int adb_resume(struct platform_device *dev)
+static int __adb_resume(struct platform_device *dev)
 {
 	adb_got_sleep = 0;
 	up(&adb_probe_mutex);
@@ -287,6 +302,11 @@ static int adb_resume(struct platform_device *dev)
 	return 0;
 }
+
+static int adb_resume(struct device *dev)
+{
+	return __adb_resume(to_platform_device(dev));
+}
 #endif /* CONFIG_PM */
 
 static int __init adb_init(void)
@@ -831,14 +851,25 @@ static const struct file_operations adb_fops = {
 	.release	= adb_release,
 };
 
+#ifdef CONFIG_PM
+static const struct dev_pm_ops adb_dev_pm_ops = {
+	.suspend	= adb_suspend,
+	.resume		= adb_resume,
+	/* Hibernate hooks */
+	.freeze		= adb_freeze,
+	.thaw		= adb_resume,
+	.poweroff	= adb_poweroff,
+	.restore	= adb_resume,
+};
+#endif
+
 static struct
platform_driver adb_pfdrv = {
 	.driver = {
 		.name	= "adb",
-	},
 #ifdef CONFIG_PM
-	.suspend	= adb_suspend,
-	.resume		= adb_resume,
+		.pm	= &adb_dev_pm_ops,
 #endif
+	},
 };
 
 static struct platform_device adb_pfdev = {
-- 
1.7.10.4
[RFT][PATCH 00/12] change drivers power management to use dev_pm_ops
Change drivers to register pm ops using dev_pm_ops instead of legacy pm_ops. The .pm hooks call the existing legacy suspend and resume interfaces by passing in the right pm state. Bus drivers' suspend and resume routines call the .pm driver hooks if found. Shuah Khan (12): arm: change locomo platform and bus power management to use dev_pm_ops arm: change sa1111 platform and bus power management to use dev_pm_ops arm: change scoop platform power management to use dev_pm_ops drivers/macintosh/adb: change platform power management to use dev_pm_ops mmc: change au1xmmc platform power management to use dev_pm_ops mmc: change bfin_sdh platform power management to use dev_pm_ops isa: change isa bus power management to use dev_pm_ops mmc: change cb710-mmc platform power management to use dev_pm_ops mmc: change msm_sdcc platform power management to use dev_pm_ops mmc: change tmio_mmc platform power management to use dev_pm_ops drivers/pcmcia: change ds driver power management to use dev_pm_ops drivers/s390/crypto: change ap_bus driver power management to use dev_pm_ops

 arch/arm/common/locomo.c     | 93 +++---
 arch/arm/common/sa1111.c     | 88 +++
 arch/arm/common/scoop.c      | 44
 drivers/base/isa.c           | 30 --
 drivers/macintosh/adb.c      | 41 ---
 drivers/mmc/host/au1xmmc.c   | 43 +++
 drivers/mmc/host/bfin_sdh.c  | 40 +++---
 drivers/mmc/host/cb710-mmc.c | 37 +++--
 drivers/mmc/host/msm_sdcc.c  | 42 +++
 drivers/mmc/host/tmio_mmc.c  | 42 +++
 drivers/pcmcia/ds.c          | 36 +---
 drivers/s390/crypto/ap_bus.c | 30 --
 12 files changed, 481 insertions(+), 85 deletions(-)

-- 
1.7.10.4
Re: [PATCH] Handle vmalloc addresses
Hi Nathan !

Please do a better submission :-) Your subject is, to be honest, crap. Something like:

  [PATCH] crypto/nx/nx-842: Fix handling of vmalloc addresses

would have been much more informative. Additionally, this is a patch for drivers/crypto, and while that driver is powerpc-specific, meaning I *could* take that patch, it should at least be CCed to the crypto list/maintainer, since that would be the normal path for such a patch to be applied.

I'm taking it this time around, but I know you can do better !

Cheers,
Ben.

On Wed, 2014-01-29 at 10:34 -0600, Nathan Fontenot wrote:

The nx-842 compression driver does not currently handle getting a physical address for vmalloc addresses. The current driver uses __pa() for all addresses, which does not properly handle vmalloc addresses and thus causes a failure since we do not pass a proper physical address to phyp. This patch adds a routine to convert an address to a physical address by checking for vmalloc addresses and handling them properly.

Signed-off-by: Nathan Fontenot nf...@linux.vnet.ibm.com
---
 drivers/crypto/nx/nx-842.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

Index: linux/drivers/crypto/nx/nx-842.c
===================================================================
--- linux.orig/drivers/crypto/nx/nx-842.c	2014-01-22 08:52:55.000000000 -0600
+++ linux/drivers/crypto/nx/nx-842.c	2014-01-29 08:25:33.000000000 -0600
@@ -158,6 +158,15 @@
 	return sl->entry_nr * sizeof(struct nx842_slentry);
 }

+static inline unsigned long nx842_get_pa(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return page_to_phys(vmalloc_to_page(addr))
+			+ offset_in_page(addr);
+	else
+		return __pa(addr);
+}
+
 static int nx842_build_scatterlist(unsigned long buf, int len,
 			struct nx842_scatterlist *sl)
 {
@@ -168,7 +177,7 @@
 	entry = sl->entries;
 	while (len) {
-		entry->ptr = __pa(buf);
+		entry->ptr = nx842_get_pa((void *)buf);
 		nextpage = ALIGN(buf + 1, NX842_HW_PAGE_SIZE);
 		if (nextpage < buf + len) {
 			/* we aren't at the end yet */
@@ -370,8 +379,8 @@
 	op.flags = NX842_OP_COMPRESS;
 	csbcpb = &workmem->csbcpb;
 	memset(csbcpb, 0, sizeof(*csbcpb));
-	op.csbcpb = __pa(csbcpb);
-	op.out = __pa(slout.entries);
+	op.csbcpb = nx842_get_pa(csbcpb);
+	op.out = nx842_get_pa(slout.entries);

 	for (i = 0; i < hdr->blocks_nr; i++) {
 		/*
@@ -401,13 +410,13 @@
 		 */
 		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
 			/* Create direct DDE */
-			op.in = __pa(inbuf);
+			op.in = nx842_get_pa((void *)inbuf);
 			op.inlen = max_sync_size;
 		} else {
 			/* Create indirect DDE (scatterlist) */
 			nx842_build_scatterlist(inbuf, max_sync_size, &slin);
-			op.in = __pa(slin.entries);
+			op.in = nx842_get_pa(slin.entries);
 			op.inlen = nx842_get_scatterlist_size(&slin);
 		}
@@ -565,7 +574,7 @@
 	op.flags = NX842_OP_DECOMPRESS;
 	csbcpb = &workmem->csbcpb;
 	memset(csbcpb, 0, sizeof(*csbcpb));
-	op.csbcpb = __pa(csbcpb);
+	op.csbcpb = nx842_get_pa(csbcpb);

 	/*
 	 * max_sync_size may have changed since compression,
@@ -597,12 +606,12 @@
 		if (likely((inbuf & NX842_HW_PAGE_MASK) ==
			((inbuf + hdr->sizes[i] - 1) & NX842_HW_PAGE_MASK))) {
 			/* Create direct DDE */
-			op.in = __pa(inbuf);
+			op.in = nx842_get_pa((void *)inbuf);
 			op.inlen = hdr->sizes[i];
 		} else {
 			/* Create indirect DDE (scatterlist) */
 			nx842_build_scatterlist(inbuf, hdr->sizes[i], &slin);
-			op.in = __pa(slin.entries);
+			op.in = nx842_get_pa(slin.entries);
 			op.inlen = nx842_get_scatterlist_size(&slin);
 		}
@@ -613,12 +622,12 @@
 		 */
 		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
 			/* Create direct DDE */
-			op.out = __pa(outbuf);
+			op.out = nx842_get_pa((void *)outbuf);
 			op.outlen = max_sync_size;
 		} else {
 			/* Create indirect DDE (scatterlist) */
 			nx842_build_scatterlist(outbuf, max_sync_size, &slout);
-			op.out = __pa(slout.entries);
+			op.out = nx842_get_pa(slout.entries);
 			op.outlen = nx842_get_scatterlist_size(&slout);
 		}
[PATCH] powerpc/powernv: Add iommu DMA bypass support for IODA2
This patch adds support for creating a direct IOMMU bypass window on IODA2 bridges (such as POWER8), allowing 64-bit DMA capable devices to bypass IOMMU page translation completely, thus significantly improving DMA performance.

Additionally, this adds a hook to struct iommu_table so that the IOMMU API / VFIO can disable the bypass when external ownership is requested, since in that case the device will be used by an environment such as userspace or a KVM guest which must not be allowed to bypass translations.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---

This is pretty much identical to the old version I posted a while ago, except that it does have the iommu calls to enable/disable which I had forgotten to git add before posting the previous one.

 arch/powerpc/include/asm/dma-mapping.h    |  1 +
 arch/powerpc/include/asm/iommu.h          |  1 +
 arch/powerpc/kernel/dma.c                 | 10 ++--
 arch/powerpc/kernel/iommu.c               | 12 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 +++
 arch/powerpc/platforms/powernv/pci.c      | 10
 arch/powerpc/platforms/powernv/pci.h      |  6 ++-
 arch/powerpc/platforms/powernv/powernv.h  |  4 ++
 arch/powerpc/platforms/powernv/setup.c    |  9
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index e27e9ad..150866b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -134,6 +134,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }

 extern int dma_set_mask(struct device *dev, u64 dma_mask);
+extern int __dma_set_mask(struct device *dev, u64 dma_mask);

 #define dma_alloc_coherent(d,s,h,f)	dma_alloc_attrs(d,s,h,f,NULL)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f7a8036..42632c7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,6 +77,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };

 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 8032b97..ee78f6e 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -191,12 +191,10 @@ EXPORT_SYMBOL(dma_direct_ops);

 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)

-int dma_set_mask(struct device *dev, u64 dma_mask)
+int __dma_set_mask(struct device *dev, u64 dma_mask)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);

-	if (ppc_md.dma_set_mask)
-		return ppc_md.dma_set_mask(dev, dma_mask);
 	if ((dma_ops != NULL) && (dma_ops->set_dma_mask != NULL))
 		return dma_ops->set_dma_mask(dev, dma_mask);
 	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -204,6 +202,12 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 	*dev->dma_mask = dma_mask;
 	return 0;
 }
+int dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (ppc_md.dma_set_mask)
+		return ppc_md.dma_set_mask(dev, dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
 EXPORT_SYMBOL(dma_set_mask);

 u64 dma_get_required_mask(struct device *dev)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..88e3ec6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,14 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);

+	/*
+	 * Disable iommu bypass, otherwise the user can DMA to all of
+	 * our physical memory via the bypass window instead of just
+	 * the pages that have been explicitly mapped into the iommu
+	 */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, false);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1102,6 +1110,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	/* The kernel owns the device now, we can restore the iommu bypass */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7d6dcc6..3b2b4fb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/memblock.h>

 #include <asm/sections.h>
 #include <asm/io.h>
@@ -460,9 +461,39 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 		return;
Re: [PATCH v2] powerpc ticket locks
(Linus, Al, a question for you down there about the lockref ref size)

On Mon, 2014-02-10 at 16:52 +0100, Torsten Duwe wrote:
> What if I squeeze the bits a little? 4k vCPUs, and 256 physical, as a
> limit to stay within 32 bits? At the cost that unlock may become an
> ll/sc operation again. I could think about a trick against that. But
> alas, hw_cpu_id is 16 bit, which makes a lookup table necessary :-/
> Doing another round of yields for lockrefs now doesn't sound so bad
> any more.

So, the ticket pair has to be 16:16 so we can avoid the atomic on unlock. That leaves us with 32 bits to put the ref and the owner. The question is how big the ref really has to be, and can we have a reasonable failure mode if it overflows ?

If we limit ourselves to, for example, 16 bits for the ref in lockref, then we can have the second 32-bit half split between the owner and the ref. If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...

So the question is, is it reasonable to have the ref smaller than 32-bit...

Cheers,
Ben.
Re: [PATCH v2] powerpc ticket locks
On Tue, Feb 11, 2014 at 01:44:20PM +1100, Benjamin Herrenschmidt wrote:
> That leaves us with 32 bits to put the ref and the owner. The question
> is how big the ref really has to be and can we have a reasonable
> failure mode if it overflows ? If we limit ourselves to, for example,
> 16-bit for the ref in lockref, then we can have the second 32-bit split
> between the owner and the ref. If we limit ourselves to 4k CPUs, then
> we get 4 more bits of ref ... So the question is, is it reasonable to
> have the ref smaller than 32-bit...

Every time you open a file, you bump dentry refcount. Something like libc or ld.so will be opened on just about every execve(), so I'd say that 16 bits is far too low. If nothing else, 32 bits might be too low on 64bit boxen...
Re: [PATCH v2] powerpc ticket locks
On Tue, 2014-02-11 at 02:56 +0000, Al Viro wrote:
> > So the question is, is it reasonable to have the ref smaller than
> > 32-bit...
>
> Every time you open a file, you bump dentry refcount. Something like
> libc or ld.so will be opened on just about every execve(), so I'd say
> that 16 bits is far too low. If nothing else, 32 bits might be too low
> on 64bit boxen...

So back to square 1 ... we can't implement lockref, ticket locks and our lock confer mechanism together within 64 bits.

I see two options at this stage. Both require a custom implementation of lockref for powerpc, so some ifdef's such that we can replace the generic implementation completely.

 - We can use a small ref and, when it's too big, overflow into a larger one, falling back to the old style lock + ref (an overflow bit or a compare with )

 - We can have lockref build its own lock out of the ticket pair and ref, keeping the owner in a separate word. The owner doesn't strictly need to be atomic.

Both are gross though :( Anybody has a better idea ?

Ben.
[PATCH] powerpc/spufs: Remove MAX_USER_PRIO define
Current ppc64_defconfig fails with:

 arch/powerpc/platforms/cell/spufs/sched.c:86:0: error: "MAX_USER_PRIO" redefined [-Werror]
 cc1: all warnings being treated as errors

6b6350f1 introduced a generic MAX_USER_PRIO macro to sched/prio.h, which is causing the conflict. Use that one instead of our own.

Signed-off-by: Jeremy Kerr j...@ozlabs.org

---
Ingo: 6b6350f1 is currently in tip; this fixes a build breakage for spufs
---
 arch/powerpc/platforms/cell/spufs/sched.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4931838..4a0a64f 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -83,7 +83,6 @@ static struct timer_list spuloadavg_timer;
 #define MIN_SPU_TIMESLICE	max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
 #define DEF_SPU_TIMESLICE	(100 * HZ / (1000 * SPUSCHED_TICK))

-#define MAX_USER_PRIO		(MAX_PRIO - MAX_RT_PRIO)
 #define SCALE_PRIO(x, prio) \
	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_SPU_TIMESLICE)
[PATCH v1 0/2] powernv: cpufreq support for IBM POWERNV platform
Hi,

The following patch series implements the platform driver to support dynamic CPU frequency scaling on IBM POWERNV platforms. This patch series is based on Linux kernel 3.14-rc2 and tested on OPAL v3 based IBM POWERNV platform and IBM POWER8 processor.

--Vaidy

---

Srivatsa S. Bhat (1):
      powernv, cpufreq: Add per-core locking to serialize frequency transitions

Vaidyanathan Srinivasan (1):
      powernv: cpufreq driver for powernv platform

 arch/powerpc/include/asm/reg.h    |   4 +
 drivers/cpufreq/Kconfig.powerpc   |   9 +
 drivers/cpufreq/Makefile          |   1
 drivers/cpufreq/powernv-cpufreq.c | 286 +
 4 files changed, 300 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c
--
[PATCH v1 2/2] powernv, cpufreq: Add per-core locking to serialize frequency transitions
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

On POWER systems, the CPU frequency is controlled at a core level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands.

The design of per-core locking is very simple and straightforward: we first define a per-CPU lock and use the one that belongs to the first thread sibling of the core. The cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 drivers/cpufreq/powernv-cpufreq.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ea3b630..8240e90 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -24,8 +24,15 @@
 #include <linux/of.h>
 #include <asm/cputhreads.h>

-/* FIXME: Make this per-core */
-static DEFINE_MUTEX(freq_switch_mutex);
+/* Per-core locking for frequency transitions */
+static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
+
+#define lock_core_freq(cpu)	\
+	mutex_lock(&per_cpu(freq_switch_lock,	\
+		cpu_first_thread_sibling(cpu)));
+#define unlock_core_freq(cpu)	\
+	mutex_unlock(&per_cpu(freq_switch_lock,	\
+		cpu_first_thread_sibling(cpu)));

 #define POWERNV_MAX_PSTATES	256

@@ -219,7 +226,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	freqs.new = powernv_freqs[new_index].frequency;
 	freqs.cpu = policy->cpu;

-	mutex_lock(&freq_switch_mutex);
+	lock_core_freq(policy->cpu);
 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);

 	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
@@ -231,7 +238,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
 	rc = powernv_set_freq(policy->cpus, new_index);

 	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
-	mutex_unlock(&freq_switch_mutex);
+	unlock_core_freq(policy->cpu);

 	return rc;
 }
@@ -248,7 +255,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int __init powernv_cpufreq_init(void)
 {
-	int rc = 0;
+	int cpu, rc = 0;

 	/* Discover pstates from device tree and init */
@@ -258,6 +265,10 @@ static int __init powernv_cpufreq_init(void)
 		pr_info("powernv-cpufreq disabled\n");
 		return rc;
 	}
+	/* Init per-core mutex */
+	for_each_possible_cpu(cpu) {
+		mutex_init(&per_cpu(freq_switch_lock, cpu));
+	}

 	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
 	return rc;
[PATCH v1 1/2] powernv: cpufreq driver for powernv platform
Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState.

This driver works in conjunction with cpufreq governors like 'ondemand' to provide demand-based frequency and voltage setting on IBM POWER non-virtualized platforms.

The PState table is obtained from OPAL v3 firmware through the device tree. The powernv_cpufreq backend driver parses the relevant device-tree nodes and initialises the cpufreq subsystem on the powernv platform.

Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Anton Blanchard an...@samba.org
---
 arch/powerpc/include/asm/reg.h    |   4 +
 drivers/cpufreq/Kconfig.powerpc   |   9 +
 drivers/cpufreq/Makefile          |   1
 drivers/cpufreq/powernv-cpufreq.c | 275 +
 4 files changed, 289 insertions(+)
 create mode 100644 drivers/cpufreq/powernv-cpufreq.c

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 90c06ec..84f92ca 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -271,6 +271,10 @@
 #define SPRN_HSRR1	0x13B	/* Hypervisor Save/Restore 1 */
 #define SPRN_IC		0x350	/* Virtual Instruction Count */
 #define SPRN_VTB	0x351	/* Virtual Time Base */
+#define SPRN_PMICR	0x354	/* Power Management Idle Control Reg */
+#define SPRN_PMSR	0x355	/* Power Management Status Reg */
+#define SPRN_PMCR	0x374	/* Power Management Control Register */
+
 /* HFSCR and FSCR bit numbers are the same */
 #define FSCR_TAR_LG	8	/* Enable Target Address Register */
 #define FSCR_EBB_LG	7	/* Enable Event Based Branching */
diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc
index ca0021a..4a91ab1 100644
--- a/drivers/cpufreq/Kconfig.powerpc
+++ b/drivers/cpufreq/Kconfig.powerpc
@@ -54,3 +54,12 @@ config PPC_PASEMI_CPUFREQ
 	help
 	  This adds the support for frequency switching on PA Semi
 	  PWRficient processors.
+
+config POWERNV_CPUFREQ
+	tristate "CPU frequency scaling for IBM POWERNV platform"
+	depends on PPC_POWERNV
+	select CPU_FREQ_TABLE
+	default y
+	help
+	  This adds support for CPU frequency switching on IBM POWERNV
+	  platform
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 7494565..0dbb963 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ)	+= ppc-corenet-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC)		+= pmac32-cpufreq.o
 obj-$(CONFIG_CPU_FREQ_PMAC64)		+= pmac64-cpufreq.o
 obj-$(CONFIG_PPC_PASEMI_CPUFREQ)	+= pasemi-cpufreq.o
+obj-$(CONFIG_POWERNV_CPUFREQ)		+= powernv-cpufreq.o

 ##
 # Other platform drivers
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
new file mode 100644
index 0000000..ea3b630
--- /dev/null
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -0,0 +1,275 @@
+/*
+ * POWERNV cpufreq driver for the IBM POWER processors
+ *
+ * (C) Copyright IBM 2014
+ *
+ * Author: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"powernv-cpufreq: " fmt
+
+#include <linux/module.h>
+#include <linux/cpufreq.h>
+#include <linux/of.h>
+#include <asm/cputhreads.h>
+
+/* FIXME: Make this per-core */
+static DEFINE_MUTEX(freq_switch_mutex);
+
+#define POWERNV_MAX_PSTATES	256
+
+static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+/*
+ * Initialize the freq table based on data obtained
+ * from the firmware passed via device-tree
+ */
+static int init_powernv_pstates(void)
+{
+	struct device_node *power_mgt;
+	int nr_pstates = 0;
+	int pstate_min, pstate_max, pstate_nominal;
+	const __be32 *pstate_ids, *pstate_freqs;
+	int i;
+	u32 len_ids, len_freqs;
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("power-mgt node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
+		pr_warn("ibm,pstate-min node not found\n");
+		return -ENODEV;
+	}
+
+	if (of_property_read_u32(power_mgt,
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
>
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> >
> > The first thing is that we modify node_match to also match if we hit
> > an empty node. In that case we simply take the current slab if its
> > there. If there is no current slab then a regular allocation occurs
> > with the memoryless node. The page allocator will fallback to a
> > possible node and that will become the current slab. Next alloc from
> > a memoryless node will then use that slab.
> >
> > For that we also add some tracking of allocations on nodes that were
> > not satisfied using the empty_node[] array. A successful alloc on a
> > node clears that flag.
> >
> > I would rather avoid the empty_node[] array since its global and
> > there may be thread specific allocation restrictions but it would be
> > expensive to do an allocation attempt via the page allocator to make
> > sure that there is really no page available from the page allocator.
>
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1):
>
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
>
> And Anton's slabusage reports:
>
> slab             mem used   objs active   slabs active
> kmalloc-16384      207 MB        98.60%        100.00%
> task_struct        134 MB        97.82%        100.00%
> kmalloc-8192       117 MB       100.00%        100.00%
> pgtable-2^12       111 MB       100.00%        100.00%
> pgtable-2^10       104 MB       100.00%        100.00%
>
> For comparison, Anton's patch applied at the same point in the series:
>
> meminfo:
>
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
>
> slabusage:
>
> slab             mem used   objs active   slabs active
> kmalloc-16384      227 MB        98.63%        100.00%
> kmalloc-8192       130 MB       100.00%        100.00%
> task_struct        129 MB        97.73%        100.00%
> pgtable-2^12       112 MB       100.00%        100.00%
> pgtable-2^10       106 MB       100.00%        100.00%
>
> Consider this patch:
>
> Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
> Tested-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

Hello,

I still think that there is another problem. Your report about CONFIG_SLAB said that SLAB uses just 200MB. Below is your previous report:

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The numbers on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB. There is a large difference in slab usage.

And, I should note that the number of active objects in slabinfo can be wrong in some situations, since it doesn't consider cpu slabs (and cpu partial slabs). I recommend confirming page_to_nid() and other things as I mentioned earlier.

Thanks.
Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
Hi James,

On Mon, Feb 10, 2014 at 11:03:07AM -0600, James Yang wrote:

[snipped]

> > Ok, if you have measured that method1 is faster than method2, let us
> > go for it. I believe method2 would be faster if you had a large
> > out-of-order execution window, because more parallelism can be
> > extracted from it, but this is probably only true for high end cores,
> > which do not need FPU emulation in the first place.
>
> Yeah, 8548 can issue 2 SFX instructions per cycle, which is what the
> compiler generated.

Then it is method1. The other place where we can optimize is the generation of FEX. Here is my current patch:

diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
index dbce92e..b57b3fa8 100644
--- a/arch/powerpc/math-emu/mtfsf.c
+++ b/arch/powerpc/math-emu/mtfsf.c
@@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;

-	if (FM == 0)
+	if (likely(FM == 0xff))
+		mask = 0xffffffff;
+	else if (unlikely(FM == 0))
 		return 0;
-
-	if (FM == 0xff)
-		mask = 0x9fffffff;
 	else {
-		mask = 0;
-		if (FM & (1 << 0))
-			mask |= 0x90000000;
-		if (FM & (1 << 1))
-			mask |= 0x0f000000;
-		if (FM & (1 << 2))
-			mask |= 0x00f00000;
-		if (FM & (1 << 3))
-			mask |= 0x000f0000;
-		if (FM & (1 << 4))
-			mask |= 0x0000f000;
-		if (FM & (1 << 5))
-			mask |= 0x00000f00;
-		if (FM & (1 << 6))
-			mask |= 0x000000f0;
-		if (FM & (1 << 7))
-			mask |= 0x0000000f;
+		mask = (FM & 1);
+		mask |= (FM << 3) & 0x10;
+		mask |= (FM << 6) & 0x100;
+		mask |= (FM << 9) & 0x1000;
+		mask |= (FM << 12) & 0x10000;
+		mask |= (FM << 15) & 0x100000;
+		mask |= (FM << 18) & 0x1000000;
+		mask |= (FM << 21) & 0x10000000;
+		mask *= 15;

> Needs to also mask out bits 1 and 2, they aren't to be set from frB:
>
> 		mask &= 0x9FFFFFFF;

Look at the following lines:

 	}

-	__FPU_FPSCR &= ~(mask);
-	__FPU_FPSCR |= (frB[1] & mask);
+	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+		~(FPSCR_VX | FPSCR_FEX);

It's here (masking FPSCR_VX and FPSCR_FEX). Actually the previous code was redundant: it cleared FEX and VX in the mask computation and later again when recomputing them. Clearing them once should be enough.

-	__FPU_FPSCR &= ~(FPSCR_VX);
-	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-		__FPU_FPSCR |= FPSCR_VX;
-
-	fpscr = __FPU_FPSCR;
-	fpscr &= ~(FPSCR_FEX);
-	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-		fpscr |= FPSCR_FEX;
+		fpscr |= FPSCR_VX;
+
+	/* The bit order of exception enables and exception status
+	 * is the same. Simply shift and mask to check for enabled
+	 * exceptions.
+	 */
+	if (fpscr & (fpscr >> 22) & 0xf8)
+		fpscr |= FPSCR_FEX;

 	__FPU_FPSCR = fpscr;

 #ifdef DEBUG

 mtfsf.c | 57 ++---
 1 file changed, 22 insertions(+), 35 deletions(-)

Notes:

1) I'm really unsure on whether 0xff is frequent or not, so the likely() statement at the beginning may be wrong. Actually, if it is not very likely, it might be better to remove the special case for FM==0xff.

> A look at GCC sources shows that it never generates a mask of 0xff.
> From glibc sources, the vast majority of cases use 0x1; only isnan()
> uses 0xff. Can't handle all cases here.

That's why I would go for the simplest possible code. Conditionals are expensive, and minimizing cache footprint is often the best measure of performance for infrequently used code. With this in mind, I would get rid of all the tests for special FM values and rely on the optimized generic case.

2) It may be better to remove the check for FM==0; after all, the instruction effectively becomes a nop, and generating the instruction in the first place would be too stupid for words.

> Hmm, a heavy no-op. I wonder if it is heavier than a sync.

In theory not. It contains the equivalent of several isync (taking an exception and returning from it), but not any synchronization wrt the memory accesses.
Gabriel