Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-23 Thread Gabriel Paubert
On Thu, Apr 22, 2021 at 06:26:16PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Apr 23, 2021 at 12:16:18AM +0200, Gabriel Paubert wrote:
> > On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote:
> > > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> > > > Sathvika Vasireddy  writes:
> > > > > + if ((regs->ccr) & (1 << (31 - ra)))
> > > > > + op->val = -1;
> > > > > + else if ((regs->ccr) & (1 << (30 - ra)))
> > > > > + op->val = 1;
> > > > > + else
> > > > > + op->val = 0;
> > > 
> > > It imo is clearer if written
> > > 
> > >   if ((regs->ccr << ra) & 0x80000000)
> > >   op->val = -1;
> > >   else if ((regs->ccr << ra) & 0x40000000)
> > >   op->val = 1;
> > >   else
> > >   op->val = 0;
> > > 
> > > but I guess not everyone agrees :-)
> > 
> > But this can be made jump free :-):
> > 
> > int tmp = regs->ccr << ra;
> > op->val = (tmp >> 31) | ((tmp >> 30) & 1);
> 
> The compiler will do so automatically (or think of some better way to
> get the same result); in source code, what matters most is readability,
> or clarity in general (also clarity to the compiler).

I just did a test (trivial code attached) and the original code always
produces one conditional branch at -O2, at least with the cross-compiler
I have on Debian (gcc 8.3). I have tested both -m32 and -m64. The 64 bit
version produces an unnecessary "extsw", so I wrote the second version
splitting the setting of the return value which gets rid of it.

The second "if" is fairly simple to optimize and the compiler does it
properly.

Of course with my suggestion the compiler does not produce any branch. 
But it needs a really good comment.
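Something along these lines maybe (a sketch of the needed comment, not a
tested patch):

	/*
	 * Shift the selected CR field up to the top nibble: bit 31
	 * of tmp is now its LT bit, bit 30 its GT bit.
	 * tmp >> 31 is an arithmetic shift: -1 if LT is set, else 0.
	 * (tmp >> 30) & 1 is 1 if GT is set, else 0.
	 * When LT is set, -1 | x is -1 whatever x is, so the OR
	 * implements the setb priority: -1 if LT, else 1 if GT, else 0.
	 */
	int tmp = regs->ccr << ra;
	op->val = (tmp >> 31) | ((tmp >> 30) & 1);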


> 
> (Right shifts of negative numbers are implementation-defined in C,
> fwiw -- but work like you expect in GCC).

Well, I'm not worried about it, since I'd expect a compiler that does
logical right shifts on signed values to break so much code that it
would be easily noticed (also in the kernel).
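For the record, a two-line illustration of the difference:

	int s = -4;
	int arith = s >> 1;		/* -2 with GCC: arithmetic shift (srawi) */
	int logic = (unsigned)s >> 1;	/* 0x7ffffffe: logical shift (srwi) */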


> 
> > (IIRC the srawi instruction sign-extends its result to 64 bits).
> 
> If you consider it to work on 32-bit inputs, yeah, that is a simple way
> to express it.
> 
> > > > CR field:      7    6    5    4    3    2    1    0
> > > > bit:           0123 0123 0123 0123 0123 0123 0123 0123
> > > > normal bit #:  0 ............................... 31
> > > > ibm bit #:     31 ............................... 0
> > > 
> > > The bit numbers in CR fields are *always* numbered left-to-right.  I
> > > have never seen anyone use LE for it, anyway.
> > > 
> > > Also, even people who write LE have the bigger end on the left normally
> > > (they just write some things right-to-left, and other things
> > > left-to-right).
> > 
> > Around 1985, I had documentation for National's 32032
> > (little-endian) processor family, and all the instruction encodings were
> > presented with the LSB on the left and MSB on the right.
> 
> Ouch!  Did they write "regular" numbers with the least significant digit
> on the left as well?

No, they were not that sadistic!

At least instructions were a whole number of bytes, unlike the iAPX432
where jumps needed to encode target addresses down to the bit level.

> 
> > BTW on these processors, the immediate operands and the offsets (1, 2 or
> > 4 bytes) for the addressing modes were encoded in big-endian byte order,
> > but I digress. Consistency is overrated ;-)
> 
> Inconsistency is the spice of life, yeah :-)

;-)

Gabriel

long setb_cond(int cr, int shift)
{
	long ret;
	if ((cr << shift) & 0x80000000)
		ret = -1;
	else if ((cr << shift) & 0x40000000)
		ret = 1;
	else
		ret = 0;
	return ret;
}

long setb_uncond(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;
	ret = (tmp >> 31) | ((tmp >> 30) & 1);
	return ret;
}

long setb_uncond2(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;
	ret = (tmp >> 31);
	ret |= ((tmp >> 30) & 1);
	return ret;
}


Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-22 Thread Gabriel Paubert


Hi,

On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> > Sathvika Vasireddy  writes:
> > Ok, if I've understood correctly...
> > 
> > > + ra = ra & ~0x3;
> > 
> > This masks off the bits of RA that are not part of BTF:
> > 
> > ra is in [0, 31] which is [0b00000, 0b11111]
> > Then ~0x3 = ~0b00011
> > ra = ra & 0b11100
> > 
> > This gives us then,
> > ra = btf << 2; or
> > btf = ra >> 2;
> 
> Yes.  In effect, you want the offset in bits of the CR field, which is
> just fine like this.  But a comment would not hurt.
> 
> > Let's then check to see if your calculations read the right fields.
> > 
> > > + if ((regs->ccr) & (1 << (31 - ra)))
> > > + op->val = -1;
> > > + else if ((regs->ccr) & (1 << (30 - ra)))
> > > + op->val = 1;
> > > + else
> > > + op->val = 0;
> 
> It imo is clearer if written
> 
>   if ((regs->ccr << ra) & 0x80000000)
>   op->val = -1;
>   else if ((regs->ccr << ra) & 0x40000000)
>   op->val = 1;
>   else
>   op->val = 0;
> 
> but I guess not everyone agrees :-)
> 

But this can be made jump free :-):

int tmp = regs->ccr << ra;
op->val = (tmp >> 31) | ((tmp >> 30) & 1);

(IIRC the srawi instruction sign-extends its result to 64 bits).



> > CR field:      7    6    5    4    3    2    1    0
> > bit:           0123 0123 0123 0123 0123 0123 0123 0123
> > normal bit #:  0 ............................... 31
> > ibm bit #:     31 ............................... 0
> 
> The bit numbers in CR fields are *always* numbered left-to-right.  I
> have never seen anyone use LE for it, anyway.
> 
> Also, even people who write LE have the bigger end on the left normally
> (they just write some things right-to-left, and other things
> left-to-right).

Around 1985, I had documentation for National's 32032
(little-endian) processor family, and all the instruction encodings were
presented with the LSB on the left and MSB on the right.

BTW on these processors, the immediate operands and the offsets (1, 2 or
4 bytes) for the addressing modes were encoded in big-endian byte order,
but I digress. Consistency is overrated ;-)

Gabriel


> 
> > Checkpatch does have one complaint:
> > 
> > CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'regs->ccr'
> > #30: FILE: arch/powerpc/lib/sstep.c:1971:
> > +   if ((regs->ccr) & (1 << (31 - ra)))
> > 
> > I don't really mind the parentheses: I think you are safe to ignore
> > checkpatch here unless someone else complains :)
> 
> I find them annoying.  If there are too many parentheses, it is hard to
> see at a glance what groups where.  Also, a suspicious reader might
> think there is something special going on (with macros for example).
> 
> This is simple code of course, but :-)
> 
> > If you do end up respinning the patch, I think it would be good to make
> > the maths a bit clearer. I think it works because a left shift of 2 is
> > the same as multiplying by 4, but it would be easier to follow if you
> > used a temporary variable for btf.
> 
> It is very simple.  The BFA instruction field is closely related to the
> BI instruction field, which is 5 bits, and selects one of the 32 bits in
> the CR.  If you have "BFA00 BFA01 BFA10 BFA11", that gives the bit
> numbers of all four bits in the selected CR field.  So the "& ~3" does
> all you need.  It is quite pretty :-)
> 
> 
> Segher
 



Re: [PATCH] powerpc/traps: Declare unrecoverable_exception() as __noreturn

2021-02-10 Thread Gabriel Paubert
On Thu, Feb 11, 2021 at 06:34:43AM +, Christophe Leroy wrote:
> unrecoverable_exception() is never expected to return, most callers
> have an infinite loop in case it returns.
> 
> Ensure it really never returns by terminating it with a BUG(), and
> declare it __noreturn.
> 
> It always GCC to really simplify functions calling it. In the example below,

s/always/allows ?

(Otherwise I can't parse it.)

> it avoids the stack frame in the likely fast path and avoids code duplication
> for the exit.

Indeed, nice code generation improvement.
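For readers following along, the effect in a nutshell (an illustrative
sketch with made-up names, not the kernel code itself):

	void die(void) __attribute__((noreturn));

	int f(int x)
	{
		if (x & 1)
			die();	/* GCC knows this never comes back, so the
				 * fast path below needs no stack frame and
				 * no saved LR for the call. */
		return x >> 1;
	}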

> 
> With this patch:
> 
>   0348 :
>348:   81 43 00 84 lwz r10,132(r3)
>34c:   71 48 00 02 andi.   r8,r10,2
>350:   41 82 00 2c beq 37c 
>354:   71 4a 40 00 andi.   r10,r10,16384
>358:   40 82 00 20 bne 378 
>35c:   80 62 00 70 lwz r3,112(r2)
>360:   74 63 00 01 andis.  r3,r3,1
>364:   40 82 00 28 bne 38c 
>368:   7d 40 00 a6 mfmsr   r10
>36c:   7c 11 13 a6 mtspr   81,r0
>370:   7c 12 13 a6 mtspr   82,r0
>374:   4e 80 00 20 blr
>378:   48 00 00 00 b   378 

Infinite loop (seems to be on test of MSR_PR)?

Gabriel

>37c:   94 21 ff f0 stwu    r1,-16(r1)
>380:   7c 08 02 a6 mflr    r0
>384:   90 01 00 14 stw r0,20(r1)
>388:   48 00 00 01 bl  388 
>   388: R_PPC_REL24    unrecoverable_exception
>38c:   38 e2 00 70 addir7,r2,112
>390:   3d 00 00 01 lis r8,1
>394:   7c c0 38 28 lwarx   r6,0,r7
>398:   7c c6 40 78 andc    r6,r6,r8
>39c:   7c c0 39 2d stwcx.  r6,0,r7
>3a0:   40 a2 ff f4 bne 394 
>3a4:   38 60 00 01 li  r3,1
>3a8:   4b ff ff c0 b   368 
> 
> Without this patch:
> 
>   0348 :
>348:   94 21 ff f0 stwu    r1,-16(r1)
>34c:   93 e1 00 0c stw r31,12(r1)
>350:   7c 7f 1b 78 mr  r31,r3
>354:   81 23 00 84 lwz r9,132(r3)
>358:   71 2a 00 02 andi.   r10,r9,2
>35c:   41 82 00 34 beq 390 
>360:   71 29 40 00 andi.   r9,r9,16384
>364:   40 82 00 28 bne 38c 
>368:   80 62 00 70 lwz r3,112(r2)
>36c:   74 63 00 01 andis.  r3,r3,1
>370:   40 82 00 3c bne 3ac 
>374:   7d 20 00 a6 mfmsr   r9
>378:   7c 11 13 a6 mtspr   81,r0
>37c:   7c 12 13 a6 mtspr   82,r0
>380:   83 e1 00 0c lwz r31,12(r1)
>384:   38 21 00 10 addir1,r1,16
>388:   4e 80 00 20 blr
>38c:   48 00 00 00 b   38c 
>390:   7c 08 02 a6 mflr    r0
>394:   90 01 00 14 stw r0,20(r1)
>398:   48 00 00 01 bl  398 
>   398: R_PPC_REL24    unrecoverable_exception
>39c:   80 01 00 14 lwz r0,20(r1)
>3a0:   81 3f 00 84 lwz r9,132(r31)
>3a4:   7c 08 03 a6 mtlr    r0
>3a8:   4b ff ff b8 b   360 
>3ac:   39 02 00 70 addir8,r2,112
>3b0:   3d 40 00 01 lis r10,1
>3b4:   7c e0 40 28 lwarx   r7,0,r8
>3b8:   7c e7 50 78 andc    r7,r7,r10
>3bc:   7c e0 41 2d stwcx.  r7,0,r8
>3c0:   40 a2 ff f4 bne 3b4 
>3c4:   38 60 00 01 li  r3,1
>3c8:   4b ff ff ac b   374 
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/interrupt.h | 2 +-
>  arch/powerpc/kernel/interrupt.c  | 1 -
>  arch/powerpc/kernel/traps.c  | 2 ++
>  3 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index dcff30e3919b..fa8bfb91f8df 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -411,7 +411,7 @@ DECLARE_INTERRUPT_HANDLER(altivec_assist_exception);
>  DECLARE_INTERRUPT_HANDLER(CacheLockingException);
>  DECLARE_INTERRUPT_HANDLER(SPEFloatingPointException);
>  DECLARE_INTERRUPT_HANDLER(SPEFloatingPointRoundException);
> -DECLARE_INTERRUPT_HANDLER(unrecoverable_exception);
> +DECLARE_INTERRUPT_HANDLER(unrecoverable_exception) __noreturn;
>  DECLARE_INTERRUPT_HANDLER(WatchdogException);
>  DECLARE_INTERRUPT_HANDLER(kernel_bad_stack);
>  
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index eca3be36c18c..7e7106641ca9 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -440,7 +440,6 @@ notrace unsigned long interrupt_exit_user_prepare(struct 
> pt_regs *regs, unsigned
>   return ret;
>  }
>  
> -void unrecoverable_exception(struct pt_regs *regs);
>  void preempt_schedule_irq(void);
>  
>  notrace unsigned long 

Re: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-01 Thread Gabriel Paubert
On Mon, Feb 01, 2021 at 09:55:44AM -0600, Christopher M. Riedl wrote:
> On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote:
> > From: Christopher M. Riedl
> > > Sent: 28 January 2021 04:04
> > > 
> > > Reuse the "safe" implementation from signal.c except for calling
> > > unsafe_copy_from_user() to copy into a local buffer.
> > > 
> > > Signed-off-by: Christopher M. Riedl 
> > > ---
> > >  arch/powerpc/kernel/signal.h | 33 +
> > >  1 file changed, 33 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> > > index 2559a681536e..c18402d625f1 100644
> > > --- a/arch/powerpc/kernel/signal.h
> > > +++ b/arch/powerpc/kernel/signal.h
> > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> > > *task, void __user *from);
> > >   [i], label);\
> > >  } while (0)
> > > 
> > > +#define unsafe_copy_fpr_from_user(task, from, label) do {
> > > \
> > > + struct task_struct *__t = task; \
> > > + u64 __user *__f = (u64 __user *)from;   \
> > > + u64 buf[ELF_NFPREG];\
> >
> > How big is that buffer?
> > Isn't it likely to be reasonably large compared to a reasonable
> > kernel stack frame?
> > Especially since this isn't even a leaf function.
> >
> 
> I think Christophe answered this - I don't really have an opinion either
> way. What would be a 'reasonable' kernel stack frame for reference?

See include/linux/poll.h, where the limit is of the order of 800 bytes
and the number of entries in an on-stack array is chosen at compile time
(different between 32 and 64 bit for example).

The values are used in do_sys_poll, which, with almost 1000 bytes of
stack footprint, appears close to the top of "make checkstack". In
addition do_sys_poll has to call the ->poll function of every file
descriptor in its table, so it is not a leaf function.

This 264-byte array looks reasonable, but please use 'make checkstack'
to verify that the function's total stack usage stays within reason.
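The sizing pattern over there is roughly this (a sketch with made-up
names, not the actual poll.h macros):

	/* Pick the element count so the on-stack array fits a fixed
	 * byte budget, chosen per word size at compile time. */
	#define ON_STACK_BYTES	(sizeof(long) == 8 ? 832 : 768)
	#define N_ON_STACK	(ON_STACK_BYTES / sizeof(u64))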

Gabriel

> 
> > > + int i;  \
> > > + \
> > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\
> >
> > That really ought to be sizeof(buf).
> >
> 
> Agreed, I will fix this. Thanks!
> 
> > David
> >
> >
> > > + label); \
> > > + for (i = 0; i < ELF_NFPREG - 1; i++)\
> > > + __t->thread.TS_FPR(i) = buf[i]; \
> > > + __t->thread.fp_state.fpscr = buf[i];\
> > > +} while (0)
> > > +
> > > +#define unsafe_copy_vsx_from_user(task, from, label) do {
> > > \
> > > + struct task_struct *__t = task; \
> > > + u64 __user *__f = (u64 __user *)from;   \
> > > + u64 buf[ELF_NVSRHALFREG];   \
> > > + int i;  \
> > > + \
> > > + unsafe_copy_from_user(buf, __f, \
> > > + ELF_NVSRHALFREG * sizeof(double),   \
> > > + label); \
> > > + for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> > > + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];  \
> > > +} while (0)
> > > +
> > > +
> > >  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > >  #define unsafe_copy_ckfpr_to_user(to, task, label)   do {
> > > \
> > >   struct task_struct *__t = task; \
> > > @@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> > > *task, void __user *from);
> > >   unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
> > >   ELF_NFPREG * sizeof(double), label)
> > > 
> > > +#define unsafe_copy_fpr_from_user(task, from, label) 
> > > \
> > > + unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\
> > > + ELF_NFPREG * sizeof(double), label)
> > > +
> > >  static inline unsigned long
> > >  copy_fpr_to_user(void __user *to, struct task_struct *task)
> > >  {
> > > @@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void 
> > > __user *from)
> > >  #else
> > >  #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
> > > 
> > > +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
> > > +
> > >  static inline unsigned long
> > >  copy_fpr_to_user(void __user *to, struct task_struct *task)
> > >  {
> > > --
> > > 2.26.1
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
> > MK1 1PT, UK
> > 

Re: [PATCH] powerpc: add compile-time support for lbarx, lwarx

2020-11-08 Thread Gabriel Paubert
On Sat, Nov 07, 2020 at 05:42:57AM -0600, Segher Boessenkool wrote:
> On Sat, Nov 07, 2020 at 08:12:13AM +0100, Gabriel Paubert wrote:
> > On Sat, Nov 07, 2020 at 01:23:28PM +1000, Nicholas Piggin wrote:
> > > ISA v2.06 (POWER7 and up) as well as e6500 support lbarx and lwarx.
> > 
> > Hmm, lwarx exists since original Power AFAIR,
> 
> Almost: it was new on PowerPC.

I stand corrected. Does this mean that Power1 (and 2 I believe) had 
no SMP support?

Gabriel
 



Re: [PATCH] powerpc: add compile-time support for lbarx, lwarx

2020-11-06 Thread Gabriel Paubert
On Sat, Nov 07, 2020 at 01:23:28PM +1000, Nicholas Piggin wrote:
> ISA v2.06 (POWER7 and up) as well as e6500 support lbarx and lwarx.

Hmm, lwarx exists since original Power AFAIR, s/lwarx/lharx/ perhaps?

Same for the title of the patch and the CONFIG variable.

Gabriel

> Add a compile option that allows code to use it, and add support in
> cmpxchg and xchg 8 and 16 bit values.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/Kconfig   |   3 +
>  arch/powerpc/include/asm/cmpxchg.h | 236 -
>  arch/powerpc/platforms/Kconfig.cputype |   5 +
>  3 files changed, 243 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index e9f13fe08492..d231af06f75a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -266,6 +266,9 @@ config PPC_BARRIER_NOSPEC
>   default y
>   depends on PPC_BOOK3S_64 || PPC_FSL_BOOK3E
>  
> +config PPC_LBARX_LWARX
> + bool
> +
>  config EARLY_PRINTK
>   bool
>   default y
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index cf091c4c22e5..17fd996dc0d4 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -77,10 +77,76 @@ u32 __cmpxchg_##type##sfx(volatile void *p, u32 old, u32 
> new) \
>   * the previous value stored there.
>   */
>  
> +#ifndef CONFIG_PPC_LBARX_LWARX
>  XCHG_GEN(u8, _local, "memory");
>  XCHG_GEN(u8, _relaxed, "cc");
>  XCHG_GEN(u16, _local, "memory");
>  XCHG_GEN(u16, _relaxed, "cc");
> +#else
> +static __always_inline unsigned long
> +__xchg_u8_local(volatile void *p, unsigned long val)
> +{
> + unsigned long prev;
> +
> + __asm__ __volatile__(
> +"1:  lbarx   %0,0,%2 \n"
> +"stbcx.  %3,0,%2 \n\
> + bne-1b"
> + : "=" (prev), "+m" (*(volatile unsigned char *)p)
> + : "r" (p), "r" (val)
> + : "cc", "memory");
> +
> + return prev;
> +}
> +
> +static __always_inline unsigned long
> +__xchg_u8_relaxed(u8 *p, unsigned long val)
> +{
> + unsigned long prev;
> +
> + __asm__ __volatile__(
> +"1:  lbarx   %0,0,%2\n"
> +"stbcx.  %3,0,%2\n"
> +"bne-1b"
> + : "=" (prev), "+m" (*p)
> + : "r" (p), "r" (val)
> + : "cc");
> +
> + return prev;
> +}
> +
> +static __always_inline unsigned long
> +__xchg_u16_local(volatile void *p, unsigned long val)
> +{
> + unsigned long prev;
> +
> + __asm__ __volatile__(
> +"1:  lharx   %0,0,%2 \n"
> +"sthcx.  %3,0,%2 \n\
> + bne-1b"
> + : "=" (prev), "+m" (*(volatile unsigned short *)p)
> + : "r" (p), "r" (val)
> + : "cc", "memory");
> +
> + return prev;
> +}
> +
> +static __always_inline unsigned long
> +__xchg_u16_relaxed(u16 *p, unsigned long val)
> +{
> + unsigned long prev;
> +
> + __asm__ __volatile__(
> +"1:  lharx   %0,0,%2\n"
> +"sthcx.  %3,0,%2\n"
> +"bne-1b"
> + : "=" (prev), "+m" (*p)
> + : "r" (p), "r" (val)
> + : "cc");
> +
> + return prev;
> +}
> +#endif
>  
>  static __always_inline unsigned long
>  __xchg_u32_local(volatile void *p, unsigned long val)
> @@ -198,11 +264,12 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int 
> size)
>   (__typeof__(*(ptr))) __xchg_relaxed((ptr),  \
>   (unsigned long)_x_, sizeof(*(ptr)));\
>  })
> +
>  /*
>   * Compare and exchange - if *p == old, set it to new,
>   * and return the old value of *p.
>   */
> -
> +#ifndef CONFIG_PPC_LBARX_LWARX
>  CMPXCHG_GEN(u8, , PPC_ATOMIC_ENTRY_BARRIER, PPC_ATOMIC_EXIT_BARRIER, 
> "memory");
>  CMPXCHG_GEN(u8, _local, , , "memory");
>  CMPXCHG_GEN(u8, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
> @@ -211,6 +278,173 @@ CMPXCHG_GEN(u16, , PPC_ATOMIC_ENTRY_BARRIER, 
> PPC_ATOMIC_EXIT_BARRIER, "memory");
>  CMPXCHG_GEN(u16, _local, , , "memory");
>  CMPXCHG_GEN(u16, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
>  CMPXCHG_GEN(u16, _relaxed, , , "cc");
> +#else
> +static __always_inline unsigned long
> +__cmpxchg_u8(volatile unsigned char *p, unsigned long old, unsigned long new)
> +{
> + unsigned int prev;
> +
> + __asm__ __volatile__ (
> + PPC_ATOMIC_ENTRY_BARRIER
> +"1:  lbarx   %0,0,%2 # __cmpxchg_u8\n\
> + cmpw    0,%0,%3\n\
> + bne-    2f\n"
> +"stbcx.  %4,0,%2\n\
> + bne-    1b"
> + PPC_ATOMIC_EXIT_BARRIER
> + "\n\
> +2:"
> + : "=" (prev), "+m" (*p)
> + : "r" (p), "r" (old), "r" (new)
> + : "cc", "memory");
> +
> + return prev;
> +}
> +
> +static __always_inline unsigned long
> +__cmpxchg_u8_local(volatile unsigned char *p, unsigned long old,
> + unsigned long new)
> +{
> + unsigned int prev;
> +
> + __asm__ __volatile__ (
> +"1:  lbarx   %0,0,%2 # __cmpxchg_u8\n\
> + cmpw    0,%0,%3\n\
> + bne-    2f\n"
> +"stbcx.  %4,0,%2\n\
> + bne-    1b"
> + "\n\
> +2:"
> + : "=" (prev), "+m" (*p)
> + : "r" (p), "r" 

Re: kernel since 5.6 do not boot anymore on Apple PowerBook

2020-08-28 Thread Gabriel Paubert
On Thu, Aug 27, 2020 at 12:39:06PM +0200, Christophe Leroy wrote:
> Hi,
> 
> On 27/08/2020 at 10:28, Giuseppe Sacco wrote:
> > On Thu, 27/08/2020 at 09.46 +0200, Giuseppe Sacco wrote:
> > > On Thu, 27/08/2020 at 00.28 +0200, Giuseppe Sacco wrote:
> > > > Hello Christophe,
> > > > 
> > > > On Wed, 26/08/2020 at 15.53 +0200, Christophe Leroy
> > > > wrote:
> > > > [...]
> > > > > If there is no warning, then the issue is something else, bad luck.
> > > > > 
> > > > > Could you increase the loglevel and try again both with and without
> > > > > VMAP_STACK ? Maybe we'll get more information on where it stops.
> > > > 
> > > > The problem is related to the CPU frequency changes. This is where the
> > > > system stop: cpufreq get the CPU frequency limits and then start the
> > > > default governor (performance) and then calls function
> > > > cpufreq_gov_performance_limits() that never returns.
> > > > 
> > > > Rebuilding after enabling pr_debug for cpufreq.c, I've got some more
> > > > lines of output:
> > > > 
> > > > cpufreq: setting new policy for CPU 0: 667000 - 867000 kHz
> > > > cpufreq: new min and max freqs are 667000 - 867000 kHz
> > > > cpufreq: governor switch
> > > > cpufreq: cpufreq_init_governor: for CPU 0
> > > > cpufreq: cpufreq_start_governor: for CPU 0
> > > > cpufreq: target for CPU 0: 867000 kHz, relation 1, requested 867000 kHz
> > > > cpufreq: __target_index: cpu: 0, oldfreq: 667000, new freq: 867000
> > > > cpufreq: notification 0 of frequency transition to 867000 kHz
> > > > cpufreq: saving 133328 as reference value for loops_per_jiffy; freq is 
> > > > 667000 kHz
> > > > 
> > > > no more lines are printed. I think this output only refers to the
> > > > notification sent prior to the frequency change.
> > > > 
> > > > I was thinking that selecting a governor that would run at 667mHz would
> > > > probably skip the problem. I added cpufreq.default_governor=powersave
> > > > to the command line parameters but it did not work: the selected
> > > > governor was still performance and the system crashed.
> > > 
> > > I kept following the thread and found that CPU frequency is changed in
> > > function pmu_set_cpu_speed() in file drivers/cpufreq/pmac32-cpufreq.c.
> > > As first thing, the function calls the macro preempt_disable() and this
> > > is where it stops.
> > 
> > Sorry, I made a mistake. The real problem is down, on the same
> > function, when it calls low_sleep_handler(). This is where the problem
> > probably is.
> > 
> 
> 
> Great, you spotted the problem.
> 
> I see what it is, it is in low_sleep_handler() in
> arch/powerpc/platforms/powermac/sleep.S
> 
> All critical registers are saved on the stack. At restore, they are restored
> BEFORE re-enabling the MMU (because they are needed for that). But when we have
> VMAP_STACK, the stack can hardly be accessed without the MMU enabled.
> tophys() doesn't work for virtual stack addresses.
> 
> Therefore, the low_sleep_handler() has to be reworked for using an area in
> the linear mem instead of the stack.

Actually there is already one: at the end of sleep.S there is a 4-byte
area called sleep_storage. It should be enough to move there the CPU state
which has to be restored before the MMU is enabled:
- SDR1
- BATs
- SPRG (they might stay on the stack by reordering the code unless I
  miss something)

Note that save_cpu_setup/restore_cpu_setup also use another static
storage area to save other SPRs.

Using a static area would not work on SMP, but all PPC based Apple laptops
are single core and single thread.

Gabriel




Re: [PATCH] powerpc: Update documentation of ISA versions for Power10

2020-08-25 Thread Gabriel Paubert
On Tue, Aug 25, 2020 at 09:45:07PM +1000, Jordan Niethe wrote:
> Update the CPU to ISA Version Mapping document to include Power10 and
> ISA v3.1.
> 
> Signed-off-by: Jordan Niethe 
> ---
>  Documentation/powerpc/isa-versions.rst | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/Documentation/powerpc/isa-versions.rst 
> b/Documentation/powerpc/isa-versions.rst
> index a363d8c1603c..72aff1eaaea1 100644
> --- a/Documentation/powerpc/isa-versions.rst
> +++ b/Documentation/powerpc/isa-versions.rst
> @@ -7,6 +7,7 @@ Mapping of some CPU versions to relevant ISA versions.
>  ========= ====================
>  CPU       Architecture version
>  ========= ====================
> +Power10   Power ISA v3.1
>  Power9    Power ISA v3.0B
>  Power8    Power ISA v2.07
>  Power7    Power ISA v2.06
> @@ -32,6 +33,7 @@ Key Features
>  ========== ==================
>  CPU        VMX (aka. Altivec)
>  ========== ==================
> +Power10    Yes
>  Power9     Yes
>  Power8     Yes
>  Power7     Yes
> @@ -47,6 +49,7 @@ PPC970 Yes
>  ========== ===
>  CPU        VSX
>  ========== ===
> +Power10    Yes
>  Power9     Yes
>  Power8     Yes
>  Power7     Yes
> @@ -62,6 +65,7 @@ PPC970 No
>  ========== ====================
>  CPU        Transactional Memory
>  ========== ====================
> +Power10    Yes
>  Power9     Yes (* see transactional_memory.txt)
>  Power8     Yes
>  Power7     No

Huh? 

Transactional memory has been removed from the architecture for Power10. 

Gabriel
 



Re: Re:Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero

2020-08-22 Thread Gabriel Paubert
On Sun, Aug 23, 2020 at 12:54:33AM +0800, Guohua Zhong wrote:
> >In generic version in lib/math/div64.c, there is no checking of 'base' 
> >either.
> >Do we really want to add this check in the powerpc version only ?
> 
> >The only user of __div64_32() is do_div() in 
> >include/asm-generic/div64.h. Wouldn't it be better to do the check there ?
> 
> >Christophe
> 
> Yet, I have noticed that there is no checking of 'base' in these functions.
> But I am not sure which way to check is better. As we know, the result is
> undefined when the divisor is zero. It may be good to print an error and
> dump the stack, and let the process know that the divisor is zero by
> sending SIGFPE.
> 
> diff --git a/include/asm-generic/div64.h b/include/asm-generic/div64.h
> index a3b98c86f077..161c656ee3ee 100644
> --- a/include/asm-generic/div64.h
> +++ b/include/asm-generic/div64.h
> @@ -43,6 +43,11 @@
>  # define do_div(n,base) ({ \
> uint32_t __base = (base);   \
> uint32_t __rem; \
> + if (unlikely(base == 0)) {  \
> + pr_err("do_div base=%d\n",base);\
> + dump_stack();   \
> + force_sig(SIGFPE);  \
> + }  
> 

I suspect this will generate a strong reaction. SIGFPE is for user space
instruction attempting a division by zero. A division by zero in the
kernel is a kernel bug, period, and you don't want to kill a user
process for this reason.

If it happens in an interrupt, the context of the kernel may not even be
related to the current process.

Many other architectures (x86 for example) already trigger an exception
on a division by zero but the handler will find that the exception
happened in kernel context and generate an Oops, not raise a signal in a
(possibly innocent) userland process.

Gabriel

> Then it also needs to add this checking in the functions
> div64_s64(), div64_u64(), div64_u64_rem(), div_s64_rem() and div_u64_rem()
> in include/linux/math64.h
> 
> + if (unlikely(divisor == 0)) {
> + pr_err("%s divisor=0\n",__func__);
> + dump_stack();
> + force_sig(SIGFPE);
> + }
> 
> Guohua
> 
> >>lwz r5,0(r3)    # get the dividend into r5/r6
> >>lwz r6,4(r3)
> >>cmplw   r5,r4
> >>@@ -52,6 +55,7 @@ __div64_32:
> >>  4:stw r7,0(r3)    # return the quotient in *r3
> >>stw r8,4(r3)
> >>mr  r3,r6   # return the remainder in r3
> >>+5: # return if divisor r4 is zero
> >>blr
> >>  
> >>  /*
> >>diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S
> >>index 3d5426e7dcc4..1cc9bcabf678 100644
> >>--- a/arch/powerpc/lib/div64.S
> >>+++ b/arch/powerpc/lib/div64.S
> >>@@ -13,6 +13,9 @@
> >>  #include 
> >>  
> >>  _GLOBAL(__div64_32)
> >>+   li  r9,0
> >>+   cmplw   r4,r9   # check if divisor r4 is zero
> >>+   beq 5f  # jump to label 5 if r4(divisor) is zero
> >>lwz r5,0(r3)    # get the dividend into r5/r6
> >>lwz r6,4(r3)
> >>cmplw   r5,r4
> >>@@ -52,4 +55,5 @@ _GLOBAL(__div64_32)
> >>  4:stw r7,0(r3)    # return the quotient in *r3
> >>stw r8,4(r3)
> >>mr  r3,r6   # return the remainder in r3
> >>+5: # return if divisor r4 is zero
> >>blr
> >>
> 
 



Re: [PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame

2020-07-27 Thread Gabriel Paubert
On Fri, Jul 24, 2020 at 07:25:25PM +1000, Michael Ellerman wrote:
> We have powerpc specific logic in our page fault handling to decide if
> an access to an unmapped address below the stack pointer should expand
> the stack VMA.
> 
> The code was originally added in 2004 "ported from 2.4". The rough
> logic is that the stack is allowed to grow to 1MB with no extra
> checking. Over 1MB the access must be within 2048 bytes of the stack
> pointer, or be from a user instruction that updates the stack pointer.
> 
> The 2048 byte allowance below the stack pointer is there to cover the
> 288 byte "red zone" as well as the "about 1.5kB" needed by the signal
> delivery code.
> 
> Unfortunately since then the signal frame has expanded, and is now
> 4224 bytes on 64-bit kernels with transactional memory enabled.

Are there really users of transactional memory in the wild? 

Just asking because Power10 removes TM, and Power9 has had some issues
with it AFAICT.

Getting rid of it (if possible) would result in smaller signal frames,
with simpler signal delivery code (probably slightly faster also).

Gabriel
 



Re: [PATCH] KVM: PPC: Book3S HV: Fix typos in comments

2020-03-05 Thread Gabriel Paubert
On Fri, Mar 06, 2020 at 11:26:36AM +1100, Gustavo Romero wrote:
> Fix typos found in comments about the parameter passed
> through r5 to kvmppc_{save,restore}_tm_hv functions.

Actually "iff" is a common shorthand in some fields and not necessarily
a spelling error:

https://en.wikipedia.org/wiki/If_and_only_if

Gabriel
> 
> Signed-off-by: Gustavo Romero 
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index dbc2fec..a55dbe8 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -3121,7 +3121,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
>   * Save transactional state and TM-related registers.
>   * Called with r3 pointing to the vcpu struct and r4 containing
>   * the guest MSR value.
> - * r5 is non-zero iff non-volatile register state needs to be maintained.
> + * r5 is non-zero if non-volatile register state needs to be maintained.
>   * If r5 == 0, this can modify all checkpointed registers, but
>   * restores r1 and r2 before exit.
>   */
> @@ -3194,7 +3194,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
>   * Restore transactional state and TM-related registers.
>   * Called with r3 pointing to the vcpu struct
>   * and r4 containing the guest MSR value.
> - * r5 is non-zero iff non-volatile register state needs to be maintained.
> + * r5 is non-zero if non-volatile register state needs to be maintained.
>   * This potentially modifies all checkpointed registers.
>   * It restores r1 and r2 from the PACA.
>   */
> -- 
> 1.8.3.1
> 


Re: [PosibleSpam] Re: z constraint in powerpc inline assembly ?

2020-01-16 Thread Gabriel Paubert
On Thu, Jan 16, 2020 at 07:57:29AM -0600, Segher Boessenkool wrote:
> On Thu, Jan 16, 2020 at 09:06:08AM +0100, Gabriel Paubert wrote:
> > On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> > > Hi Segher,
> > > 
> > > I'm trying to see if we could enhance TCP checksum calculations by 
> > > splitting
> > > inline assembly blocks to give GCC the opportunity to mix it with other
> > > stuff, but I'm getting difficulties with the carry.
> > > 
> > > As far as I can read in the documentation, the z constraint represents
> > > '‘XER[CA]’ carry bit (part of the XER register)'
> > 
> > Well, the documentation is very optimistic. From the GCC source code
> > (thanks for switching to git last week-end ;-)), it is clear that the
> > carry is not, for the time being, properly modeled. 
> 
> What?  It certainly *is*, I spent ages on that back in 2014 and before.
> See gcc.gnu.org/PR64180 etc.
> 
> You can not put the carry as input or output to an asm, of course: no C
> variable can be assigned to it.
> 
> We don't do the "flag outputs" thing, either, as it is largely useless
> for Power (and using it would often make *worse* code).
> 
> If you want to access a carry, write C code that does that operation.
> The compiler knows how to optimise it well.
> 
> > Right now, in the machine description, all setters and users of the carry
> > are in the same block of generated instructions.
> 
> No, they are not.  For over five years now.  (Since GCC 5).
> 
> > For a start, all single instruction patterns that set the carry (and
> > do not use it) as a side effect should mention that they clobber the 
> > carry, otherwise inserting one between a setter and a user of the carry 
> > would break.
> 
> And they do.
>

Apologies, I don't know how I could misread the .md files this badly.
Indeed I see everything now that you mention it.

I'm still a bit surprised that I have found zero "z" constraints in the
whole gcc/config/rs6000 directory. Everything seems to be CA_REGNO.

> All asms that change the carry should mention that, too, but this is
> automatically done for all inline asms, because there was a lot of code
> in the wild that does not clobber it.

I was not aware of this, anyway I would always put as correct as
possible clobbers for my inline assembly code.

> 
> > This includes all arithmetic right shifts (sra[wd]{,i}), 
> > subfic, addic{,\.} and I may have forgotten some.
> 
> {add,subf}{ic,c,e,ze,me} and sra[wd][i] and their dots.  Sure.  And
> mcrxr and mcrxrx and mfxer and mtxer.  That's about it.

Yes, but are the last ones (the moves) ever generated by the compiler?

Looking at the source (again) it seems that even lswi has disappeared.

> 
> We don't model the second carry at all yet btw, in GCC.  Not too many
> people know it exists even, so no big loss there.
> 

Anyway, I couldn't use it. I tried to buy a Talos II at work but
management made it too complex to negotiate. The problem was not the
money, but the paperwork :-(. Now my most powerful PPC machine is a 17" 
Powerbook G4.

> (One nasty was that addi. does not exist, so we used addic. where it was
> wanted before, so that had to change.)
> 
> 
> Segher


Regards,
Gabriel


Re: z constraint in powerpc inline assembly ?

2020-01-16 Thread Gabriel Paubert
On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> Hi Segher,
> 
> I'm trying to see if we could enhance TCP checksum calculations by splitting
> inline assembly blocks to give GCC the opportunity to mix it with other
> stuff, but I'm getting difficulties with the carry.
> 
> As far as I can read in the documentation, the z constraint represents
> '‘XER[CA]’ carry bit (part of the XER register)'

Well, the documentation is very optimistic. From the GCC source code
(thanks for switching to git last week-end ;-)), it is clear that the
carry is not, for the time being, properly modeled. 

Right now, in the machine description, all setters and users of the carry
are in the same block of generated instructions.

For a start, all single instruction patterns that set the carry (and
do not use it) as a side effect should mention that they clobber the 
carry, otherwise inserting one between a setter and a user of the carry 
would break. This includes all arithmetic right shifts (sra[wd]{,i}), 
subfic, addic{,\.} and I may have forgotten some.

If you want to future-proof your code just in case, you should also add
an "xer" clobber to all instruction sequences that may modify the carry
bit. But any inline assembly that touches XER might break if GCC is
upgraded to properly model the carry bit, and a lot of code might need to
be audited.
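To make this concrete (a sketch only; csum_add is a made-up name, and
note that Segher's reply elsewhere in this thread recommends the plain-C
route so the compiler handles the carry itself):

	/* Plain C: the compiler sees the carry and emits an
	 * addc/addze style sequence, no asm and no clobber needed. */
	static unsigned long csum_add(unsigned long a, unsigned long b)
	{
		unsigned long s = a + b;
		return s + (s < a);	/* fold the carry back in */
	}

	/* If asm is really wanted: keep the whole carry chain inside
	 * one block and declare the "xer" clobber explicitly. */
	static unsigned long csum_add_asm(unsigned long a, unsigned long b)
	{
		unsigned long s;
		asm("addc %0,%1,%2\n\taddze %0,%0"
		    : "=r"(s) : "r"(a), "r"(b) : "xer");
		return s;
	}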

Gabriel

> 
> I've tried the following, but I get errors. Can you help ?
> 
> unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
> {
>   unsigned long sum;
>   unsigned long carry;
> 
>   asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>   asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>   asm("addze %0, %0" : "+r"(sum) : "z"(carry));
> 
>   return sum;
> }
> 
> 
> 
> csum.c: In function 'cksum':
> csum.c:6:2: error: inconsistent operand constraints in an 'asm'
>   asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>   ^
> csum.c:7:2: error: inconsistent operand constraints in an 'asm'
>   asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>   ^
> csum.c:8:2: error: inconsistent operand constraints in an 'asm'
>   asm("addze %0, %0" : "+r"(sum) : "z"(carry));
>   ^
> 
> Thanks
> Christophe
> 


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to, from}_user helpers

2019-09-05 Thread Gabriel Paubert
On Thu, Sep 05, 2019 at 11:09:35AM +0200, Andreas Schwab wrote:
> On Sep 05 2019, Aleksa Sarai  wrote:
> 
> > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > new file mode 100644
> > index ..7301ab1bbe98
> > --- /dev/null
> > +++ b/lib/struct_user.c
> > @@ -0,0 +1,182 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (C) 2019 SUSE LLC
> > + * Copyright (C) 2019 Aleksa Sarai 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define BUFFER_SIZE 64
> > +
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +   const char zeros[BUFFER_SIZE] = {};
> 
> Perhaps make that static?

On SMP? It should at least be per cpu, and I'm not even sure
with preemption.

Gabriel

> 
> Andreas.
> 
> -- 
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."


Re: [PATCH v2 3/6] powerpc: Convert flush_icache_range & friends to C

2019-09-03 Thread Gabriel Paubert
On Tue, Sep 03, 2019 at 01:31:57PM -0500, Segher Boessenkool wrote:
> On Tue, Sep 03, 2019 at 07:05:19PM +0200, Christophe Leroy wrote:
> > On 03/09/2019 at 18:04, Segher Boessenkool wrote:
> > >(Why are they separate though?  It could just be one loop var).
> > 
> > Yes it could just be a single loop var, but in that case it would have 
> > to be reset at the start of the second loop, which means we would have 
> > to pass 'addr' for resetting the loop anyway,
> 
> Right, I noticed that after hitting send, as usual.
> 
> > so I opted to do it 
> > outside the inline asm by using two separate loop vars set to their 
> > starting value outside the inline asm.
> 
> The thing is, the way it is written now, it will get separate registers
> for each loop (with proper earlyclobbers added).  Not that that really
> matters of course, it just feels wrong :-)

After "mtmsr %3", it is always possible to copy %0 to %3 and use it as
an address register for the second loop. One register less to allocate
for the compiler. Constraints of course have to be adjusted.

Gabriel
> 
> 
> Segher


Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Gabriel Paubert
On Thu, Jan 24, 2019 at 04:58:41PM +0100, Christophe Leroy wrote:
> 
> 
> On 24/01/2019 at 16:01, Christophe Leroy wrote:
> > 
> > 
> > On 24/01/2019 at 10:43, Christophe Leroy wrote:
> > > 
> > > 
> > > On 01/24/2019 01:06 AM, Michael Ellerman wrote:
> > > > Christophe Leroy  writes:
> > > > > Le 12/01/2019 à 10:55, Christophe Leroy a écrit :
> > > > > > The purpose of this serie is to activate
> > > > > > CONFIG_THREAD_INFO_IN_TASK which
> > > > > > moves the thread_info into task_struct.
> > > > > > 
> > > > > > Moving thread_info into task_struct has the following advantages:
> > > > > > - It protects thread_info from corruption in the case of stack
> > > > > > overflows.
> > > > > > - Its address is harder to determine if stack addresses are
> > > > > > leaked, making a number of attacks more difficult.
> > > > > 
> > > > > I ran null_syscall and context_switch benchmark selftests
> > > > > and the result
> > > > > is surprising. There is slight degradation in context_switch and a
> > > > > significant one on null_syscall:
> > > > > 
> > > > > Without the serie:
> > > > > 
> > > > > ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> > > > > 55542
> > > > > 55562
> > > > > 55564
> > > > > 55562
> > > > > 55568
> > > > > ...
> > > > > 
> > > > > ~# ./null_syscall
> > > > >  2546.71 ns 336.17 cycles
> > > > > 
> > > > > 
> > > > > With the serie:
> > > > > 
> > > > > ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> > > > > 55138
> > > > > 55142
> > > > > 55152
> > > > > 55144
> > > > > 55142
> > > > > 
> > > > > ~# ./null_syscall
> > > > >  3479.54 ns 459.30 cycles
> > > > > 
> > > > > So 0,8% less context switches per second and 37% more time
> > > > > for one syscall ?
> > > > > 
> > > > > Any idea ?
> > > > 
> > > > What platform is that on?
> > > 
> > > It is on the 8xx
> 
> On the 83xx, I have a slight improvement:
> 
> Without the serie:
> 
> root@vgoippro:~# ./null_syscall
> 921.44 ns 307.15 cycles
> 
> With the serie:
> 
> root@vgoippro:~# ./null_syscall
> 918.78 ns 306.26 cycles
> 

The 8xx has very low cache associativity, something like 2, right?

In this case it may be access patterns which change the number of
cache line transfers when you move things around. 

Try to move things around in main(), for example allocate a variable of
~1kB on the stack in the function that performs the null_syscalls (use 
the variable before and after the loop, to avoid clever compiler
optimizations).
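Something like this in the test loop (a hypothetical sketch; 'iterations'
and the choice of getppid() as the null syscall are placeholders):

	char pad[1024];		/* shifts the stack layout of the caller */

	pad[0] = 1;		/* touch it before the loop ... */
	for (i = 0; i < iterations; i++)
		getppid();
	pad[1023] = 1;		/* ... and after, so it is not optimized out */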

Gabriel


> Christophe
> 
> > > 
> > > > 
> > > > On 64-bit we have to turn one mtmsrd into two and that's obviously a
> > > > slow down. But I don't see that you've done anything similar in 32-bit
> > > > code.
> > > > 
> > > > I assume it's patch 8 that causes the slow down?
> > > 
> > > I have not digged into it yet, but why patch 8 ?
> > > 
> > 
> > The increase of null_syscall duration happens with patch 5 when we
> > activate CONFIG_THREAD_INFO_IN_TASK.
> > 


Re: [PATCH v2 13/29] arch: add split IPC system calls where needed

2019-01-18 Thread Gabriel Paubert
On Fri, Jan 18, 2019 at 05:18:19PM +0100, Arnd Bergmann wrote:
> The IPC system call handling is highly inconsistent across architectures,
> some use sys_ipc, some use separate calls, and some use both.  We also
> have some architectures that require passing IPC_64 in the flags, and
> others that set it implicitly.
> 
> For the additon of a y2083 safe semtimedop() system call, I chose to only

It's not critical, but there are two typos in that line: 
additon -> addition
2083 -> 2038

Gabriel

> support the separate entry points, but that requires first supporting
> the regular ones with their own syscall numbers.
> 
> The IPC_64 is now implied by the new semctl/shmctl/msgctl system
> calls even on the architectures that require passing it with the ipc()
> multiplexer.
> 
> I'm not adding the new semtimedop() or semop() on 32-bit architectures,
> those will get implemented using the new semtimedop_time64() version
> that gets added along with the other time64 calls.
> Three 64-bit architectures (powerpc, s390 and sparc) get semtimedop().
> 
> Signed-off-by: Arnd Bergmann 


Re: [PATCH] powerpc/4xx/ocm: fix compiler error

2018-12-25 Thread Gabriel Paubert
On Sat, Dec 22, 2018 at 05:04:51PM -0600, Segher Boessenkool wrote:
> On Sat, Dec 22, 2018 at 08:37:28PM +0100, christophe leroy wrote:
> > On 22/12/2018 at 18:16, Segher Boessenkool wrote:
> > >On Sat, Dec 22, 2018 at 02:08:02PM +0100, christophe leroy wrote:
> > >>
> > >>Usually, Guarded implies no exec (at least on 6xx and 8xx).
> > >
> > >Huh?  What do you mean here?
> > 
> > From the 885 Reference Manual:
> > 
> > Address translation: the EA is translated by using the MMU’s TLB 
> > mechanism. Instructions are not fetched from no-execute or guarded 
> > memory and data accesses are not executed speculatively to or from the 
> > guarded memory.
> > 
> > 6.1.3.4 Instruction TLB Error Exception (0x01300)
> > This type of exception occurs as a result of one of the following 
> > conditions if MSR[IR] = 1:
> > - The EA cannot be translated.
> > - The fetch access violates memory protection
> > - The fetch access is to guarded memory
> > 
> > 
> > From e300core reference manual:
> > 
> > Translation Exception Conditions:
> > Exception condition: Instruction fetch from guarded memory
> > with MSR[IR] = 1 ==> ISI interrupt SRR1[3] = 1
> 
> Right, but you said 6xx as well, i.e. pure PowerPC.
> 
> If for example IR=0 you cannot have N=1, but you do have G=1.  There is
> no case where G=1 implies N=1 afaik, or where fetch is prohibited some
> other way (causes an ISI, say).

From the 750fx user's manual, similar wording can be found for other
processors (except 601), including the ones that use software TLB load
like 603/603e (the test for guarded memory is in the sample code for
the tlb miss handler):

"
An ISI exception occurs when no higher priority exception exists and an
attempt to fetch the next instruction fails. This exception is implemented 
as it is defined by the PowerPC architecture (OEA), and is taken for the
following conditions:
•   The effective address cannot be translated.
•   The fetch access is to a no-execute segment (SR[N] = 1).

•   The fetch access is to guarded storage and MSR[IR] = 1.

•   The fetch access is to a segment for which SR[T] is set.
•   The fetch access violates memory protection.
When an ISI exception is taken, instruction fetching resumes at offset
0x00400 from the physical base indicated by MSR[IP].
"

Gabriel


Re: Looking for architecture papers

2018-10-04 Thread Gabriel Paubert
On Thu, Oct 04, 2018 at 10:41:13AM +0300, Raz wrote:
> Frankly, the more I read the more perplexed I get. For example,
> according to Book III-S, chapter 3,
> the MSR bits differ from the ones described in
> arch/powerpc/include/asm/reg.h.
> Bit zero is LE, but in the book it is 64-bit mode.

Just a problem of bit order definitions: IBM hardware definitions use
big-endian bit ordering, where bit 0 is the most significant bit.

> 
> Would someone be kind to explain what I do not understand?
> 
> Thank you

Gabriel


Re: [PATCH 1/2] powerpc/pkeys: preallocate execute_only key only if the key is available.

2018-06-30 Thread Gabriel Paubert
On Fri, Jun 29, 2018 at 09:58:37PM -0300, Thiago Jung Bauermann wrote:
> 
> Gabriel Paubert  writes:
> 
> > On Thu, Jun 28, 2018 at 11:56:34PM -0300, Thiago Jung Bauermann wrote:
> >>
> >> Hello,
> >>
> >> Ram Pai  writes:
> >>
> >> > Key 2 is preallocated and reserved for execute-only key. In rare
> >> > cases if key-2 is unavailable, mprotect(PROT_EXEC) will behave
> >> > incorrectly. NOTE: mprotect(PROT_EXEC) uses execute-only key.
> >> >
> >> > Ensure key 2 is available for preallocation before reserving it for
> >> > execute_only purpose.  Problem noticed by Michael Ellerman.
> >>
> >> Since "powerpc/pkeys: Preallocate execute-only key" isn't upstream yet,
> >> this patch could be squashed into it.
> >>
> >> > Signed-off-by: Ram Pai 
> >> > ---
> >> >  arch/powerpc/mm/pkeys.c |   14 +-
> >> >  1 files changed, 9 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> >> > index cec990c..0b03914 100644
> >> > --- a/arch/powerpc/mm/pkeys.c
> >> > +++ b/arch/powerpc/mm/pkeys.c
> >> > @@ -19,6 +19,7 @@
> >> >  u64  pkey_amr_mask; /* Bits in AMR not to be touched */
> >> >  u64  pkey_iamr_mask;/* Bits in AMR not to be touched */
> >> >  u64  pkey_uamor_mask;   /* Bits in UMOR not to be touched */
> >> > +int  execute_only_key = 2;
> >> >
> >> >  #define AMR_BITS_PER_PKEY 2
> >> >  #define AMR_RD_BIT 0x1UL
> >> > @@ -26,7 +27,6 @@
> >> >  #define IAMR_EX_BIT 0x1UL
> >> >  #define PKEY_REG_BITS (sizeof(u64)*8)
> >> >  #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
> >> > -#define EXECUTE_ONLY_KEY 2
> >> >
> >> >  static void scan_pkey_feature(void)
> >> >  {
> >> > @@ -122,8 +122,12 @@ int pkey_initialize(void)
> >> >  #else
> >> >  os_reserved = 0;
> >> >  #endif
> >> > +
> >> > +if ((pkeys_total - os_reserved) <= execute_only_key)
> >> > +execute_only_key = -1;
> >> > +
> >> >  /* Bits are in LE format. */
> >> > -reserved_allocation_mask = (0x1 << 1) | (0x1 << 
> >> > EXECUTE_ONLY_KEY);
> >> > +reserved_allocation_mask = (0x1 << 1) | (0x1 << 
> >> > execute_only_key);
> >>
> >> My understanding is that left-shifting by a negative amount is undefined
> >> behavior in C. A quick test tells me that at least on the couple of
> >> machines I tested, 1 << -1 = 0. Does GCC guarantee that behavior?
> >
> > Not in general. It probably always works on Power because of the definition
> > of the machine instruction for shifts with variable amount (consider the
> > shift amount unsigned and take it modulo twice the width of the operand),
> 
> Ok, thanks for confirming.
> 
> > but for example it fails on x86 (1<<-1 gives 0x80000000).
> 
> Strange, this works on my laptop with an Intel(R) Core(TM) i5-7300U CPU:
> 
> $ cat blah.c
> #include <stdio.h>
> 
> int main(int argc, char *argv[])
> {
> printf("1 << -1 = %llx\n", (unsigned long long) 1 << -1);
> return 0;
> }
> $ make blah
> cc blah.c   -o blah
> blah.c: In function 'main':
> blah.c:5:52: warning: left shift count is negative [-Wshift-count-negative]
>   printf("1 << -1 = %llx\n", (unsigned long long) 1 << -1);
> ^~
> $ ./blah
> 1 << -1 = 0
> 
> Even if I change the cast and printf format to int, the result is the
> same. Or am I doing it wrong?

Try something more dynamic, 1 << -1 is evaluated at compile time by gcc,
and when you get a warning, the compile time expression evaluation does
not always give the same result as the run time machine instructions.

To test this I wrote (yes, the error checking is approximate, it would
be better to use strtol):

/***/
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
int val, amount, valid;
if (argc != 3) {
printf("Needs 2 arguments!\n");
return EXIT_FAILURE;
}
valid = sscanf(argv[1], "%d", &val);
if (valid == 1) {
valid = sscanf(argv[2], "%d", &amount);

Re: [PATCH 1/2] powerpc/pkeys: preallocate execute_only key only if the key is available.

2018-06-29 Thread Gabriel Paubert
On Thu, Jun 28, 2018 at 11:56:34PM -0300, Thiago Jung Bauermann wrote:
> 
> Hello,
> 
> Ram Pai  writes:
> 
> > Key 2 is preallocated and reserved for execute-only key. In rare
> > cases if key-2 is unavailable, mprotect(PROT_EXEC) will behave
> > incorrectly. NOTE: mprotect(PROT_EXEC) uses execute-only key.
> >
> > Ensure key 2 is available for preallocation before reserving it for
> > execute_only purpose.  Problem noticed by Michael Ellerman.
> 
> Since "powerpc/pkeys: Preallocate execute-only key" isn't upstream yet,
> this patch could be squashed into it.
> 
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/mm/pkeys.c |   14 +-
> >  1 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> > index cec990c..0b03914 100644
> > --- a/arch/powerpc/mm/pkeys.c
> > +++ b/arch/powerpc/mm/pkeys.c
> > @@ -19,6 +19,7 @@
> >  u64  pkey_amr_mask;/* Bits in AMR not to be touched */
> >  u64  pkey_iamr_mask;   /* Bits in AMR not to be touched */
> >  u64  pkey_uamor_mask;  /* Bits in UMOR not to be touched */
> > +int  execute_only_key = 2;
> >
> >  #define AMR_BITS_PER_PKEY 2
> >  #define AMR_RD_BIT 0x1UL
> > @@ -26,7 +27,6 @@
> >  #define IAMR_EX_BIT 0x1UL
> >  #define PKEY_REG_BITS (sizeof(u64)*8)
> >  #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
> > -#define EXECUTE_ONLY_KEY 2
> >
> >  static void scan_pkey_feature(void)
> >  {
> > @@ -122,8 +122,12 @@ int pkey_initialize(void)
> >  #else
> > os_reserved = 0;
> >  #endif
> > +
> > +   if ((pkeys_total - os_reserved) <= execute_only_key)
> > +   execute_only_key = -1;
> > +
> > /* Bits are in LE format. */
> > -   reserved_allocation_mask = (0x1 << 1) | (0x1 << EXECUTE_ONLY_KEY);
> > +   reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);
> 
> My understanding is that left-shifting by a negative amount is undefined
> behavior in C. A quick test tells me that at least on the couple of
> machines I tested, 1 << -1 = 0. Does GCC guarantee that behavior? 

Not in general. It probably always works on Power because of the definition 
of the machine instruction for shifts with variable amount (consider the 
shift amount unsigned and take it modulo twice the width of the operand), 
but for example it fails on x86 (1<<-1 gives 0x80000000).

> If so, a comment pointing this out would make this less confusing.

Unless I miss something, this code is run once at boot, so its
performance is irrelevant.

In this case simply rewrite it as:

	reserved_allocation_mask = 0x1 << 1;
	if ((pkeys_total - os_reserved) <= execute_only_key) {
		execute_only_key = -1;
	} else {
		reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);
	}

Caveats, I have assumed that:
- the code only runs once,
- pkeys_total and os_reserved are int, not unsigned

> 
> > initial_allocation_mask  = reserved_allocation_mask | (0x1 << PKEY_0);
> >
> > /* register mask is in BE format */
> > @@ -132,11 +136,11 @@ int pkey_initialize(void)
> >
> > pkey_iamr_mask = ~0x0ul;
> > pkey_iamr_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> > -   pkey_iamr_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> > +   pkey_iamr_mask &= ~(0x3ul << pkeyshift(execute_only_key));
> >
> > pkey_uamor_mask = ~0x0ul;
> > pkey_uamor_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> > -   pkey_uamor_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> > +   pkey_uamor_mask &= ~(0x3ul << pkeyshift(execute_only_key));
> 
> Here the behaviour is undefined in C as well, given that pkeyshift(-1) =
> 64, which is the total number of bits in the left operand. Does GCC
> guarantee that the result will be 0 here as well?

Same answer: very likely on Power, not portable.

Gabriel


Re: [PATCH v3 00/12] macintosh: Resolve various PMU driver problems

2018-06-27 Thread Gabriel Paubert
On Wed, Jun 27, 2018 at 05:39:15PM +1200, Michael Schmitz wrote:
> Ben,
> 
> Am 27.06.2018 um 15:27 schrieb Benjamin Herrenschmidt:
> > On Wed, 2018-06-27 at 13:08 +1000, Michael Ellerman wrote:
> > > > I will rewrite patch 10/12 after Arnd's fixes and this series have all
> > > > made their way through both powerpc and m68k trees, and submit it
> > > > separately.
> > > 
> > > drivers/macintosh is supposedly maintained by Ben, but I'm not sure this
> > > series is high on his todo list.
> > 
> > We need at least to pull out a couple of powerbooks we have sitting in
> > drawers and try it out.
> 
> No need to root around in drawers - I'm sitting in front of one (G3
> Titanium) writing this. Let me know if that would help (I've forgotten most
> of what I knew about powerpc kernels).

A very rare model, all G3 I've seen were plastic, all Titanium were G4 :-)

Cheers,
Gabriel
(still using my Pismo from time to time, but much less since
sleep was broken)


Re: [PATCH v4 3/4] powerpc/lib: implement strlen() in assembly

2018-06-08 Thread Gabriel Paubert
On Fri, Jun 08, 2018 at 10:20:41AM +, Christophe Leroy wrote:
> The generic implementation of strlen() reads strings byte per byte.
> 
> This patch implements strlen() in assembly based on a read of entire
> words, in the same spirit as what some other arches and glibc do.
> 
> On a 8xx the time spent in strlen is reduced by 2/3 for long strings.
> 
> strlen() selftest on an 8xx provides the following values:
> 
> Before the patch (ie with the generic strlen() in lib/string.c):
> 
> len 256 : time = 0.803648
> len 16  : time = 0.062989
> len 4   : time = 0.026269
> 
> After the patch:
> 
> len 256 : time = 0.267791  ==>  66% improvement
> len 16  : time = 0.037902  ==>  41% improvement
> len 4   : time = 0.026124  ==>  no degradation
> 
> Signed-off-by: Christophe Leroy 
> ---
> Not tested on PPC64.
> 
> Changes in v4:
>  - Added alignment of the loop
>  - doing the andc only if still not 0 as it happens only for bytes above 
> 0x7f which is pretty rare in a string
> 
> Changes in v3:
>  - Made it common to PPC32 and PPC64
> 
> Changes in v2:
>  - Moved handling of unaligned strings outside of the main path as it is very 
> unlikely.
>  - Removed the verification of the fourth byte in case none of the three 
> first ones are NUL.
> 
> 
>  arch/powerpc/include/asm/asm-compat.h |  4 +++
>  arch/powerpc/include/asm/string.h |  1 +
>  arch/powerpc/lib/string.S | 57 
> +++
>  3 files changed, 62 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/asm-compat.h 
> b/arch/powerpc/include/asm/asm-compat.h
> index 7f2a7702596c..0e99fe7570c0 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -20,8 +20,10 @@
>  
>  /* operations for longs and pointers */
>  #define PPC_LL   stringify_in_c(ld)
> +#define PPC_LLU  stringify_in_c(ldu)
>  #define PPC_STL  stringify_in_c(std)
>  #define PPC_STLU stringify_in_c(stdu)
> +#define PPC_ROTLI    stringify_in_c(rotldi)
>  #define PPC_LCMPI    stringify_in_c(cmpdi)
>  #define PPC_LCMPLI   stringify_in_c(cmpldi)
>  #define PPC_LCMP stringify_in_c(cmpd)
> @@ -53,8 +55,10 @@
>  
>  /* operations for longs and pointers */
>  #define PPC_LL   stringify_in_c(lwz)
> +#define PPC_LLU  stringify_in_c(lwzu)
>  #define PPC_STL  stringify_in_c(stw)
>  #define PPC_STLU stringify_in_c(stwu)
> +#define PPC_ROTLI    stringify_in_c(rotlwi)
>  #define PPC_LCMPI    stringify_in_c(cmpwi)
>  #define PPC_LCMPLI   stringify_in_c(cmplwi)
>  #define PPC_LCMP stringify_in_c(cmpw)
> diff --git a/arch/powerpc/include/asm/string.h 
> b/arch/powerpc/include/asm/string.h
> index 9b8cedf618f4..8fdcb532de72 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -13,6 +13,7 @@
>  #define __HAVE_ARCH_MEMCHR
>  #define __HAVE_ARCH_MEMSET16
>  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE
> +#define __HAVE_ARCH_STRLEN
>  
>  extern char * strcpy(char *,const char *);
>  extern char * strncpy(char *,const char *, __kernel_size_t);
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index 4b41970e9ed8..238f61e2024f 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -67,3 +67,60 @@ _GLOBAL(memchr)
>  2:   li  r3,0
>   blr
>  EXPORT_SYMBOL(memchr)
> +
> +_GLOBAL(strlen)
> +        andi.   r9, r3, (SZL - 1)
> +        addi    r10, r3, -SZL
> +        bne-    1f
> +2:      lis     r6, 0x8080
> +        ori     r6, r6, 0x8080          /* r6 = 0x80808080 (himagic) */
> +#ifdef CONFIG_PPC64
> +        rldimi  r6, r6, 32, 0           /* r6 = 0x8080808080808080 (himagic) */
> +#endif
> +        PPC_ROTLI r7, r6, 1             /* r7 = 0x01010101(01010101) (lomagic)*/
> +        .balign IFETCH_ALIGN_BYTES
> +3:      PPC_LLU r9, SZL(r10)
> +        /* ((x - lomagic) & ~x & himagic) == 0 means no byte in x is NUL */
> +        subf    r8, r7, r9
> +        and.    r8, r8, r6
> +        beq+    3b
> +        andc.   r8, r8, r9
> +        beq+    3b
> +#ifdef CONFIG_PPC64
> +        rldicl. r8, r9, 8, 56
> +        beq     20f
> +        rldicl. r8, r9, 16, 56
> +        beq     21f
> +        rldicl. r8, r9, 24, 56
> +        beq     22f
> +        rldicl. r8, r9, 32, 56
> +        beq     23f
> +        addi    r10, r10, 4
> +#endif
> +        rlwinm. r8, r9, 0, 0xff000000
> +        beq     20f
> +        rlwinm. r8, r9, 0, 0x00ff0000
> +        beq     21f
> +        rlwinm. r8, r9, 0, 0x0000ff00
> +        beq     22f
> +23:     subf    r3, r3, r10

Actually these rlwinm. can likely be replaced by a single
cntlzw/cntlzd; for 32 bit something like:

        cntlzw  r8,r9
        subf    r3,r3,r10
        srwi    r8,r8,3
        add     r3,r3,r8
        blr

and similar for 64 bit but with cntlzd.
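
In C terms, the tail computation then becomes a count-leading-zeros; a
user-space sketch (not the kernel code above — and note it uses the
exact zero-byte mask hi & ~(v | ((v | hi) - lo)), which has no false
positives from borrow propagation):

/* Byte offset of the first NUL in a big-endian-loaded word v, which
 * is known to contain at least one NUL byte (clz of 0 is undefined).
 */
static unsigned int first_nul_offset(unsigned long v)
{
        unsigned long himagic = 0x8080808080808080ul;
        unsigned long lomagic = 0x0101010101010101ul;
        /* 0x80 set in each byte of v that is 0x00, and nowhere else */
        unsigned long mask = himagic & ~(v | ((v | himagic) - lomagic));

        /* this is what the cntlzd + srdi pair computes */
        return __builtin_clzl(mask) >> 3;
}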

Gabriel


> +        addi    r3, r3, 3
> +        blr
> +22:     subf    r3, r3, r10
> +        addi    r3, r3, 2
> +        blr
> +21:     subf    r3, r3, r10
> +        addi    r3, r3, 1
> +        blr
> +19:     addi    r10, r10, (SZL - 1)
> +20:  subf   

Re: administrivia: mails containing HTML attachments

2018-05-16 Thread Gabriel Paubert
On Wed, May 16, 2018 at 02:48:29PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> I have decided that any email sent to the linuxppc-dev mailing list
> that contains an HTML attachment (or is just an HTML email) will be
> rejected.  The vast majority of such mail are spam (and I have to spend
> time dropping them manually at the moment) and, I presume, anyone on
> this list is capable of sending no HTML email.

Go ahead, I couldn't agree more.

The correlation between spam and "HTeMaiL" is indeed extremely large.

Cheers,
Gabriel


Re: [PATCH v2 01/10] powerpc: Add security feature flags for Spectre/Meltdown

2018-03-27 Thread Gabriel Paubert
On Tue, Mar 27, 2018 at 11:01:44PM +1100, Michael Ellerman wrote:
> This commit adds security feature flags to reflect the settings we
> receive from firmware regarding Spectre/Meltdown mitigations.
> 
> The feature names reflect the names we are given by firmware on bare
> metal machines. See the hostboot source for details.
> 
> Arguably these could be firmware features, but that then requires them
> to be read early in boot so they're available prior to asm feature
> patching, but we don't actually want to use them for patching. We may
> also want to dynamically update them in future, which would be
> incompatible with the way firmware features work (at the moment at
> least). So for now just make them separate flags.
> 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/security_features.h | 65 
> 
>  arch/powerpc/kernel/Makefile |  2 +-
>  arch/powerpc/kernel/security.c   | 15 +++
>  3 files changed, 81 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/include/asm/security_features.h
>  create mode 100644 arch/powerpc/kernel/security.c
> 
> 
> v2: Rebased on top of LPM changes.
> 
> diff --git a/arch/powerpc/include/asm/security_features.h 
> b/arch/powerpc/include/asm/security_features.h
> new file mode 100644
> index ..db00ad2c72c2
> --- /dev/null
> +++ b/arch/powerpc/include/asm/security_features.h
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +/*
> + * Security related feature bit definitions.
> + *
> + * Copyright 2018, Michael Ellerman, IBM Corporation.
> + */
> +
> +#ifndef _ASM_POWERPC_SECURITY_FEATURES_H
> +#define _ASM_POWERPC_SECURITY_FEATURES_H
> +
> +
> +extern unsigned long powerpc_security_features;
> +
> +static inline void security_ftr_set(unsigned long feature)
> +{
> + powerpc_security_features |= feature;
> +}
> +
> +static inline void security_ftr_clear(unsigned long feature)
> +{
> + powerpc_security_features &= ~feature;
> +}
> +
> +static inline bool security_ftr_enabled(unsigned long feature)
> +{
> + return !!(powerpc_security_features & feature);
> +}
> +
> +
> +// Features indicating support for Spectre/Meltdown mitigations
> +
> +// The L1-D cache can be flushed with ori r30,r30,0
> +#define SEC_FTR_L1D_FLUSH_ORI30  0x0001ull
> +
> +// The L1-D cache can be flushed with mtspr 882,r0 (aka SPRN_TRIG2)
> +#define SEC_FTR_L1D_FLUSH_TRIG2  0x0002ull
> +
> +// ori r31,r31,0 acts as a speculation barrier
> +#define SEC_FTR_SPEC_BAR_ORI31   0x0004ull
> +
> +// Speculation past bctr is disabled
> +#define SEC_FTR_BCCTRL_SERIALISED0x0008ull

Nitpicks: 

1) bcctr or bcctrL ?

2) seraliaZe seems to be more popular than serialiSe in the kernel
   (1769 hits from "grep -ir serializ", 264 with the "s")
   Still needs to grep for both in any case, bummer!


Gabriel
> +
> +// Entries in L1-D are private to a SMT thread
> +#define SEC_FTR_L1D_THREAD_PRIV  0x0010ull
> +
> +// Indirect branch prediction cache disabled
> +#define SEC_FTR_COUNT_CACHE_DISABLED 0x0020ull
> +
> +
> +// Features indicating need for Spectre/Meltdown mitigations
> +
> +// The L1-D cache should be flushed on MSR[HV] 1->0 transition (hypervisor 
> to guest)
> +#define SEC_FTR_L1D_FLUSH_HV 0x0040ull
> +
> +// The L1-D cache should be flushed on MSR[PR] 0->1 transition (kernel to 
> userspace)
> +#define SEC_FTR_L1D_FLUSH_PR 0x0080ull
> +
> +// A speculation barrier should be used for bounds checks (Spectre variant 1)
> +#define SEC_FTR_BNDS_CHK_SPEC_BAR0x0100ull
> +
> +// Firmware configuration indicates user favours security over performance
> +#define SEC_FTR_FAVOUR_SECURITY  0x0200ull
> +
> +#endif /* _ASM_POWERPC_SECURITY_FEATURES_H */
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 1b6bc7fba996..d458c45e5004 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -42,7 +42,7 @@ obj-$(CONFIG_VDSO32)+= vdso32/
>  obj-$(CONFIG_PPC_WATCHDOG)   += watchdog.o
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_ppc970.o cpu_setup_pa6t.o
> -obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_power.o
> +obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_power.o security.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += mce.o mce_power.o
>  obj-$(CONFIG_PPC_BOOK3E_64)  += exceptions-64e.o idle_book3e.o
>  obj-$(CONFIG_PPC64)  += vdso64/
> diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
> new file mode 100644
> index ..4ccba00d224c
> --- /dev/null
> +++ b/arch/powerpc/kernel/security.c
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +//
> +// Security related flags and so on.
> +//
> +// Copyright 2018, 

Re: RFC on writel and writel_relaxed

2018-03-22 Thread Gabriel Paubert
On Thu, Mar 22, 2018 at 08:25:43PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paub...@iram.es> wrote:
> > On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> >> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <david.lai...@aculab.com> 
> >> wrote:
> >> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> >> get reordered. ARM64 architecturally guarantees device writes to be 
> >> >> observed
> >> >> in order.
> >> >
> >> > There are places where you don't even need a compile barrier between
> >> > every write.
> >> >
> >> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> >> > optimised to have no extra barriers.
> >> > I ended up just writing through 'pointer to volatile' and adding an
> >> > explicit 'eieio' between the block of writes and status read.
> >>
> >> This is what you are supposed to do. For accesses to MMIO (cache
> >> inhibited + guarded) storage the Power ISA guarantees that load-load
> >> and store-store pairs of accesses will always occur in program order,
> >> but there's no implicit ordering between load-store or store-load
> >
> > And even for load store, eieio is not always necessary, in the important
> > case of reading and writing to the same address, when modifying bits in
> > a control register for example.
> >
> > Typically also loads will be moved ahead of stores, but not the other
> > way around, so in practice you won't notice a missed eieio in this case.
> > This does not mean you should not insert it.
> 
> Yep, but it doesn't really help us here. The generic accessors need to cope
> with the general case.

A generic accessor for modifying fields in a device register might be a 
useful addition to the current set. This is a fairly frequent operation.

Actually I did add macros to do exactly this in drivers for our own 
hardware here almost 20 years ago. I was fed up with writing
writel(readl(reg) & mask | value, reg), especially when reg was not
that simple (one device had over 100 registers). The macros obviously 
guaranteed that both accesses would be to the same register, something 
easy to get wrong with cut and paste.
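
For illustration, such a helper could look like this (a hypothetical
sketch with made-up names; a real version would also have to pick the
right barrier flavour):

#include <linux/io.h>

/* Read-modify-write of an MMIO register: keep the bits selected by
 * 'mask', OR in 'value', and write the result back to the *same*
 * register, which is the property the macros above were meant to
 * guarantee.
 */
static inline void rmw_reg32(void __iomem *reg, u32 mask, u32 value)
{
        writel((readl(reg) & mask) | value, reg);
}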

> 
> >> pairs. In those cases you need an explicit eieio barrier between the
> >> two accesses. At the HW level you can think of the CPU as having
> >> separate queues for MMIO loads and stores. Accesses will be added to
> >> the respective queue in program order, but there's no synchronisation
> >> between the two queues. If the CPU is doing write combining it's easy
> >> to imagine the whole store queue being emptied in one big gulp before
> >> the load queue is even touched.
> >
> > Is write combining allowed on guarded storage?
> >
> > 
> > From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
> >
> > "No combining occurs if the storage is also Guarded"
> 
> Yeah it's not allowed. That's what I get for handwaving examples ;)

At least it means that, for cache-inhibited guarded storage, there is a 
one-to-one correspondence between instructions and bus cycles. The only 
issue left is ordering ;)

Gabriel


Re: RFC on writel and writel_relaxed

2018-03-22 Thread Gabriel Paubert
On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 1:35 AM, David Laight  wrote:
> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> get reordered. ARM64 architecturally guarantees device writes to be 
> >> observed
> >> in order.
> >
> > There are places where you don't even need a compile barrier between
> > every write.
> >
> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> > optimised to have no extra barriers.
> > I ended up just writing through 'pointer to volatile' and adding an
> > explicit 'eieio' between the block of writes and status read.
> 
> This is what you are supposed to do. For accesses to MMIO (cache
> inhibited + guarded) storage the Power ISA guarantees that load-load
> and store-store pairs of accesses will always occur in program order,
> but there's no implicit ordering between load-store or store-load

And even for load store, eieio is not always necessary, in the important
case of reading and writing to the same address, when modifying bits in
a control register for example.

Typically also loads will be moved ahead of stores, but not the other
way around, so in practice you won't notice a missed eieio in this case.
This does not mean you should not insert it.

> pairs. In those cases you need an explicit eieio barrier between the
> two accesses. At the HW level you can think of the CPU as having
> separate queues for MMIO loads and stores. Accesses will be added to
> the respective queue in program order, but there's no synchronisation
> between the two queues. If the CPU is doing write combining it's easy
> to imagine the whole store queue being emptied in one big gulp before
> the load queue is even touched.

Is write combining allowed on guarded storage? 


From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":

"No combining occurs if the storage is also Guarded"

Gabriel


Re: Wifi (B43) broken on G4 PowerBooks with kernel v4.15

2018-01-31 Thread Gabriel Paubert
On Wed, Jan 31, 2018 at 10:03:46AM +, James Hogan wrote:
> On Wed, Jan 31, 2018 at 10:37:30AM +0100, Gabriel Paubert wrote:
> > Hi,
> > 
> > yesterday I recompiled the kernel on my late 2005 G4 PowerBook, and the
> > Wifi stopped working. After comparing the configuration it turns out
> > that a change to a Kconfig condition disabled SSB support which is
> > necessary for these chips.
> > 
> > The graph of configuration options is quite messy, and I'm not sure that
> > I fully understand it.
> > 
> > Nevertheless the following patch fixes the regression, but there might
> > be a better way to solve the problem.
> > 
> > If you pick up this trivial patch as is, you may add:
> > 
> > Signed-off-by: Gabriel Paubert <paub...@iram.es>
> > 
> > diff --git a/drivers/ssb/Kconfig b/drivers/ssb/Kconfig
> > index 71c7376..d189db6 100644
> > --- a/drivers/ssb/Kconfig
> > +++ b/drivers/ssb/Kconfig
> > @@ -32,7 +32,7 @@ config SSB_BLOCKIO
> >  
> >  config SSB_PCIHOST_POSSIBLE
> > bool
> > -   depends on SSB && (PCI = y || PCI = SSB) && PCI_DRIVERS_LEGACY
> > +   depends on SSB && (PCI = y || PCI = SSB) && (PCI_DRIVERS_LEGACY || 
> > B43_SSB)
> > default y
> >  
> >  config SSB_PCIHOST
> > 
> 
> Yes, really sorry about that. There is a patch here:
> https://patchwork.kernel.org/patch/10185397/

Ok, thanks, it's even better than my patch. I was not aware of that
patch: AFAIK it was not posted to one of the mailing lists I track, and it
had not percolated to Linus' tree which I pulled just before posting.

Sorry for the noise.

Cheers,
Gabriel


Wifi (B43) broken on G4 PowerBooks with kernel v4.15

2018-01-31 Thread Gabriel Paubert
Hi,

yesterday I recompiled the kernel on my late 2005 G4 PowerBook, and the
Wifi stopped working. After comparing the configuration it turns out
that a change to a Kconfig condition disabled SSB support which is
necessary for these chips.

The graph of configuration options is quite messy, and I'm not sure that
I fully understand it.

Nevertheless the following patch fixes the regression, but there might
be a better way to solve the problem.

If you pick up this trivial patch as is, you may add:

Signed-off-by: Gabriel Paubert <paub...@iram.es>

diff --git a/drivers/ssb/Kconfig b/drivers/ssb/Kconfig
index 71c7376..d189db6 100644
--- a/drivers/ssb/Kconfig
+++ b/drivers/ssb/Kconfig
@@ -32,7 +32,7 @@ config SSB_BLOCKIO
 
 config SSB_PCIHOST_POSSIBLE
bool
-   depends on SSB && (PCI = y || PCI = SSB) && PCI_DRIVERS_LEGACY
+   depends on SSB && (PCI = y || PCI = SSB) && (PCI_DRIVERS_LEGACY || 
B43_SSB)
default y
 
 config SSB_PCIHOST



Re: [PATCH] macintosh: Add module license to ans-lcd

2018-01-29 Thread Gabriel Paubert
On Mon, Jan 29, 2018 at 01:33:08PM -0600, Larry Finger wrote:
> In kernel 4.15, the modprobe step on my PowerBook G5 started complaining that

PowerBook G5? Really, could you send a pic! :-)

> there was no module license for ans-lcd.
> 
> Signed-off-by: Larry Finger 
> ---
>  drivers/macintosh/ans-lcd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/macintosh/ans-lcd.c b/drivers/macintosh/ans-lcd.c
> index 1de81d922d8a..c8e078b911c7 100644
> --- a/drivers/macintosh/ans-lcd.c
> +++ b/drivers/macintosh/ans-lcd.c
> @@ -201,3 +201,4 @@ anslcd_exit(void)
>  
>  module_init(anslcd_init);
>  module_exit(anslcd_exit);
> +MODULE_LICENSE("GPL v2");
> -- 
> 2.16.1
> 


Re: [RFC PATCH 3/8] powerpc/64s: put the per-cpu data_offset in r14

2017-12-20 Thread Gabriel Paubert
On Thu, Dec 21, 2017 at 12:52:01AM +1000, Nicholas Piggin wrote:
> Shifted left by 16 bits, so the low 16 bits of r14 remain available.
> This allows per-cpu pointers to be dereferenced with a single extra
> shift whereas previously it was a load and add.
> ---
>  arch/powerpc/include/asm/paca.h   |  5 +
>  arch/powerpc/include/asm/percpu.h |  2 +-
>  arch/powerpc/kernel/entry_64.S|  5 -
>  arch/powerpc/kernel/head_64.S |  5 +
>  arch/powerpc/kernel/setup_64.c| 11 +--
>  5 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index cd6a9a010895..4dd4ac69e84f 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -35,6 +35,11 @@
>  
>  register struct paca_struct *local_paca asm("r13");
>  #ifdef CONFIG_PPC_BOOK3S
> +/*
> + * The top 32-bits of r14 is used as the per-cpu offset, shifted by 
> PAGE_SHIFT.

Top 32, really? It's 48 in later comments.

Gabriel

> + * The per-cpu could be moved completely to vmalloc space if we had large
> + * vmalloc page mapping? (no, must access it in real mode).
> + */
>  register u64 local_r14 asm("r14");
>  #endif
>  
> diff --git a/arch/powerpc/include/asm/percpu.h 
> b/arch/powerpc/include/asm/percpu.h
> index dce863a7635c..1e0d79d30eac 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -12,7 +12,7 @@
>  
>  #include 
>  
> -#define __my_cpu_offset local_paca->data_offset
> +#define __my_cpu_offset (local_r14 >> 16)
>  
>  #endif /* CONFIG_SMP */
>  #endif /* __powerpc64__ */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 592e4b36065f..6b0e3ac311e8 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -262,11 +262,6 @@ system_call_exit:
>  BEGIN_FTR_SECTION
>   stdcx.  r0,0,r1 /* to clear the reservation */
>  END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
> - LOAD_REG_IMMEDIATE(r10, 0xdeadbeefULL << 32)
> - mfspr   r11,SPRN_PIR
> - or  r10,r10,r11
> - tdne    r10,r14
> -
>   andi.   r6,r8,MSR_PR
>   ld  r4,_LINK(r1)
>  
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 5a9ec06eab14..cdb710f43681 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -413,10 +413,7 @@ generic_secondary_common_init:
>   b   kexec_wait  /* next kernel might do better   */
>  
>  2:   SET_PACA(r13)
> - LOAD_REG_IMMEDIATE(r14, 0xdeadbeef << 32)
> - mfspr   r3,SPRN_PIR
> - or  r14,r14,r3
> - std r14,PACA_R14(r13)
> + ld  r14,PACA_R14(r13)
>  
>  #ifdef CONFIG_PPC_BOOK3E
>   addir12,r13,PACA_EXTLB  /* and TLB exc frame in another  */
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 9a4c5bf35d92..f4a96ebb523a 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -192,8 +192,8 @@ static void __init fixup_boot_paca(void)
>   get_paca()->data_offset = 0;
>   /* Mark interrupts disabled in PACA */
>   irq_soft_mask_set(IRQ_SOFT_MASK_STD);
> - /* Set r14 and paca_r14 to debug value */
> - get_paca()->r14 = (0xdeadbeefULL << 32) | mfspr(SPRN_PIR);
> + /* Set r14 and paca_r14 to zero */
> + get_paca()->r14 = 0;
>   local_r14 = get_paca()->r14;
>  }
>  
> @@ -761,7 +761,14 @@ void __init setup_per_cpu_areas(void)
>   for_each_possible_cpu(cpu) {
>  __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
>   paca[cpu].data_offset = __per_cpu_offset[cpu];
> +
> + BUG_ON(paca[cpu].data_offset & (PAGE_SIZE-1));
> + BUG_ON(paca[cpu].data_offset >= (1UL << (64 - 16)));
> +
> + /* The top 48 bits are used for per-cpu data */
> + paca[cpu].r14 |= paca[cpu].data_offset << 16;
>   }
> + local_r14 = paca[smp_processor_id()].r14;
>  }
>  #endif
>  
> -- 
> 2.15.0


Re: [PATCH v9 29/51] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-12-19 Thread Gabriel Paubert
On Mon, Dec 18, 2017 at 03:15:51PM -0800, Ram Pai wrote:
> On Mon, Dec 18, 2017 at 02:28:14PM -0800, Dave Hansen wrote:
> > On 12/18/2017 02:18 PM, Ram Pai wrote:
> > > b) minimum number of keys available to the application.
> > >   if libraries consumes a few, they could provide a library
> > >   interface to the application informing the number available to
> > >   the application.  The library interface can leverage (b) to
> > >   provide the information.
> > 
> > OK, let's see a real user of this including a few libraries.  Then we'll
> > put it in the kernel.
> > 
> > > c) types of disable-rights supported by keys.
> > >   Helps the application to determine the types of disable-features
> > >   available. This is helpful, otherwise the app has to 
> > >   make pkey_alloc() call with the corresponding parameter set
> > >   and see if it suceeds or fails. Painful from an application
> > >   point of view, in my opinion.
> > 
> > Again, let's see a real-world use of this.  How does it look?  How does
> > an app "fall back" if it can't set a restriction the way it wants to?
> > 
> > Are we *sure* that such an interface makes sense?  For instance, will it
> > be possible for some keys to be execute-disable while other are only
> > write-disable?
> 
> Can it be on x86?
> 
> its not possible on ppc. the keys can *all* be  the-same-attributes-disable 
> all the
> time.
> 
> However you are right. Its conceivable that some arch could provide a
> feature where it can be x-attribute-disable for key 'a' and
> y-attribute-disable for key 'b'.  But than its a bit of a headache
> for an application to consume such a feature.
> 
> Ben: I recall you requesting this feature.  Thoughts?
> 
> > 
> > > I think on x86 you look for some hardware registers to determine
> > > which hardware features are enabled by the kernel.
> > 
> > No, we use CPUID.  It's a part of the ISA that's designed for
> > enumerating CPU and (sometimes) OS support for CPU features.
> > 
> > > We do not have generic support for something like that on ppc.  The
> > > kernel looks at the device tree to determine what hardware features
> > > are available. But does not have mechanism to tell the hardware to
> > > track which of its features are currently enabled/used by the
> > > kernel; atleast not for the memory-key feature.
> > 
> > Bummer.  You're missing out.
> > 
> > But, you could still do this with a syscall.  "Hey, kernel, do you
> > support this feature?"
> 
> or do powerpc specific sysfs interface.
> or a debugfs interface.

getauxval(3) ?

With AT_HWCAP or HWCAP2 as parameter already gives information about
features supported by the hardware and the kernel.

Taking one bit to expose the availability of protection keys to
applications does not look impossible.
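
For illustration, the usual userspace pattern would then be (a sketch;
PPC_FEATURE2_MEM_KEYS is a made-up name, no such bit is defined today):

#include <stdio.h>
#include <sys/auxv.h>

/* Hypothetical bit: it would first have to be allocated in the
 * kernel's uapi headers.
 */
#define PPC_FEATURE2_MEM_KEYS   0x00000001

int main(void)
{
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        printf("memory protection keys: %s\n",
               (hwcap2 & PPC_FEATURE2_MEM_KEYS) ? "yes" : "no");
        return 0;
}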

Do I miss something obvious?

Gabriel


Re: [PATCH][V2] crypto/nx: fix spelling mistake: "availavle" -> "available"

2017-11-14 Thread Gabriel Paubert
On Tue, Nov 14, 2017 at 02:32:17PM +, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in pr_err error message text. Also
> fix spelling mistake in proceeding comment.

s/proceeding/preceding/ ?

> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/crypto/nx/nx-842-powernv.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/crypto/nx/nx-842-powernv.c 
> b/drivers/crypto/nx/nx-842-powernv.c
> index f2246a5abcf6..1e87637c412d 100644
> --- a/drivers/crypto/nx/nx-842-powernv.c
> +++ b/drivers/crypto/nx/nx-842-powernv.c
> @@ -743,8 +743,8 @@ static int nx842_open_percpu_txwins(void)
>   }
>  
>   if (!per_cpu(cpu_txwin, i)) {
> - /* shoudn't happen, Each chip will have NX engine */
> - pr_err("NX engine is not availavle for CPU %d\n", i);
> + /* shouldn't happen, Each chip will have NX engine */
> + pr_err("NX engine is not available for CPU %d\n", i);
>   return -EINVAL;
>   }
>   }
> -- 
> 2.14.1


Re: [PATCH v3 4/5] powerpc/lib/sstep: Add prty instruction emulation

2017-07-26 Thread Gabriel Paubert
On Tue, Jul 25, 2017 at 06:08:05PM +1000, Balbir Singh wrote:
> On Tue, 2017-07-25 at 13:33 +1000, Matt Brown wrote:
> > This adds emulation for the prtyw and prtyd instructions.
> > Tested for logical correctness against the prtyw and prtyd instructions
> > on ppc64le.
> > 
> > Signed-off-by: Matt Brown 
> > ---
> > v3:
> > - optimised using the Giles-Miller method of side-ways addition
> > v2:
> > - fixed opcodes
> > - fixed bitshifting and typecast errors
> > - merged do_prtyw and do_prtyd into single function
> > ---
> >  arch/powerpc/lib/sstep.c | 27 +++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> > index 6a79618..0bcf631 100644
> > --- a/arch/powerpc/lib/sstep.c
> > +++ b/arch/powerpc/lib/sstep.c
> > @@ -655,6 +655,25 @@ static nokprobe_inline void do_bpermd(struct pt_regs 
> > *regs, unsigned long v1,
> > regs->gpr[ra] = perm;
> >  }
> >  #endif /* __powerpc64__ */
> > +/*
> > + * The size parameter adjusts the equivalent prty instruction.
> > + * prtyw = 32, prtyd = 64
> > + */
> > +static nokprobe_inline void do_prty(struct pt_regs *regs, unsigned long v,
> > +   int size, int ra)
> > +{
> > +   unsigned long long res = v;
> > +
> > +   res = (0x0001000100010001 & res) + (0x0001000100010001 & (res >> 8));
> > > +   res = (0x0000000700000007 & res) + (0x0000000700000007 & (res >> 16));
> > > +   if (size == 32) {   /* prtyw */
> > > +       regs->gpr[ra] = (0x0000000100000001 & res);
> > > +       return;
> > > +   }
> > > +
> > > +   res = (0x000000000000000f & res) + (0x000000000000000f & (res >> 32));
> > > +   regs->gpr[ra] = res & 1;   /* prtyd */
> 
> Looks good, you can also xor instead of adding, but the masks would need
> to change in that case.

Actually, I'd prefer to use xor for this case, since you hardly ever
need any mask, except at the end. Most masks are only needed because of
carry propagation, which the xor completely avoid.

In other words:

unsigned long long res = v ^ (v >> 8);
res ^= res >> 16;
if (size == 32) {
        regs->gpr[ra] = res & 0x0000000100000001;
        return;
}
res ^= res >> 32;
regs->gpr[ra] = res & 1;

should work, but I've not even compiled it.
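
For what it's worth, a quick user-space check of the xor version
against a naive per-byte loop (prtyd path only, i.e. size == 64):

#include <stdio.h>
#include <stdlib.h>

/* parity of the least significant bit of each byte, the naive way */
static unsigned long long prtyd_ref(unsigned long long v)
{
        unsigned long long p = 0;
        int i;

        for (i = 0; i < 8; i++)
                p ^= (v >> (8 * i)) & 1;
        return p;
}

static unsigned long long prtyd_xor(unsigned long long v)
{
        unsigned long long res = v ^ (v >> 8);

        res ^= res >> 16;
        res ^= res >> 32;
        return res & 1;
}

int main(void)
{
        int i;

        srand(1);
        for (i = 0; i < 1000000; i++) {
                unsigned long long v = ((unsigned long long)rand() << 32)
                                       ^ (unsigned long long)rand();

                if (prtyd_xor(v) != prtyd_ref(v)) {
                        printf("mismatch for %#llx\n", v);
                        return 1;
                }
        }
        printf("ok\n");
        return 0;
}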

Gabriel


Re: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation

2017-07-26 Thread Gabriel Paubert
On Tue, Jul 25, 2017 at 01:33:17PM +1000, Matt Brown wrote:
> This adds emulations for the popcntb, popcntw, and popcntd instructions.
> Tested for correctness against the popcnt{b,w,d} instructions on ppc64le.
> 
> Signed-off-by: Matt Brown 
> ---
> v3:
>   - optimised using the Giles-Miller method of side-ways addition
> v2:
>   - fixed opcodes
>   - fixed typecasting
>   - fixed bitshifting error for both 32 and 64bit arch
> ---
>  arch/powerpc/lib/sstep.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 87d277f..c1f9cdb 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -612,6 +612,32 @@ static nokprobe_inline void do_cmpb(struct pt_regs 
> *regs, unsigned long v1,
>   regs->gpr[rd] = out_val;
>  }
>  
> +/*
> + * The size parameter is used to adjust the equivalent popcnt instruction.
> + * popcntb = 8, popcntw = 32, popcntd = 64
> + */
> +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1,
> + int size, int ra)
> +{
> + unsigned long long out = v1;
> +
> + out = (0x5555555555555555 & out) + (0x5555555555555555 & (out >> 1));

This can be simplified in a less obvious way as:
out -= (out >> 1) & 0x5555555555555555;

which maps each pair of bits according to the following:
00 -> 00
01 -> 01
10 -> 01
11 -> 10

This should save one instruction.

> + out = (0x3333333333333333 & out) + (0x3333333333333333 & (out >> 2));

Ok, but now each nibble is between 0 and 4, so the addition of two
nibbles can't overflow or generate carry into the higher one.

> + out = (0x0f0f0f0f0f0f0f0f & out) + (0x0f0f0f0f0f0f0f0f & (out >> 4));

out += out >> 4;
out &= 0x0f0f0f0f0f0f0f0f;

which should also save one instruction

> + if (size == 8) {   /* popcntb */
> + regs->gpr[ra] = out;
> + return;
> + }

At this point each count occupies at least one byte and can no longer
overflow, so masking is only needed before returning.

> + out = (0x001f001f001f001f & out) + (0x001f001f001f001f & (out >> 8));
out += out >> 8;

> + out = (0x0000003f0000003f & out) + (0x0000003f0000003f & (out >> 16));

out += out >> 16;

> + if (size == 32) {   /* popcntw */
> + regs->gpr[ra] = out;
regs->gpr[ra] = out & 0x0000003f0000003f;

> + return;
> + }
> + out = (0x000000000000007f & out) + (0x000000000000007f & (out >> 32));
out = (out + (out >> 32)) & 0x7f;
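
Putting those pieces together, the whole helper would then look
something like this (an untested sketch of the suggested rewrite, not
the patch as posted):

static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1,
                                      int size, int ra)
{
        unsigned long long out = v1;

        out -= (out >> 1) & 0x5555555555555555;
        out = (0x3333333333333333 & out) + (0x3333333333333333 & (out >> 2));
        out += out >> 4;
        out &= 0x0f0f0f0f0f0f0f0f;
        if (size == 8) {                /* popcntb */
                regs->gpr[ra] = out;
                return;
        }
        out += out >> 8;
        out += out >> 16;
        if (size == 32) {               /* popcntw */
                regs->gpr[ra] = out & 0x0000003f0000003f;
                return;
        }
        out = (out + (out >> 32)) & 0x7f;
        regs->gpr[ra] = out;            /* popcntd */
}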


Gabriel

> + regs->gpr[ra] = out;   /* popcntd */
> +}
> +
>  static nokprobe_inline int trap_compare(long v1, long v2)
>  {
>   int ret = 0;
> @@ -1194,6 +1220,10 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   regs->gpr[ra] = regs->gpr[rd] & ~regs->gpr[rb];
>   goto logical_done;
>  
> + case 122:   /* popcntb */
> + do_popcnt(regs, regs->gpr[rd], 8, ra);
> + goto logical_done;
> +
>   case 124:   /* nor */
>   regs->gpr[ra] = ~(regs->gpr[rd] | regs->gpr[rb]);
>   goto logical_done;
> @@ -1206,6 +1236,10 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   regs->gpr[ra] = regs->gpr[rd] ^ regs->gpr[rb];
>   goto logical_done;
>  
> + case 378:   /* popcntw */
> + do_popcnt(regs, regs->gpr[rd], 32, ra);
> + goto logical_done;
> +
>   case 412:   /* orc */
>   regs->gpr[ra] = regs->gpr[rd] | ~regs->gpr[rb];
>   goto logical_done;
> @@ -1217,7 +1251,11 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   case 476:   /* nand */
>   regs->gpr[ra] = ~(regs->gpr[rd] & regs->gpr[rb]);
>   goto logical_done;
> -
> +#ifdef __powerpc64__
> + case 506:   /* popcntd */
> + do_popcnt(regs, regs->gpr[rd], 64, ra);
> + goto logical_done;
> +#endif
>   case 922:   /* extsh */
>   regs->gpr[ra] = (signed short) regs->gpr[rd];
>   goto logical_done;
> -- 
> 2.9.3


Re: [PATCH 3/5] powerpc/lib/sstep: Add bpermd instruction emulation

2017-07-13 Thread Gabriel Paubert
On Thu, Jul 13, 2017 at 01:25:46PM +1000, Matt Brown wrote:
> This adds emulation for the bpermd instruction.
> 
> Signed-off-by: Matt Brown 
> ---
>  arch/powerpc/lib/sstep.c | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index cf69987..603654d 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -637,6 +637,21 @@ static nokprobe_inline void do_popcnt(struct pt_regs 
> *regs, unsigned long v1,
>   regs->gpr[ra] = out_val;
>  }
>  
> +static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1,
> + unsigned long v2, int ra)
> +{
> + unsigned int idx, i;
> + unsigned char perm;
> +
> + perm = 0x0;
> + for (i = 0; i < 8; i++) {
> + idx = (v1 >> (i * 8)) & 0xff;
> + if (idx < 64)
> + perm |= (v2 & (1 << idx)) >> (idx - i);
> + }
> + regs->gpr[ra] = 0 | perm;

Huh? What's the point of doing an or with 0?

The compiler will eliminate it, but it just confuses the reader.
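
For what it's worth, the same function without the redundant OR, and
with the shift of v2 done in 64 bits so it is well defined for idx up
to 63 (an untested sketch, keeping the patch's bit-numbering
convention):

static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1,
                                      unsigned long v2, int ra)
{
        unsigned int idx, i;
        unsigned char perm = 0;

        for (i = 0; i < 8; i++) {
                idx = (v1 >> (i * 8)) & 0xff;
                /* 1ul, not 1: idx can be up to 63 */
                if (idx < 64 && (v2 & (1ul << idx)))
                        perm |= 1u << i;
        }
        regs->gpr[ra] = perm;
}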

Gabriel
> +}
> +
>  static nokprobe_inline int trap_compare(long v1, long v2)
>  {
>   int ret = 0;
> @@ -1274,6 +1289,14 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   goto logical_done;
>  #endif
>  
> +#ifdef __powerpc64__
> + case 2396736:   /* bpermd */
> + val = regs->gpr[rd];
> + val2 = regs->gpr[rb];
> + do_bpermd(regs, val, val2, ra);
> + goto logical_done;
> +#endif
> +
>  /*
>   * Shift instructions
>   */
> -- 
> 2.9.3


Re: [PATCH] powerpc: Avoid panic during boot due to divide by zero in init_cache_info()

2017-03-06 Thread Gabriel Paubert
On Sun, Mar 05, 2017 at 11:24:56AM -0600, Segher Boessenkool wrote:
> On Sun, Mar 05, 2017 at 05:58:37PM +0100, Gabriel Paubert wrote:
> > > > Erk sorry. One of the static checkers spotted it, but I hadn't got
> > > > around to fixing it because it seemed to not actually blow up, guess
> > > > not.
> > > 
> > > The PowerPC divw etc. instructions do not trap by themselves, but recent
> > > GCC inserts trap instructions on code paths that are always undefined
> > > behaviour (like, dividing by zero).
> > 
> > Is it systematic or does it depend on, e.g., optimization levels?
> 
> In this case it needs -fisolate-erroneous-paths-dereference which is
> default at -O2 and higher.

Great, another optimization-dependent behaviour. :-(

But this is not the most serious issue: when you #include <limits>,
numeric_limits<int>::traps is false on PPC, and on no other
architecture that I know of (in practice this trap reflects the
hardware behaviour on division by zero).

By generating a trap in this case, I believe that the compiler violates
a contract given by <limits>, and the standard.

I'd certainly prefer a compile time warning, easily convertible to an
error.

> 
> > Is there anything in the standards about this feature?
> 
> The compiler can do whatever it likes with code that has undefined
> behaviour.  With this optimisation it a) can compile the conforming
> code to something better; and b) undefined behaviour will trap instead
> of doing something random (which often is exploitable).

It may be undefined, but I believe that the numeric_limits<>::traps
value clearly prohibits generating a trap in this case.

Gabriel


Re: [PATCH] powerpc: Avoid panic during boot due to divide by zero in init_cache_info()

2017-03-05 Thread Gabriel Paubert
On Sun, Mar 05, 2017 at 06:37:37AM -0600, Segher Boessenkool wrote:
> On Sun, Mar 05, 2017 at 09:26:47PM +1100, Michael Ellerman wrote:
> > > I see a panic in early boot when building with a recent gcc toolchain.
> > > The issue is a divide by zero, which is undefined. Older toolchains
> > > let us get away with it:
> > >
> > > int foo(int a) { return a / 0; }
> > >
> > > foo:
> > >   li 9,0
> > >   divw 3,3,9
> > >   extsw 3,3
> > >   blr
> > >
> > > But newer ones catch it:
> > >
> > > foo:
> > >   trap
> > >
> > > Add a check to avoid the divide by zero.
> > 
> > Erk sorry. One of the static checkers spotted it, but I hadn't got
> > around to fixing it because it seemed to not actually blow up, guess
> > not.
> 
> The PowerPC divw etc. instructions do not trap by themselves, but recent
> GCC inserts trap instructions on code paths that are always undefined
> behaviour (like, dividing by zero).


Is it systematic or does it depend on, e.g., optimization levels?

Is there anything in the standards about this feature?

Gabriel


Re: [PATCH 0/2] Allow configurable stack size (especially 32k on PPC64)

2017-02-21 Thread Gabriel Paubert
On Tue, Feb 21, 2017 at 09:24:38AM +1300, Hamish Martin wrote:
> This patch series adds the ability to configure the THREAD_SHIFT value and
> thereby alter the stack size on powerpc systems. We are particularly 
> interested
> in configuring for a 32k stack on PPC64.
> 
> Using an NXP T2081 (e6500 PPC64 cores) we are observing stack overflows as a
> result of applying a DTS overlay containing some I2C devices. Our scenario is
> an ethernet switch chassis with plug-in cards. The I2C is driven from the 
> T2081
> through a PCA9548 mux on the main board. When we detect insertion of the 
> plugin
> card we schedule work for a call to of_overlay_create() to install a DTS
> overlay for the plugin board. This DTS overlay contains a further PCA9548 mux
> with more devices hanging off it including a PCA9539 GPIO expander. The
> ultimate installed I2C tree is:
> 
> T2081 --- PCA9548 MUX --- PCA9548 MUX --- PCA9539 GPIO Expander
> 
> When we install the overlay the devices described in the overlay are probed 
> and
> we see a large number of stack frames used as a result. If this is coupled 
> with
> an interrupt happening that requires moderate to high stack use we observe
> stack corruption. Here is an example long stack (from a 4.10-rc8 kernel) that
> does not show corruption but does demonstrate the length and frame sizes
> involved.
> 
>         Depth     Size   Location    (72 entries)
>         -----     ----   --------
>   0)    13872      128   .__raise_softirq_irqoff+0x1c/0x130
>   1)    13744      144   .raise_softirq+0x30/0x70
>   2)    13600      112   .invoke_rcu_core+0x54/0x70
>   3)    13488      336   .rcu_check_callbacks+0x294/0xde0
>   4)    13152      128   .update_process_times+0x40/0x90
>   5)    13024      144   .tick_sched_handle.isra.16+0x40/0xb0
>   6)    12880      144   .tick_sched_timer+0x6c/0xe0
>   7)    12736      272   .__hrtimer_run_queues+0x1a0/0x4b0
>   8)    12464      208   .hrtimer_interrupt+0xe8/0x2a0
>   9)    12256      160   .__timer_interrupt+0xdc/0x330
>  10)    12096      160   .timer_interrupt+0x138/0x190
>  11)    11936      752   exc_0x900_common+0xe0/0xe4
>  12)    11184      128   .ftrace_ops_no_ops+0x11c/0x230
>  13)    11056      176   .ftrace_ops_test.isra.12+0x30/0x50
>  14)    10880      160   .ftrace_ops_no_ops+0xd4/0x230
>  15)    10720      112   ftrace_call+0x4/0x8
>  16)    10608      176   .lock_timer_base+0x3c/0xf0
>  17)    10432      144   .try_to_del_timer_sync+0x2c/0x90
>  18)    10288      128   .del_timer_sync+0x60/0x80
>  19)    10160      256   .schedule_timeout+0x1fc/0x490
>  20)     9904      208   .i2c_wait+0x238/0x290
>  21)     9696      256   .mpc_xfer+0x4e4/0x570
>  22)     9440      208   .__i2c_transfer+0x158/0x6d0
>  23)     9232      192   .pca954x_reg_write+0x70/0x110
>  24)     9040      160   .__i2c_mux_master_xfer+0xb4/0xf0
>  25)     8880      208   .__i2c_transfer+0x158/0x6d0
>  26)     8672      192   .pca954x_reg_write+0x70/0x110
>  27)     8480      144   .pca954x_select_chan+0x68/0xa0
>  28)     8336      160   .__i2c_mux_master_xfer+0x64/0xf0
>  29)     8176      208   .__i2c_transfer+0x158/0x6d0
>  30)     7968      144   .i2c_transfer+0x98/0x130
>  31)     7824      320   .i2c_smbus_xfer_emulated+0x168/0x600
>  32)     7504      208   .i2c_smbus_xfer+0x1c0/0x5d0
>  33)     7296      192   .i2c_smbus_write_byte_data+0x50/0x70
>  34)     7104      144   .pca953x_write_single+0x6c/0xe0
>  35)     6960      192   .pca953x_gpio_direction_output+0xa4/0x160
>  36)     6768      160   ._gpiod_direction_output_raw+0xec/0x460
>  37)     6608      160   .gpiod_hog+0x98/0x250
>  38)     6448      176   .of_gpiochip_add+0xdc/0x1c0
>  39)     6272      256   .gpiochip_add_data+0x4f4/0x8c0
>  40)     6016      144   .devm_gpiochip_add_data+0x64/0xf0
>  41)     5872      208   .pca953x_probe+0x2b4/0x5f0
>  42)     5664      144   .i2c_device_probe+0x224/0x2e0
>  43)     5520      160   .really_probe+0x244/0x380
>  44)     5360      160   .bus_for_each_drv+0x94/0x100
>  45)     5200      160   .__device_attach+0x118/0x160
>  46)     5040      144   .bus_probe_device+0xe8/0x100
>  47)     4896      208   .device_add+0x500/0x6c0
>  48)     4688      144   .i2c_new_device+0x1f8/0x240
>  49)     4544      256   .of_i2c_register_device+0x160/0x280
>  50)     4288      192   .i2c_register_adapter+0x238/0x630
>  51)     4096      208   .i2c_mux_add_adapter+0x3f8/0x540
>  52)     3888      192   .pca954x_probe+0x234/0x370
>  53)     3696      144   .i2c_device_probe+0x224/0x2e0
>  54)     3552      160   .really_probe+0x244/0x380
>  55)     3392      160   .bus_for_each_drv+0x94/0x100
>  56)     3232      160   .__device_attach+0x118/0x160
>  57)     3072      144   .bus_probe_device+0xe8/0x100
>  58)     2928      208   .device_add+0x500/0x6c0
>  59)     2720      144   .i2c_new_device+0x1f8/0x240
>  60)     2576      256   .of_i2c_register_device+0x160/0x280
>  61)     2320      144   .of_i2c_notify+0x12c/0x1d0
>  62)     2176      160   .notifier_call_chain+0x8c/0x100
>  63)     2016      160   

Re: [PATCH v3 11/11] powerpc: rewrite local_t using soft_irq

2016-11-15 Thread Gabriel Paubert
Hi,

On Mon, Nov 14, 2016 at 11:34:52PM +0530, Madhavan Srinivasan wrote:
> Local atomic operations are fast and highly reentrant per CPU counters.
> Used for percpu variable updates. Local atomic operations only guarantee
> variable modification atomicity wrt the CPU which owns the data and
> these needs to be executed in a preemption safe way.
> 
> Here is the design of this patch. Since local_* operations
> are only need to be atomic to interrupts (IIUC), we have two options.
> Either replay the "op" if interrupted or replay the interrupt after
> the "op". Initial patchset posted was based on implementing local_* operation
> based on CR5 which replay's the "op". Patchset had issues in case of
> rewinding the address pointor from an array. This make the slow patch

s/patch/path/ ?

> really slow. Since CR5 based implementation proposed using __ex_table to find
> the rewind addressr, this rasied concerns about size of __ex_table and 
> vmlinux.
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-December/123115.html
> 
> But this patch uses, powerpc_local_irq_pmu_save to soft_disable
> interrupts (including PMIs). After finishing the "op", 
> powerpc_local_irq_pmu_restore()
> called and correspondingly interrupts are replayed if any occured.
> 
> patch re-write the current local_* functions to use arch_local_irq_disbale.
> Base flow for each function is
> 
> {
>   powerpc_local_irq_pmu_save(flags)
>   load
>   ..
>   store
>   powerpc_local_irq_pmu_restore(flags)
> }
> 
> Reason for the approach is that, currently l[w/d]arx/st[w/d]cx.
> instruction pair is used for local_* operations, which are heavy
> on cycle count and they dont support a local variant. So to
> see whether the new implementation helps, used a modified
> version of Rusty's benchmark code on local_t.
> 
> https://lkml.org/lkml/2008/12/16/450
> 
> Modifications to Rusty's benchmark code:
> - Executed only local_t test
> 
> Here are the values with the patch.
> 
> Time in ns per iteration
> 
> Local_t         Without Patch   With Patch
> 
> _inc                 28              8
> _add                 28              8
> _read                 3              3
> _add_return          28              7
> 
> Currently only asm/local.h has been rewrite, and also

s/rewrite/rewritten/

> the entire change is tested only in PPC64 (pseries guest)
> and PPC64 host (LE)

I'm really wondering whether this is the kind of thing that would
benefit from transactions. But transactional memory was only implemented
on Power7 and later IIRC, and we still need to support machines without
transactional memory. 

Boot time patching would probably be very cumbersome, so I believe
that the only sensible thing to do would be to add a config option 
that would generate a kernel that only runs on machines with 
transactional memory.

And then benchmark...

Gabriel

> 
> Reviewed-by: Nicholas Piggin 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/local.h | 201 
> +++
>  1 file changed, 201 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/local.h 
> b/arch/powerpc/include/asm/local.h
> index b8da91363864..7d117c07b0b1 100644
> --- a/arch/powerpc/include/asm/local.h
> +++ b/arch/powerpc/include/asm/local.h
> @@ -3,6 +3,9 @@
>  
>  #include 
>  #include 
> +#include 
> +
> +#include 
>  
>  typedef struct
>  {
> @@ -14,6 +17,202 @@ typedef struct
>  #define local_read(l)        atomic_long_read(&(l)->a)
>  #define local_set(l,i)   atomic_long_set(&(l)->a, (i))
>  
> +#ifdef CONFIG_PPC64
> +
> +static __inline__ void local_add(long i, local_t *l)
> +{
> + long t;
> + unsigned long flags;
> +
> + powerpc_local_irq_pmu_save(flags);
> + __asm__ __volatile__(
> + PPC_LL" %0,0(%2)\n\
> + add %0,%1,%0\n"
> + PPC_STL" %0,0(%2)\n"
> + : "=" (t)
> + : "r" (i), "r" (&(l->a.counter)));
> + powerpc_local_irq_pmu_restore(flags);
> +}
> +
> +static __inline__ void local_sub(long i, local_t *l)
> +{
> + long t;
> + unsigned long flags;
> +
> + powerpc_local_irq_pmu_save(flags);
> + __asm__ __volatile__(
> + PPC_LL" %0,0(%2)\n\
> + subf    %0,%1,%0\n"
> + PPC_STL" %0,0(%2)\n"
> + : "=" (t)
> + : "r" (i), "r" (&(l->a.counter)));
> + powerpc_local_irq_pmu_restore(flags);
> +}
> +
> +static __inline__ long local_add_return(long a, local_t *l)
> +{
> + long t;
> + unsigned long flags;
> +
> + powerpc_local_irq_pmu_save(flags);
> + __asm__ __volatile__(
> + PPC_LL" %0,0(%2)\n\
> + add %0,%1,%0\n"
> + PPC_STL "%0,0(%2)\n"
> + : "=" (t)
> + : "r" (a), "r" (&(l->a.counter))
> + : "memory");
> + powerpc_local_irq_pmu_restore(flags);
> +
> + return t;
> +}
> +
> +#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
> +
> +static __inline__ long 

Re: [PATCH v3 2/5] firmware: annotate thou shalt not request fw on init or probe

2016-08-24 Thread Gabriel Paubert
On Tue, Aug 23, 2016 at 05:45:04PM -0700, mcg...@kernel.org wrote:

[snip]
> ---
>  Documentation/firmware_class/README|  20 
>  drivers/base/Kconfig   |   2 +-
>  .../request_firmware-avoid-init-probe-init.cocci   | 130 
> +
>  3 files changed, 151 insertions(+), 1 deletion(-)
>  create mode 100644 
> scripts/coccinelle/api/request_firmware-avoid-init-probe-init.cocci
> 
> diff --git a/Documentation/firmware_class/README 
> b/Documentation/firmware_class/README
> index cafdca8b3b15..056d1cb9d365 100644
> --- a/Documentation/firmware_class/README
> +++ b/Documentation/firmware_class/README
> @@ -93,6 +93,26 @@
> user contexts to request firmware asynchronously, but can't be called
> in atomic contexts.
>  
> +Requirements:
> +=
> +
> +You should avoid at all costs requesting firmware on both init and probe 
> paths
> +of your device driver. Reason for this is the complexity needed to ensure a
> +firmware will be available for a driver early in boot through different
> +build configurations. Consider built-in drivers needing firmware early, or
> +consider a driver assuming it will only get firmware after pivot_root().
> +
> +Drivers that really need firmware early should use stuff the firmware in

Minor grammatical nit: s/use//

> +initramfs or consider using CONFIG_EXTRA_FIRMWARE. Using initramfs is much
> +more portable to more distributions as not all distributions wish to enable
> +CONFIG_EXTRA_FIRMWARE. Should a driver require the firmware being built-in
> +it should depend on CONFIG_EXTRA_FIRMWARE. There is no current annotation for
> +requiring a firmware on initramfs.
> +
> +If you're a maintainer you can help police this with:
> +
> +$ export 
> COCCI=scripts/coccinelle/api/request_firmware-avoid-init-probe-init.cocci
> +$ make coccicheck MODE=report
>  
>   about in-kernel persistence:
>   ---


Re: [PATCH] powerpc/8xx: fix single_step debug

2016-08-18 Thread Gabriel Paubert
On Thu, Aug 18, 2016 at 12:13:21PM +0200, Christophe Leroy wrote:
> 
> 
> On 18/08/2016 at 11:58, Gabriel Paubert wrote:
> >On Thu, Aug 18, 2016 at 11:44:20AM +0200, Christophe Leroy wrote:
> >>SPRN_ICR must be read for clearing the internal freeze signal which
> >>is asserted by the single step exception, otherwise the timebase and
> >>decrementer remain freezed
> >
> >Minor nit: s/freezed/frozen/
> >
> >If the timebase and decrementer are frozen even for a few cycles, this
> >probably upsets timekeeping. I consider this a completely stupid design
> >decision, and maybe I'm not alone.
> >
> >Gabriel
> 
> We could also unset TBF bit (TimeBase Freeze enable) in TBSCR
> register (today it is set in
> arch/powerpc/platforms/8xx/m8xx_setup.c) but then it would impact
> debug done with an external BDM system which expects the decrementer
> and TB frozen when it freezes the execution.

Ok, I believe that systematically setting it is a mistake, but then I'm
always a bit nervous about screwing up timekeeping (it certainly is
always a very bad idea when you are driving telescopes).

Gabriel


Re: [PATCH] powerpc/8xx: fix single_step debug

2016-08-18 Thread Gabriel Paubert
On Thu, Aug 18, 2016 at 11:44:20AM +0200, Christophe Leroy wrote:
> SPRN_ICR must be read for clearing the internal freeze signal which
> is asserted by the single step exception, otherwise the timebase and
> decrementer remain freezed

Minor nit: s/freezed/frozen/

If the timebase and decrementer are frozen even for a few cycles, this
probably upsets timekeeping. I consider this a completely stupid design
decision, and maybe I'm not alone.

Gabriel

> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/reg_8xx.h | 1 +
>  arch/powerpc/kernel/traps.c| 8 
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/reg_8xx.h 
> b/arch/powerpc/include/asm/reg_8xx.h
> index feaf641..6dae71f 100644
> --- a/arch/powerpc/include/asm/reg_8xx.h
> +++ b/arch/powerpc/include/asm/reg_8xx.h
> @@ -17,6 +17,7 @@
>  #define SPRN_DC_DAT  570 /* Read-only data register */
>  
>  /* Misc Debug */
> +#define SPRN_ICR 148
>  #define SPRN_DPDR630
>  #define SPRN_MI_CAM  816
>  #define SPRN_MI_RAM0 817
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 2cb5892..0f1f0ce 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -400,8 +400,16 @@ static inline int check_io_access(struct pt_regs *regs)
>  #define REASON_TRAP  0x2
>  
>  #define single_stepping(regs)        ((regs)->msr & MSR_SE)
> +#ifdef CONFIG_PPC_8xx
> +static inline void clear_single_step(struct pt_regs *regs)
> +{
> + regs->msr &= ~MSR_SE;
> + mfspr(SPRN_ICR);
> +}
> +#else
>  #define clear_single_step(regs)  ((regs)->msr &= ~MSR_SE)
>  #endif
> +#endif
>  
>  #if defined(CONFIG_4xx)
>  int machine_check_4xx(struct pt_regs *regs)
> -- 
> 2.1.0


Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-12 Thread Gabriel Paubert
On Thu, Aug 11, 2016 at 05:11:19PM -0500, Segher Boessenkool wrote:
> On Thu, Aug 11, 2016 at 11:34:37PM +0200, Gabriel Paubert wrote:
> > On the other hand gcc did at the time a very poor job (quite an
> > understatement) at bswapdi when compiling for 64 bit processors 
> > (see the example).
> > 
> > But what do modern compilers generate for bswapdi these days? Do they
> > still call the library or not?
> 
> Nope.

Great, could these functions then be removed from misc_32.S, or are
compilers that use libcalls still supported for kernel builds?

> 
> > After all, bswapdi on 32 bit processors only takes 6 instructions if the
> > input and output registers don't overlap.
> 
> For this testcase:
> ===
> typedef unsigned long long u64;
> u64 bs(u64 x) { return __builtin_bswap64(x); }
> ===
> 
> we get with -m32:
> ===
> bs:
>   mr 9,3
>   rotlwi 3,4,24
>   rlwimi 3,4,8,8,15
>   rlwimi 3,4,8,24,31
>   rotlwi 4,9,24
>   rlwimi 4,9,8,8,15
>   rlwimi 4,9,8,24,31
>   blr

In this case the compiler is constrained by the fact that the input and
output registers are the same. When inlined with other things it can
probably perform better scheduling and interleaving of operations.


> ===
> 
> and with -m64:
> ===
> .L.bs:
>   srdi 10,3,32
>   mr 9,3
>   rotlwi 3,3,24
>   rotlwi 8,10,24
>   rlwimi 3,9,8,8,15
>   rlwimi 8,10,8,8,15
>   rlwimi 3,9,8,24,31
>   rlwimi 8,10,8,24,31
>   sldi 3,3,32
>   or 3,3,8
>   blr
> ===
> 

As demonstrated here where the two halves of the 64 bit quantity
are byte swapped in an interleaved fashion. Not perfect (I think
that with proper ordering the last 2 instructions could be replaced
by a rldimi), but reasonable.

> Neither as tight as possible, but neither horrible either.
> 

Indeed.

Gabriel


Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-11 Thread Gabriel Paubert
On Wed, Aug 10, 2016 at 12:18:15PM +0200, Christophe Leroy wrote:
> 
> 
> >On 10/08/2016 at 10:56, Gabriel Paubert wrote:
> >On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:
> >>Signed-off-by: Christophe Leroy <christophe.le...@c-s.fr>
> >>---
> >> arch/powerpc/kernel/misc_32.S | 3 +--
> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >>diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> >>index e025230..e18055c 100644
> >>--- a/arch/powerpc/kernel/misc_32.S
> >>+++ b/arch/powerpc/kernel/misc_32.S
> >>@@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
> >>rlwimi  r9,r4,24,0,7
> >>rlwimi  r10,r3,24,0,7
> >>rlwimi  r9,r4,24,16,23
> >>-   rlwimi  r10,r3,24,16,23
> >>+   rlwimi  r4,r3,24,16,23
> >>mr  r3,r9
> >>-   mr  r4,r10
> >>blr
> >>
> >
> >Hmmm, are you sure that it works? rlwimi is a bit special since the
> >first operand is both an input and an output of the instruction.
> >
> >
> 
> Oops, you are right ...

I just found this: 

http://hardwarebug.org/2010/01/14/beware-the-builtins/

the bswapdi2 suggested sequence only needs a single mr instruction, the 
other one is absorbed in a rotlwi.

The scheduling looks poor, but it seems impossible to interleave the
operations between the two halves without adding another instruction,
and the routine is 8 instructions long, which happens to be exactly a
cache line on most 32 bit processors.

On the other hand gcc did at the time a very poor job (quite an
understatement) at bswapdi when compiling for 64 bit processors 
(see the example).

But what do modern compilers generate for bswapdi these days? Do they
still call the library or not?

After all, bswapdi on 32 bit processors only takes 6 instructions if the
input and output registers don't overlap.

Gabriel



Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

2016-08-10 Thread Gabriel Paubert
On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/misc_32.S | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> index e025230..e18055c 100644
> --- a/arch/powerpc/kernel/misc_32.S
> +++ b/arch/powerpc/kernel/misc_32.S
> @@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
>   rlwimi  r9,r4,24,0,7
>   rlwimi  r10,r3,24,0,7
>   rlwimi  r9,r4,24,16,23
> - rlwimi  r10,r3,24,16,23
> + rlwimi  r4,r3,24,16,23
>   mr  r3,r9
> - mr  r4,r10
>   blr
>  

Hmmm, are you sure that it works? rlwimi is a bit special since the
first operand is both an input and an output of the instruction.


In other words, did you test the code?

Gabriel


Re: [PATCH] powerpc: Fix crash during static key init on ppc32

2016-08-10 Thread Gabriel Paubert
On Wed, Aug 10, 2016 at 01:17:55AM -0700, Christian Kujau wrote:
> On Wed, 10 Aug 2016, Benjamin Herrenschmidt wrote:
> > We cannot do those initializations from apply_feature_fixups() as
> > this function runs in a very restricted environment in 32-bit where
> > the kernel isn't running at its linked address and the PTRRELOC()
> > macro must be used for any global accesss.
> > 
> > Instead, split them into a separtate steup_feature_keys() function
> > which is called in a more suitable spot on ppc32.
> 
> Wow, cool. With that applied (on top of mainline from some minutes ago), 
> this PowerPC G4 boots again. Thanks!

Just a question, does sleep work on your PowerBook? 

It has been broken for several releases of the kernel on mine, but I've not
had time to investigate it and it might also be a distro/systemd issue.

Regards,
Gabriel


Re: [PATCH 1/3] powerpc: Avoid load hit store in __giveup_fpu() and __giveup_altivec()

2016-05-30 Thread Gabriel Paubert
On Sun, May 29, 2016 at 10:03:50PM +1000, Anton Blanchard wrote:
> From: Anton Blanchard 
> 
> In both __giveup_fpu() and __giveup_altivec() we make two modifications
> to tsk->thread.regs->msr. gcc decides to do a read/modify/write of
> each change, so we end up with a load hit store:
> 
> ld  r9,264(r10)
> rldicl  r9,r9,50,1
> rotldi  r9,r9,14
> std r9,264(r10)
> ...
> ld  r9,264(r10)
> rldicl  r9,r9,40,1
> rotldi  r9,r9,24
> std r9,264(r10)
> 
> Fix this by using a temporary.
> 
> Signed-off-by: Anton Blanchard 
> ---
>  arch/powerpc/kernel/process.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index e2f12cb..7da1f92 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -139,12 +139,16 @@ EXPORT_SYMBOL(__msr_check_and_clear);
>  #ifdef CONFIG_PPC_FPU
>  void __giveup_fpu(struct task_struct *tsk)
>  {
> + u64 msr;

Huh? Make it an unsigned long please, which is the type of the msr
field in struct pt_regs to work on both 32 and 64 bit processors.

> +
>   save_fpu(tsk);
> - tsk->thread.regs->msr &= ~MSR_FP;
> + msr = tsk->thread.regs->msr;
> + msr &= ~MSR_FP;
>  #ifdef CONFIG_VSX
>   if (cpu_has_feature(CPU_FTR_VSX))
> - tsk->thread.regs->msr &= ~MSR_VSX;
> + msr &= ~MSR_VSX;
>  #endif
> + tsk->thread.regs->msr = msr;
>  }
>  
>  void giveup_fpu(struct task_struct *tsk)
> @@ -219,12 +223,16 @@ static int restore_fp(struct task_struct *tsk) { return 
> 0; }
>  
>  static void __giveup_altivec(struct task_struct *tsk)
>  {
> + u64 msr;
> +

Ditto.

>   save_altivec(tsk);
> - tsk->thread.regs->msr &= ~MSR_VEC;
> + msr = tsk->thread.regs->msr;
> + msr &= ~MSR_VEC;
>  #ifdef CONFIG_VSX
>   if (cpu_has_feature(CPU_FTR_VSX))
> - tsk->thread.regs->msr &= ~MSR_VSX;
> + msr &= ~MSR_VSX;
>  #endif
> + tsk->thread.regs->msr = msr;
>  }
>  
>  void giveup_altivec(struct task_struct *tsk)
> -- 
> 2.7.4
> 

Gabriel

Re: [PATCH] powerpc32: use stmw/lmw for non volatile registers save/restore

2016-05-24 Thread Gabriel Paubert
On Mon, May 23, 2016 at 10:46:36AM +0200, Christophe Leroy wrote:
> lmw/stmw take 1 extra cycle (2 cycles for lmw on some ppc) and
> imply serialising, however they reduce the number of instructions,
> hence the amount of instruction fetch, compared to the equivalent
> operation with several lzw/stw. It means less pressure on cache and

Minor typo, s/lzw/lwz/.

> less fetching delays on slow memory.
> When we transfer 20 registers, it is worth it.
> gcc uses stmw/lmw at function entry/exit to save/restore non
> volatile register, so lets also do it that way.
> 
> On powerpc64, we can't use lmw/stmw as it only handles 32 bits, so
> we move longjmp() and setjmp() from misc.S to misc_64.S, and we
> write a 32 bits version in misc_32.S using stmw/lmw
> 
> Signed-off-by: Christophe Leroy 
> ---
> The patch goes on top of "powerpc: inline current_stack_pointer()" or
> requires trivial manual merge in arch/powerpc/kernel/misc.S
> 
>  arch/powerpc/include/asm/ppc_asm.h |  6 ++--
>  arch/powerpc/kernel/misc.S | 61 
> --
>  arch/powerpc/kernel/misc_32.S  | 22 ++
>  arch/powerpc/kernel/misc_64.S  | 61 
> ++
>  4 files changed, 85 insertions(+), 65 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ppc_asm.h 
> b/arch/powerpc/include/asm/ppc_asm.h
> index 2b31632..e29b649 100644
> --- a/arch/powerpc/include/asm/ppc_asm.h
> +++ b/arch/powerpc/include/asm/ppc_asm.h
> @@ -82,10 +82,8 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  #else
>  #define SAVE_GPR(n, base)stw n,GPR0+4*(n)(base)
>  #define REST_GPR(n, base)lwz n,GPR0+4*(n)(base)
> -#define SAVE_NVGPRS(base)SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
> - SAVE_10GPRS(22, base)
> -#define REST_NVGPRS(base)REST_GPR(13, base); REST_8GPRS(14, base); \
> - REST_10GPRS(22, base)
> +#define SAVE_NVGPRS(base)stmw13, GPR0+4*13(base)
> +#define REST_NVGPRS(base)lmw 13, GPR0+4*13(base)
>  #endif
>  
>  #define SAVE_2GPRS(n, base)  SAVE_GPR(n, base); SAVE_GPR(n+1, base)
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 7ce26d4..9de71d8 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -53,64 +53,3 @@ _GLOBAL(add_reloc_offset)
>  
>   .align  3
>  2:   PPC_LONG 1b
> -
> -_GLOBAL(setjmp)
> - mflr    r0
> - PPC_STL r0,0(r3)
> - PPC_STL r1,SZL(r3)
> - PPC_STL r2,2*SZL(r3)
> - mfcr    r0
> - PPC_STL r0,3*SZL(r3)
> - PPC_STL r13,4*SZL(r3)
> - PPC_STL r14,5*SZL(r3)
> - PPC_STL r15,6*SZL(r3)
> - PPC_STL r16,7*SZL(r3)
> - PPC_STL r17,8*SZL(r3)
> - PPC_STL r18,9*SZL(r3)
> - PPC_STL r19,10*SZL(r3)
> - PPC_STL r20,11*SZL(r3)
> - PPC_STL r21,12*SZL(r3)
> - PPC_STL r22,13*SZL(r3)
> - PPC_STL r23,14*SZL(r3)
> - PPC_STL r24,15*SZL(r3)
> - PPC_STL r25,16*SZL(r3)
> - PPC_STL r26,17*SZL(r3)
> - PPC_STL r27,18*SZL(r3)
> - PPC_STL r28,19*SZL(r3)
> - PPC_STL r29,20*SZL(r3)
> - PPC_STL r30,21*SZL(r3)
> - PPC_STL r31,22*SZL(r3)
> - li  r3,0
> - blr
> -
> -_GLOBAL(longjmp)
> - PPC_LCMPI r4,0
> - bne 1f
> - li  r4,1
> -1:   PPC_LL  r13,4*SZL(r3)
> - PPC_LL  r14,5*SZL(r3)
> - PPC_LL  r15,6*SZL(r3)
> - PPC_LL  r16,7*SZL(r3)
> - PPC_LL  r17,8*SZL(r3)
> - PPC_LL  r18,9*SZL(r3)
> - PPC_LL  r19,10*SZL(r3)
> - PPC_LL  r20,11*SZL(r3)
> - PPC_LL  r21,12*SZL(r3)
> - PPC_LL  r22,13*SZL(r3)
> - PPC_LL  r23,14*SZL(r3)
> - PPC_LL  r24,15*SZL(r3)
> - PPC_LL  r25,16*SZL(r3)
> - PPC_LL  r26,17*SZL(r3)
> - PPC_LL  r27,18*SZL(r3)
> - PPC_LL  r28,19*SZL(r3)
> - PPC_LL  r29,20*SZL(r3)
> - PPC_LL  r30,21*SZL(r3)
> - PPC_LL  r31,22*SZL(r3)
> - PPC_LL  r0,3*SZL(r3)
> - mtcrf   0x38,r0
> - PPC_LL  r0,0(r3)
> - PPC_LL  r1,SZL(r3)
> - PPC_LL  r2,2*SZL(r3)
> - mtlr    r0
> - mr  r3,r4
> - blr
> diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> index d9c912b..de419e9 100644
> --- a/arch/powerpc/kernel/misc_32.S
> +++ b/arch/powerpc/kernel/misc_32.S
> @@ -1086,3 +1086,25 @@ relocate_new_kernel_end:
>  relocate_new_kernel_size:
>   .long relocate_new_kernel_end - relocate_new_kernel
>  #endif
> +
> +_GLOBAL(setjmp)
> + mflr    r0
> + li  r3, 0
> + stw r0, 0(r3)

Huh? Explicitly writing to address 0? Has this code been test run at
least once?

At least move the li r3,0 to just before the blr.

Gabriel

> + stw r1, 4(r3)
> + stw r2, 8(r3)
> + mfcr    r12
> + stmw    r12, 12(r3)
> + blr
> +
> +_GLOBAL(longjmp)
> + lwz r0, 0(r3)
> + lwz r1, 4(r3)
> + lwz r2, 8(r3)
> + lmw r12, 12(r3)
> + mtcrf   0x38, r12
> + mtlr    r0
> + mr. r3, r4
> + bnelr
> + li  r3, 1
> + 

Re: [PATCH] powerpc: inline current_stack_pointer()

2016-05-24 Thread Gabriel Paubert
On Mon, May 23, 2016 at 10:46:02AM +0200, Christophe Leroy wrote:
> current_stack_pointer() is a single instruction function.
> It is not worth breaking the execution flow with a bl/blr for a
> single instruction

Are you sure that the result is always the same? 

Calling an external function prevents the compiler from considering the
caller of current_stack_pointer() a leaf function, which certainly
does not help the compiler, since in a leaf function the compiler is
free not to establish a new frame.

If the compiler decides to establish a new frame (typically with
"stwu r1,-frame_size(r1)"), *r1 is the previous stack pointer, i.e.
the caller's stack pointer, or the current function's frame pointer if
I remember the ABI definitions correctly.

However, if the compiler decides that it can get away without a frame
for the function, *r1 is the stack pointer of the caller's caller.

Depending on the application, this may or may not be important.

By the way, isn't there a GCC builtin which can perform this task,
for example __builtin_frame_address()?
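
Something along these lines, if the semantics are acceptable (a sketch):

static inline unsigned long current_stack_pointer(void)
{
	/* Frame address of the current function, with the same
	 * leaf-function caveat as discussed above. */
	return (unsigned long)__builtin_frame_address(0);
}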

Gabriel
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/reg.h  | 7 ++-
>  arch/powerpc/kernel/misc.S  | 4 
>  arch/powerpc/kernel/ppc_ksyms.c | 2 --
>  3 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index c1e82e9..7ce6777 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1301,7 +1301,12 @@ static inline unsigned long mfvtb (void)
>  
>  #define proc_trap()  asm volatile("trap")
>  
> -extern unsigned long current_stack_pointer(void);
> +static inline unsigned long current_stack_pointer(void)
> +{
> + register unsigned long *ptr asm("r1");
> +
> + return *ptr;
> +}
>  
>  extern unsigned long scom970_read(unsigned int address);
>  extern void scom970_write(unsigned int address, unsigned long value);
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 0d43219..7ce26d4 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -114,7 +114,3 @@ _GLOBAL(longjmp)
>   mtlrr0
>   mr  r3,r4
>   blr
> -
> -_GLOBAL(current_stack_pointer)
> - PPC_LL  r3,0(r1)
> - blr
> diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c
> index 9f01e28..eb5c5dc 100644
> --- a/arch/powerpc/kernel/ppc_ksyms.c
> +++ b/arch/powerpc/kernel/ppc_ksyms.c
> @@ -33,5 +33,3 @@ EXPORT_SYMBOL(store_vr_state);
>  #ifdef CONFIG_EPAPR_PARAVIRT
>  EXPORT_SYMBOL(epapr_hypercall_start);
>  #endif
> -
> -EXPORT_SYMBOL(current_stack_pointer);
> -- 
> 2.1.0

Re: powerpc: Discard ffs() function and use builtin_ffs instead

2016-05-13 Thread Gabriel Paubert
On Fri, May 13, 2016 at 04:16:57PM +1000, Michael Ellerman wrote:
> On Thu, 2016-12-05 at 15:32:22 UTC, Christophe Leroy wrote:
> > With the ffs() function as defined in arch/powerpc/include/asm/bitops.h
> > GCC will not optimise the code in case of constant parameter, as shown
> > by the small example below.
> > 
> > int ffs_test(void)
> > {
> > return 4 << ffs(31);
> > }
> > 
> > c0012334 :
> > c0012334:   39 20 00 01 li  r9,1
> > c0012338:   38 60 00 04 li  r3,4
> > c001233c:   7d 29 00 34 cntlzw  r9,r9
> > c0012340:   21 29 00 20 subfic  r9,r9,32
> > c0012344:   7c 63 48 30 slw r3,r3,r9
> > c0012348:   4e 80 00 20 blr
> > 
> > With this patch, the same function will compile as follows:
> > 
> > c0012334 :
> > c0012334:   38 60 00 08 li  r3,8
> > c0012338:   4e 80 00 20 blr
> 
> 
> But what code does it generate when it's not a constant?
> 
> And which gcc version first added the builtin version?

It already existed in gcc-2.95, which you do not want to use to compile
anything today, but which I keep in a corner of a chroot environment to
maintain ~1997 vintage embedded stuff, running a 2.2.12 kernel!

Hopefully this clears up your concerns :-)
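
For reference, the change under discussion essentially boils down to
this (a sketch; the real patch may differ in detail):

static inline int ffs(int x)
{
	/* Let gcc fold constant arguments at compile time and choose
	 * the instruction sequence for variable ones. */
	return __builtin_ffs(x);
}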

Cheers,
Gabriel

Re: [PATCH v3] ppc64/book3s: fix branching to out of line handlers in relocation kernel

2016-04-01 Thread Gabriel Paubert
Hi Michael,

On Fri, Apr 01, 2016 at 05:14:35PM +1100, Michael Ellerman wrote:
> On Wed, 2016-03-30 at 23:49 +0530, Hari Bathini wrote:
> > Some of the interrupt vectors on 64-bit POWER server processors  are
> > only 32 bytes long (8 instructions), which is not enough for the full
> ...
> > Let us fix this undependable code path by moving these OOL handlers below
> > __end_interrupts marker to make sure we also copy these handlers to real
> > address 0x100 when running a relocatable kernel. Because the interrupt
> > vectors branching to these OOL handlers are not long enough to use
> > LOAD_HANDLER() for branching as discussed above.
> > 
> ...
> > changes from v2:
> > 2. Move the OOL handlers before __end_interrupts marker instead of moving 
> > the __end_interrupts marker
> > 3. Leave __end_handlers marker as is.
> 
> Hi Hari,
> 
> Thanks for trying this. In the end I've decided it's not a good option.
> 
> If you build an allmodconfig, and turn on CONFIG_RELOCATABLE, and then look at
> the disassembly, you see this:
> 
>   c0006ffc:   48 00 29 04 b   c0009900 
> <.ret_from_except>
>   
>   c0007000 <__end_handlers>:
> 
> At 0x7000 we have the FWNMI area, which is fixed and can't move. As you see
> above we end up with only 4 bytes of space between the end of the handlers and
> the FWNMI area.

Nitpicking a bit, but if I read the above disassembly correctly and
there is an instruction at 0x6ffc, the free space is exactly 0!

> 
> So any tiny change that adds two more instructions prior to 0x7000 will then
> fail to build.

Even one instruction, provided I still know how to count.

> 
> None of that's your fault, it's just the nature of the code in there, it's 
> very
> space constrained.

Calling it very space constrained makes you win the understatement of the
month award, on April Fools' day :-)

Regards,
Gabriel

Re: [1/1] powerpc/embedded6xx: Make reboot works on MVME5100

2016-03-10 Thread Gabriel Paubert
On Wed, Mar 09, 2016 at 03:26:21PM -0600, Scott Wood wrote:
> On Wed, 2016-03-09 at 11:28 +0100, Gabriel Paubert wrote:
> > On Wed, Mar 09, 2016 at 12:38:18AM -0600, Scott Wood wrote:
> > > On Tue, Mar 08, 2016 at 08:59:12AM +0100, Alessio Igor Bogani wrote:
> > > > The mtmsr() function hangs during restart. Make reboot works on
> > > > MVME5100 removing that function call.
> > > > ---
> > > >  arch/powerpc/platforms/embedded6xx/mvme5100.c | 2 --
> > > >  1 file changed, 2 deletions(-)
> > > 
> > > Missing signoff
> > > 
> > > Do you know why MSR_IP was there to begin with? 
> > 
> > Huh? The patch sets MSR_IP for reset but it is cleared while running Linux.
> > I don't have any MVME5100, however I do have MVME2400/2700 with the same
> > bridge (Hawk), so I can say that the address space layout is quite standard:
> > memory at 0, ROM at the high end of the 32-bit address space. However the
> > reset method is quite different (no external register set on the Hawk).
> 
> I meant why the line of code was there, not why the bit was ever set.  It
> seems odd to me that the pre-reset value of MSR would matter at all --
> shouldn't it get reset along with the rest of the CPU, as you later suggest? 

I don't know, it may create a soft reset. I have the board documentation
and it is not clear (at least not 100% clear to me) whether this reset 
drives the HRESET or SRESET pin, although HRESET is more likely.

>  And if it does matter, then why are Alessio and whoever wrote the code
> (assuming it wasn't an untested copy-and-paste from somewhere else) seeing
> different results regarding whether IP needs to be set or unset?

The different results are perfectly explained by taking a DSI exception
on accessing *restart (the only access to this address in the whole
code, so the hash table has to be filled at this time). With MSR_IP set, 
DSI exception goes to 0xfff00300, and then probably hangs.

> 
> > > Does this board have a switch that determines whether boot vectors 
> > > are high or low (I remember some 83xx boards that did), in which 
> > > case is this fixing one config by breaking another?
> > 
> > For the switch, no AFAICT. And the code is MVME5100 specific so I
> > suspect that it is very unlikely to break other boards.
> >
> > Very likely the source of the problem is that the restart address is
> > remapped 
> > (ioremap) but never accessed while the kernel is running (the only access to
> > *restart is in the reboot routine) so we take a DSI exception to fill the
> > hash 
> > table when attempting to reboot. 
> >
> > It would be enough to move the setting of MSR_IP until after triggering the 
> > restart, but this performs a hard reset of the CPU, which will set MSR_IP 
> > anyway (granted that the CPU will probably set MSR_IP way before the reset 
> > signal comes in).
> 
> How can you perform anything after the reset is triggered?  There might be
> some lag before it takes effect, but depending on that seems hackish and
> unreliable, and why would it make a difference?

It would make a difference if this reset is connected to SRESET, which
does not affect MSR_IP.

> 
> > One way to check this hypothesis would be to introduce a write of 0 to
> > the restart address before setting MSR_IP.
> 
> I'm not familiar with the register interface for this board logic, but what
> would that demonstrate?

Writing a 0 would not trigger the reset, but ensure that the hash table
is filled, then you can change MSR_IP and perform the real reset, and be
sure that you'll end up executing the reset in FLASH, not in RAM.

The alternative solution is to put code at 0x100 that
sets MSR_IP and branches to 0xfff00100.
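
A sketch of the probe-write idea (untested; it assumes 0x01 is the
value the existing code writes to trigger the reset):

	local_irq_disable();
	out_8(restart, 0x00);		/* fault the mapping into the hash
					   table, a 0 does not reset */
	mtmsr(mfmsr() | MSR_IP);	/* exceptions now go to the ROM */
	out_8(restart, 0x01);		/* perform the real reset */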

> 
> > This said restart is declared as u_char *, so the cast in the out_8 
> > register access is useless and ugly.
> 
> Yes, and restart should be __iomem.

Indeed.

Gabriel

Re: [1/1] powerpc/embedded6xx: Make reboot works on MVME5100

2016-03-09 Thread Gabriel Paubert
On Wed, Mar 09, 2016 at 12:38:18AM -0600, Scott Wood wrote:
> On Tue, Mar 08, 2016 at 08:59:12AM +0100, Alessio Igor Bogani wrote:
> > The mtmsr() function hangs during restart. Make reboot works on
> > MVME5100 removing that function call.
> > ---
> >  arch/powerpc/platforms/embedded6xx/mvme5100.c | 2 --
> >  1 file changed, 2 deletions(-)
> 
> Missing signoff
> 
> Do you know why MSR_IP was there to begin with? 

Huh? The patch sets MSR_IP for reset but it is cleared while running Linux.
I don't have any MVME5100, however I do have MVME2400/2700 with the same
bridge (Hawk), so I can say that the address space layout is quite standard: 
memory at 0, ROM at the high end of the 32-bit address space. However the
reset method is quite different (no external register set on the Hawk).

> Does this board have a switch that determines whether boot vectors 
> are high or low (I remember some 83xx boards that did), in which 
> case is this fixing one config by breaking another?

For the switch, no AFAICT. And the code is MVME5100 specific so I
suspect that it is very unlikely to break other boards.

Very likely the source of the problem is that the restart address is remapped 
(ioremap) but never accessed while the kernel is running (the only access to
*restart is in the reboot routine) so we take a DSI exception to fill the hash 
table when attempting to reboot. 

It would be enough to move the setting of MSR_IP until after triggering the 
restart, but this performs a hard reset of the CPU, which will set MSR_IP 
anyway (granted that the CPU will probably set MSR_IP way before the reset 
signal comes in).

One way to check this hypothesis would be to introduce a write of 0 to
the restart address before setting MSR_IP.

This said restart is declared as u_char *, so the cast in the out_8 
register access is useless and ugly.

Gabriel

P.S.: my MVME24xx/26x/27xx do not run such a modern kernel. 

Re: [PATCH v3 2/3] powerpc: Add __raw_rm_writeq() function

2015-12-17 Thread Gabriel Paubert
On Thu, Dec 17, 2015 at 01:43:12PM +1100, Alistair Popple wrote:
> Move __raw_rw_writeq() from platforms/powernv/pci-ioda.c to
  ^
Typo?
(not a big deal)

Gabriel

> include/asm/io.h so that it can be used by other code.
> 
> Signed-off-by: Alistair Popple 
> ---
>  arch/powerpc/include/asm/io.h | 11 +++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 --
>  2 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 5879fde..6c1297e 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -385,6 +385,17 @@ static inline void __raw_writeq(unsigned long v, 
> volatile void __iomem *addr)
>  {
>   *(volatile unsigned long __force *)PCI_FIX_ADDR(addr) = v;
>  }
> +
> +/*
> + * Real mode version of the above. stdcix is only supposed to be used
> + * in hypervisor real mode as per the architecture spec.
> + */
> +static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
> +{
> + __asm__ __volatile__("stdcix %0,0,%1"
> + : : "r" (val), "r" (paddr) : "memory");
> +}
> +
>  #endif /* __powerpc64__ */
>  
>  /*
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 42b4bb2..cb04642 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -116,16 +116,6 @@ static int __init iommu_setup(char *str)
>  }
>  early_param("iommu", iommu_setup);
>  
> -/*
> - * stdcix is only supposed to be used in hypervisor real mode as per
> - * the architecture spec
> - */
> -static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
> -{
> - __asm__ __volatile__("stdcix %0,0,%1"
> - : : "r" (val), "r" (paddr) : "memory");
> -}
> -
>  static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>  {
>   return ((flags & (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH)) ==
> -- 
> 2.1.4
> 

Re: [P.A. Semi] Does the ethernet interface work on your Electra, Chitra, Nemo, and Athena board?

2015-11-30 Thread Gabriel Paubert
On Mon, Nov 30, 2015 at 06:08:23PM +0100, Christian Zigotzky wrote:
> Hi All,
> 
> I have tested the PA Semi Ethernet with the kernels 4.2.3 and 4.3.0
> today. With the kernel 4.2.3 it works but with the kernel 4.3.0
> final it doesn't work.
> 
> After that I tested some git kernels and release candidates of 4.3.
> 
> Kernel 4.3 git from Tue Sep 01, 2015 -> PA Semi Ethernet works
> Kernel 4.3 git from Wed Sep 02, 2015 -> PA Semi Ethernet works
> Kernel 4.3 git from Thu Sep 03, 2015 -> PA Semi Ethernet works
> Kernel 4.3 git from Fri Sep 04, 2015 -> PA Semi Ethernet doesn't
> work (Merge tag 'powerpc-4.3-1': 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ff474e8ca8547d09cb82ebab56d4c96f9eea01ce)
> Kernel 4.3 git from Sat Sep 05, 2015 -> PA Semi Ethernet doesn't work
> Kernel 4.3 git from Mon Sep 07, 2015 -> PA Semi Ethernet doesn't work
> Kernel 4.3 git from Wed Sep 09, 2015 -> PA Semi Ethernet doesn't work
> Kernel 4.3 git from Fri Sep 11, 2015 -> PA Semi Ethernet doesn't work
> Kernel 4.3 RC1 from Sun Sep 13, 2015 -> PA Semi Ethernet doesn't work
> Kernel 4.3 RC2 from Mon Sep 21, 2015 -> PA Semi Ethernet doesn't work
> 
> The problematic commit must be between Thu Sep 03, 2015 at 09:37 AM
> (UTC +2) and Fri Sep 04, 2015 at 7:38 PM (UTC +2) in the linux git.
> 
> Linux git: Between 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?ofs=15500
> and 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?ofs=15200.
> 
> Maybe 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ff474e8ca8547d09cb82ebab56d4c96f9eea01ce.

Unless I miss something, that's the kind of regression that "git bisect" is 
designed to find.
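
For instance (a sketch; the bad commit below is the suspected merge
named above, the good one is whatever was last built on Sep 03):

	git bisect start
	git bisect bad ff474e8ca8547d09cb82ebab56d4c96f9eea01ce
	git bisect good <last-good-Sep-03-commit>
	(build and boot each kernel git proposes, then mark it with
	 "git bisect good" or "git bisect bad" until the culprit is found)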

Cheers,
Gabriel

Re: [PATCH v2] powerpc/mpc5xxx: Avoid dereferencing potentially freed memory

2015-10-16 Thread Gabriel Paubert
On Fri, Oct 16, 2015 at 08:20:13AM +0200, Christophe JAILLET wrote:
> Le 15/10/2015 08:36, Michael Ellerman a écrit :
> >On Thu, 2015-10-15 at 07:56 +0200, Christophe JAILLET wrote:
> >>Use 'of_property_read_u32()' instead of 'of_get_property()'+pointer
> >>dereference in order to avoid access to potentially freed memory.
> >>
> >>Use 'of_get_next_parent()' to simplify the while() loop and avoid the
> >>need of a temp variable.
> >>
> >>Signed-off-by: Christophe JAILLET 
> >>---
> >>v2: Use of_property_read_u32 instead of of_get_property+pointer dereference
> >>*** Untested ***
> >Thanks.
> >
> >Can someone with an mpc5xxx test this?
> >
> >cheers
> >
> 
> Hi,
> I don't think it is an issue, but while looking at another similar
> patch, I noticed that the proposed patch adds a call to
> be32_to_cpup() (within of_property_read_u32).
> Apparently, powerPC is a BE architecture, so this call should be a no-op.

Sadly no more. 32 bit is BE only, but 64 bit can be either BEtter or
LEsser.

Gabriel

Re: [PATCH v2 1/2] powerpc/selftest: Add gettimeofday() benchmark

2015-09-25 Thread Gabriel Paubert
On Fri, Sep 25, 2015 at 12:28:30PM +0300, Denis Kirjanov wrote:
> On 9/25/15, Arnd Bergmann  wrote:
> > On Friday 25 September 2015 14:01:39 Michael Neuling wrote:
> >> This adds a benchmark directory to the powerpc selftests and adds a
> >> gettimeofday() benchmark to it.
> >>
> >> Suggested-by: Michael Ellerman 
> >> Signed-off-by: Michael Neuling 
> >>
> >
> > Any reason for keeping this powerpc specific? It seems generally useful.
> > and portable.
> You're right. Moreover, we can put some comment to the benchmark why
> we've made such decision to add it (reference to the commit
> "powerpc/vdso: Avoid link stack corruption in __get_datapage()")

Why gettimeofday? Isn't clock_gettime the modern variant?
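
Something like this for the timing, for instance (a sketch; needs
<time.h>, and on Linux clock_gettime() also goes through the VDSO):

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}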

BTW: does anyone receive 2 copies of every message in this thread?

I do, and I suspect that this is due to the Cc: list having both
linuxppc-...@ozlabs.org and linuxppc-dev@lists.ozlabs.org. I removed the
former for this reply.

Gabriel

Re: [RFC 5/8] powerpc/slb: Add documentation to runtime patching of SLB encoding

2015-07-22 Thread Gabriel Paubert
On Wed, Jul 22, 2015 at 03:51:03PM +1000, Michael Ellerman wrote:
 On Tue, 2015-07-21 at 12:28 +0530, Anshuman Khandual wrote:
  From: khand...@linux.vnet.ibm.com khand...@linux.vnet.ibm.com
  
  This patch adds some documentation to 'patch_slb_encoding' function
  explaining about how it clears the existing immediate value in the
  given instruction and inserts a new one there.
  
  diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
  index dcba4c2..8083a9e 100644
  --- a/arch/powerpc/mm/slb.c
  +++ b/arch/powerpc/mm/slb.c
  @@ -278,7 +278,13 @@ void switch_slb(struct task_struct *tsk, struct 
  mm_struct *mm)
   static inline void patch_slb_encoding(unsigned int *insn_addr,
unsigned int immed)
   {
  -   int insn = (*insn_addr & 0xffff0000) | immed;
  +   /*
  +* Currently this patches only li and cmpldi
  +* instructions with an immediate value. Here it
  +* just clears the existing immediate value from
  +* the instruction and inserts a new one there.
  +*/
  +   unsigned int insn = (*insn_addr & 0xffff0000) | immed;
  patch_instruction(insn_addr, insn);
   }
 
 
 How about:
 
   /*
* This function patches either an li or a cmpldi instruction with
* a new immediate value. This relies on the fact that both li
* (which is actually ori) and cmpldi both take a 16-bit immediate

Hmm, li is actually encoded as addi with r0 as source register...

* value, and it is situated in the same location in the instruction,
* ie. bits 0-15.

In PPC documentation, it's rather bits 16-31 (big endian bit order).
Or say lower half which is endian agnostic.

Cheers,
Gabriel

Re: [PATCH v4 2/2] selftests/powerpc: Add a test of the switch_endian() syscall

2015-03-28 Thread Gabriel Paubert
On Sat, Mar 28, 2015 at 12:19:10PM +1100, Michael Ellerman wrote:
 This adds a test of the switch_endian() syscall we added in the previous
 commit.
 
 We test it by calling the endian switch syscall, and then executing some
 code in the other endian to check everything went as expected. That code
 checks registers we expect to be maintained are. If the endian switch
 failed to happen that code sequence will be illegal and cause the test
 to abort.
 
 We then switch back to the original endian, do the same checks and
 finally write a success message and exit(0).
 
 Signed-off-by: Michael Ellerman m...@ellerman.id.au
 
 ---
 v3: Have the test switch back to the original endian.
 v4: Add .gitignore.
 Drop the message write in the checking code - it clobbers some regs
 and breaks the second check.
 
  tools/testing/selftests/powerpc/Makefile   |  2 +-
  .../selftests/powerpc/switch_endian/.gitignore |  2 +
  .../selftests/powerpc/switch_endian/Makefile   | 23 +
  .../selftests/powerpc/switch_endian/check.S| 98 
 ++
  .../selftests/powerpc/switch_endian/common.h   |  6 ++
  .../powerpc/switch_endian/switch_endian_test.S | 82 ++
  6 files changed, 212 insertions(+), 1 deletion(-)
  create mode 100644 tools/testing/selftests/powerpc/switch_endian/.gitignore
  create mode 100644 tools/testing/selftests/powerpc/switch_endian/Makefile
  create mode 100644 tools/testing/selftests/powerpc/switch_endian/check.S
  create mode 100644 tools/testing/selftests/powerpc/switch_endian/common.h
  create mode 100644 
 tools/testing/selftests/powerpc/switch_endian/switch_endian_test.S

[snipped]
 diff --git 
 a/tools/testing/selftests/powerpc/switch_endian/switch_endian_test.S 
 b/tools/testing/selftests/powerpc/switch_endian/switch_endian_test.S
 new file mode 100644
 index ..eee0e393600b
 --- /dev/null
 +++ b/tools/testing/selftests/powerpc/switch_endian/switch_endian_test.S
 @@ -0,0 +1,82 @@
 +#include "common.h"
 +
 + .data
 + .balign 8
 +message:
 + .ascii "success: switch_endian_test\n\0"
 +
 + .section .toc
 + .balign 8
 +pattern:
 + .llong 0x
 +
 + .text
 +FUNC_START(_start)
 + /* Load the pattern */
 + ld  r15, pattern@TOC(%r2)
 +
 + /* Setup CR, only CR2-CR4 are maintained */
 + lis r3, 0x00FF
 + ori r3, r3, 0xF000
 + mtcr    r3
 +
 + /* Load the pattern slightly modified into the registers */
 + mr      r3, r15
 + addi    r4, r15, 4
 +
 + addi    r5, r15, 32
 + mtlr    r5
 +
 + addi    r5, r15, 5
 + addi    r6, r15, 6
 + addi    r7, r15, 7
 + addi    r8, r15, 8
 +
 + /* r9 - r12 are clobbered */
 +
 + addi    r13, r15, 13
 + addi    r14, r15, 14
 +
 + /* Skip r15 we're using it */
 +
 + addi    r16, r15, 16
 + addi    r16, r15, 16

Duplicate?


 + addi    r17, r15, 17
 + addi    r18, r15, 18
 + addi    r19, r15, 19
 + addi    r20, r15, 20
 + addi    r21, r15, 21
 + addi    r22, r15, 22
 + addi    r23, r15, 23
 + addi    r24, r15, 24
 + addi    r25, r15, 25
 + addi    r26, r15, 26
 + addi    r27, r15, 27
 + addi    r28, r15, 28
 + addi    r29, r15, 29
 + addi    r30, r15, 30
 + addi    r31, r15, 31
 +
 + /*
 +  * Call the syscall to switch endian.
 +  * It clobbers r9-r12, XER, CTR and CR0-1,5-7.
 +  */
 + li r0, __NR_switch_endian
 + sc
 +
 +#include "check-reversed.S"
 +
 + /* Flip back, r0 already has the switch syscall number */
 + .long   0x0244  /* sc */
 +
 +#include "check.S"
 +
 + li  r0, __NR_write
 + li  r3, 1   /* stdout */
 + ld  r4, message@got(%r2)
 + li  r5, 28  /* strlen(message3) */
 + sc
 + li  r0, __NR_exit
 + li  r3, 0
 + sc
 + b   .
 -- 
 2.1.0

Gabriel

Re: AW: SPE Interrupt context (was how to make use of SPE instructions)

2015-01-30 Thread Gabriel Paubert
On Fri, Jan 30, 2015 at 05:37:29AM +, Markus Stockhausen wrote:
  Von: Scott Wood [scottw...@freescale.com]
  Gesendet: Freitag, 30. Januar 2015 01:49
  An: Markus Stockhausen
  Cc: Michael Ellerman; linuxppc-dev@lists.ozlabs.org; Herbert Xu
  Betreff: Re: AW: SPE  Interrupt context (was how to make use of SPE 
  instructions)
  
  On Wed, 2015-01-28 at 05:00 +, Markus Stockhausen wrote:
 Von: Scott Wood [scottw...@freescale.com]
 Gesendet: Mittwoch, 28. Januar 2015 05:21
 An: Markus Stockhausen
 Cc: Michael Ellerman; linuxppc-dev@lists.ozlabs.org; Herbert Xu
 Betreff: Re: SPE  Interrupt context (was how to make use of SPE 
 instructions)

 Hi Scott,

 thanks for your helpful feedback. As you might have seen I sent a 
 first
 patch for the sha256 kernel module that takes care about preemption.

 Herbert Xu noticed that my module won't run in for IPsec as all
 work will be done from interrupt context. Do you have a tip how I can
 mitigate the check I implemented:

 static bool spe_usable(void)
 {
   return !in_interrupt();
 }

 Intel guys have something like that

 bool irq_fpu_usable(void)
 {
   return !in_interrupt() ||
 interrupted_user_mode() ||
 interrupted_kernel_fpu_idle();
 }

 But I have no idea how to transfer it to the PPC/SPE case.
   
I'm not sure what sort of tip you're looking for, other than
implementing it myself. :-)
  
   Hi Scott,
  
   maybe I did not explain it correctly. interrupted_kernel_fpu_idle()
   is x86 specific. The same applies to interrupted_user_mode().
   I'm just searching for a similar feature in the PPC/SPE world.
  
  There isn't one.
  
   I can see that enable_kernel_spe() does something with the
   MSR_SPE flag, but I have no idea  how to determine if I'm allowed
   to enable SPE although I'm inside an interrupt context.
  
  As with x86, you'd want to check whether the kernel interrupted
  userspace.  I don't know what x86 is doing with TS, but on PPC you might
  check whether the interrupted thread had MSR_FP enabled.
  
   I'm asking because from the previous posts I conclude that
   running SPE instructions inside an interrupt might be critical.
   Because of registers not being saved?
  
  Yes.  Currently callers of enable_kernel_spe() only need to disable
  preemption, not interrupts.
  
   Or can I just save the register contents myself and interrupt
   context is no longer a showstopper?
  
  If you only need a small number of registers that might be reasonable,
  but if you need a bunch then you don't want to save them when you don't
  have to.
  
  Another option is to change enable_kernel_spe() to require interrupts to
  be disabled.
 
 Phew, that is going deeper than I expected. 
 
 I'm a newbie in the topic of interrupts and FPU/SPE registers. Nevertheless
 enforcing enable_kernel_spe() to only be available outside of interrupt
 context sounds too restrictive for me. Also checking for thread/CPU flags 
 of an interrupted process is nothing I can or want to implement. There
 might be the risk that I'm starting something that will be too complex
 for me.
 
 BUT! Given the fact that SPE registers are only extended GPRs and my
 algorithm needs just 10 of them I can live with the following design.
 
 - I must already save several non-volatile registers. Putting the 64 bit 
 values 
 into them would require me to save their contents with evstdd instead of 
 stw. Of course stack alignment to 8 bytes required. So only a few alignment
 instructions needed additionally during initialization.

On most PPC ABIs the stack is guaranteed to be aligned to a 16 byte
boundary. In some it may be only 8, but I can't remember any with only
4 byte alignment.

I checked my 32 bit kernel images with:

objdump -d vmlinux |awk '/stwu.*r1,/{print $6,$7}'|sort -u

and the stack seems to always be 16 byte aligned.
For 64 bit, use stdu instead of stwu.

I've also found a few stwux/stdux which are hopefully known
to be harmless.

 
 - During function cleanup I will restore the registers the same way.
 
 - In case I interrupted myself, I might have saved sensitive data of another 
 thread on my stack. So I will zero that area after I restored the registers.
 That needs an additional 10 instructions. In contrast to ~2000 instructions
 for one sha256 round that should be neglectable.
 
 This little overhead will save me lots of trouble at other locations:
 
 - I can avoid checking for an interrupt context.
 
 - I don't need a fallback to the generic implementation. 
 
 Thinking about it more and more, I think performance will stay the same. 
 Can you confirm that this will work? If yes I will send a v2 patch.
 
 Markus

Gabriel


Re: AW: SPE Interrupt context (was how to make use of SPE instructions)

2015-01-30 Thread Gabriel Paubert
On Fri, Jan 30, 2015 at 09:39:41AM +, Markus Stockhausen wrote:
  Von: Gabriel Paubert [paub...@iram.es]
  Gesendet: Freitag, 30. Januar 2015 09:49
  An: Markus Stockhausen
  Cc: Scott Wood; linuxppc-dev@lists.ozlabs.org; Herbert Xu
  Betreff: Re: AW: SPE  Interrupt context (was how to make use of SPE 
  instructions)
 
   ...
   - I must already save several non-volatile registers. Putting the 64 bit 
   values
   into them would require me to save their contents with evstdd instead of
   stw. Of course stack alignment to 8 bytes required. So only a few 
   alignment
   instructions needed additionally during initialization.
  
  On most PPC ABI the stack is guaranteed to be aligned to a 16 byte
  boundary. In some it may be only 8, but I can't remember any 4 byte
  only alignment.
  
  I checked my 32 bit kernel images with:
  
  objdump -d vmlinux |awk '/stwu.*r1,/{print $6,$7}'|sort -u
  
  and the stack seems to always be 16 byte aligned.
  For 64 bit, use stdu instead of stwu.
  
  I've also found a few stwux/stdux which are hopefully known
  to be harmless.
 
  Gabriel
 
 A helpful annotation. But now I'm unsure about function usage. SPE seems to be
 32bit only and I would use their evxxx instructions. Do you think the 
 following
 sequence will be the right way? 
 
 _GLOBAL(ppc_spe_sha256_transform)
   stwu    r1,-128(r1);    /* create stack frame   */
   stw r24,8(r1);  /* save normal registers*/
   stw r25,12(r1);   
   evstdw  r14,16(r1); /* We must save non volatile*/
   evstdw  r15,24(r1);/* registers. Take the chance   */
   evstdw  r16,32(r12);/* and save the SPE part too*/ \
   ...
   lwz r24,8(r1);  /* restore normal registers */ \
   lwz r25,12(r1);
   evldw   r14,16(r12); /* restore non-v. + SPE registers  */
   evldw   r15,24(r12);
   evldw   r16,32(r12);
   addi    r1,r1,128;  /* cleanup stack frame  */
 

Yes. But there is also probably a status/control register somewhere that
you might need to save/restore, unless it is never used or affected by the
instructions you use.

 Or must I use the kernel provided defines with PPC_STLU 
 r1,-INT_FRAME_SIZE(r1) 
 plus SAVE_GPR/SAVE_EVR/REST_GPR/REST_EVR?
 

From what I understand, INT_FRAME_SIZE is for interrupt entry code. This
is not the case for your code, which is a standard function except for
the fact that it clobbers the upper 32 bits of some registers by using
SPE instructions. Therefore INT_FRAME_SIZE is overkill. I also believe that
you can save the registers as you suggest, no need to split them into
the high and low parts.

By the way, I wonder where the SAVE_EVR/REST_EVR macros are used. I only
see the definitions, no use in a 3.18 source tree.

Gabriel

Re: [RFC PATCH 1/2]powerpc: foundation code to handle CR5 for local_t

2014-12-03 Thread Gabriel Paubert
On Wed, Dec 03, 2014 at 08:29:37PM +0530, Madhavan Srinivasan wrote:
 On Tuesday 02 December 2014 03:05 AM, Gabriel Paubert wrote:
  On Thu, Nov 27, 2014 at 05:48:40PM +0530, Madhavan Srinivasan wrote:
  diff --git a/arch/powerpc/include/asm/exception-64s.h 
  b/arch/powerpc/include/asm/exception-64s.h
  index 77f52b2..c42919a 100644
  --- a/arch/powerpc/include/asm/exception-64s.h
  +++ b/arch/powerpc/include/asm/exception-64s.h
  @@ -306,7 +306,26 @@ do_kvm_##n:   
  \
 std r10,0(r1);  /* make stack chain pointer */ \
 std r0,GPR0(r1);/* save r0 in stackframe*/ \
 std r10,GPR1(r1);   /* save r1 in stackframe*/ \
  -  beq 4f; /* if from kernel mode  */ \
  +BEGIN_FTR_SECTION;
 \
  +  lis r9,4096;/* Create a mask with HV and PR */ \
  +  rldicr  r9,r9,32,31;/* bits, AND with the MSR   */ \
  +  mr  r10,r9; /* to check for Hyp state   */ \
  +  ori r9,r9,16384;   \
  +  and r9,r12,r9; \
  +  cmpd    cr3,r10,r9; 
 \
  +  beq cr3,66f;/* Jump if we come from Hyp mode*/ \
  +  mtcrf   0x04,r10;   /* Clear CR5 if coming from usr */ \
  
  I think you can do better than this, powerpc has a fantastic set
  of rotate and mask instructions. If I understand correctly your
  code you can replace it with the following:
  
  rldicl  r10,r12,4,63   /* Extract HV bit to LSB of r10*/
  rlwinm  r9,r12,19,0x02 /* Extract PR bit to 2nd to last bit of r9 */
or      r9,r9,r10
  cmplwi  cr3,r9,1   /* Check for HV=1 and PR=0 */
  beq cr3,66f
  mtcrf   0x04,r10   /* Bits going to cr5 bits are 0 in r10 */
  
 
 From previous comment, at this point, CR0 already has MSR[PR], and only
 HV need to be checked. So I guess there still room for optimization.
 But good to learn this seq.

Actually the sequence I suggested can even be shortened again since the or
is not necessary and you can use an rlwimi instead.

rldicl r9,r12,5,62  /*  r9 = 62 zeroes | HV | ?  */
rlwimi r9,r12,18,0x01   /*  r9 = 62 zeroes | HV | PR */
cmplwi cr3,r9,2 /* Check for HV=1 and PR=0 */

For 32 bit operands, rlwimi is a generic bit field to bit field move, but
GCC is (was, maybe GCC5 is better?) quite bad at recognizing opportunities
to use it.
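
A C-level example of such a bit field to bit field move (a sketch;
with a compiler that recognizes the idiom this is a single rlwimi on
32-bit PowerPC):

static unsigned int insert_field(unsigned int x, unsigned int y)
{
	const unsigned int mask = 0x00000ff0;

	/* Move bits 4..11 of y into bits 4..11 of x, keeping the
	 * other bits of x. */
	return (x & ~mask) | (y & mask);
}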

I'm not sure that it is better to have two separate branches, one for
testing PR and the other for testing HV. Through how many branches can
the hardware speculate? 

Unless I'm mistaken, this is code which is likely not to be in the
cache so I would optimize it first towards minimal code size.

Regards,
Gabriel

Re: [RFC PATCH 2/2]powerpc: rewrite local_* to use CR5 flag

2014-12-01 Thread Gabriel Paubert
On Thu, Nov 27, 2014 at 05:48:41PM +0530, Madhavan Srinivasan wrote:
 This patch re-writes the current local_* functions to CR5 based ones.
 Base flow for each function is 
 
 {
   set cr5(eq)
   load
   ..
   store
   clear cr5(eq)
 }
 
 Above set of instructions are followed by a fixup section which points
 to the entry of the function incase of interrupt in the flow. If the 
 interrupt happens to be after the store, we just continue to last 
 instruction in that block. 
 
 Currently only asm/local.h has been rewritten, and local64 is TODO.
 Also the entire change is only for PPC64.
 
 Signed-off-by: Madhavan Srinivasan ma...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/local.h | 306 
 +++
  1 file changed, 306 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/local.h 
 b/arch/powerpc/include/asm/local.h
 index b8da913..a26e5d3 100644
 --- a/arch/powerpc/include/asm/local.h
 +++ b/arch/powerpc/include/asm/local.h
 @@ -11,6 +11,310 @@ typedef struct
  
  #define LOCAL_INIT(i){ ATOMIC_LONG_INIT(i) }
  
 +#ifdef   CONFIG_PPC64
 +
 +static __inline__ long local_read(local_t *l)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%1)\n"
 +"3:  crclr   22\n"
 +"4:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,3b\n"
 +".previous\n"
 + : "=r" (t)
 + : "r" (&(l->a.counter)));
 +
 + return t;
 +}
 +
 +static __inline__ void local_set(local_t *l, long i)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%1)\n"
 +"3:" PPC405_ERR77(0,%2)
 +"4:" PPC_STL " %0,0(%2)\n"
 +"5:  crclr   22\n"
 +"6:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,1b\n"
 + PPC_LONG "4b,1b\n"
 + PPC_LONG "5b,5b\n"
 +".previous\n"
 + : "=r" (t)
 + : "r" ((i)), "r" (&(l->a.counter)));
 +}
 +

Apart from the other comments on bloat which can very likely
be removed by tracing backwards for a few instructions, removing
the exception table entries which are 2 or 4 (64 bit?) times as 
large as the instruction sequence, I don't understand at all why
you need these sequences for the local_read and local_set functions.

After all these are single instructions (why do you perform a read before
the write in set when the result of the read is never used?).

I believe read and set are better mapped to access_once (or assign_once
or whatever it's called after the recent discussion on linux-kernel).
You don't even need a memory barrier if it's for a single thread,
so you could get away with a single volatile access to the variables.
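
i.e. something along these lines (a sketch against the then-current
ACCESS_ONCE() macro):

static __inline__ long local_read(local_t *l)
{
	return ACCESS_ONCE(l->a.counter);	/* one plain load */
}

static __inline__ void local_set(local_t *l, long i)
{
	ACCESS_ONCE(l->a.counter) = i;		/* one plain store */
}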

For the other ones, I think that what you do is correct, except that
the workaround for PPC405 erratum 77 is not needed since this erratum
only affects the stwcx. instruction and the whole point of the patch
is to avoid the use of an l?arx/st?cx. pair.

Regards,
Gabriel

 +static __inline__ void local_add(long i, local_t *l)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%2)\n"
 +"3:  add     %0,%1,%0\n"
 +"4:" PPC405_ERR77(0,%2)
 +"5:" PPC_STL " %0,0(%2)\n"
 +"6:  crclr   22\n"
 +"7:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,1b\n"
 + PPC_LONG "4b,1b\n"
 + PPC_LONG "5b,1b\n"
 + PPC_LONG "6b,6b\n"
 +".previous\n"
 + : "=r" (t)
 + : "r" (i), "r" (&(l->a.counter)));
 +}
 +
 +static __inline__ void local_sub(long i, local_t *l)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%2)\n"
 +"3:  subf    %0,%1,%0\n"
 +"4:" PPC405_ERR77(0,%2)
 +"5:" PPC_STL " %0,0(%2)\n"
 +"6:  crclr   22\n"
 +"7:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,1b\n"
 + PPC_LONG "4b,1b\n"
 + PPC_LONG "5b,1b\n"
 + PPC_LONG "6b,6b\n"
 +".previous\n"
 + : "=r" (t)
 + : "r" (i), "r" (&(l->a.counter)));
 +}
 +
 +static __inline__ long local_add_return(long a, local_t *l)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%2)\n"
 +"3:  add     %0,%1,%0\n"
 +"4:" PPC405_ERR77(0,%2)
 +"5:" PPC_STL " %0,0(%2)\n"
 +"6:  crclr   22\n"
 +"7:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,1b\n"
 + PPC_LONG "4b,1b\n"
 + PPC_LONG "5b,1b\n"
 + PPC_LONG "6b,6b\n"
 +".previous\n"
 + : "=r" (t)
 + : "r" (a), "r" (&(l->a.counter))
 + : "cc", "memory");
 +
 + return t;
 +}
 +
 +
 +#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
 +
 +static __inline__ long local_sub_return(long a, local_t *l)
 +{
 + long t;
 +
 + __asm__ __volatile__(
 +"1:  crset   22\n"
 +"2:" PPC_LL " %0,0(%2)\n"
 +"3:  subf    %0,%1,%0\n"
 +"4:" PPC405_ERR77(0,%2)
 +"5:" PPC_STL " %0,0(%2)\n"
 +"6:  crclr   22\n"
 +"7:\n"
 +".section __ex_table,\"a\"\n"
 + PPC_LONG_ALIGN "\n"
 + PPC_LONG "2b,1b\n"
 + PPC_LONG "3b,1b\n"
 + PPC_LONG "4b,1b\n"
 + PPC_LONG 

Re: [RFC PATCH 1/2]powerpc: foundation code to handle CR5 for local_t

2014-12-01 Thread Gabriel Paubert
On Thu, Nov 27, 2014 at 05:48:40PM +0530, Madhavan Srinivasan wrote:
 This patch creates the infrastructure to handle the CR based 
 local_* atomic operations. Local atomic operations are fast 
 and highly reentrant per CPU counters.  Used for percpu 
 variable updates. Local atomic operations only guarantee 
 variable modification atomicity wrt the CPU which owns the
 data and these needs to be executed in a preemption safe way. 
 
 Here is the design of this patch. Since local_* operations 
 are only need to be atomic to interrupts (IIUC), patch uses 
 one of the Condition Register (CR) fields as a flag variable. When 
 entering the local_*, specific bit in the CR5 field is set
 and on exit, bit is cleared. CR bit checking is done in the
 interrupt return path. If CR5[EQ] bit set and if we return 
 to kernel, we reset to start of local_* operation.
 
 Reason for this approach is that, currently l[w/d]arx/st[w/d]cx.
 instruction pair is used for local_* operations, which are heavy 
 on cycle count and they don't support a local variant. So to 
 see whether the new implementation helps, used a modified 
 version of Rusty's benchmark code on local_t.   
 
 https://lkml.org/lkml/2008/12/16/450
 
 Modifications: 
  - increased the working set size from 1MB to 8MB,
  - removed cpu_local_inc test.
 
 Test ran 
 - on POWER8 1S Scale out System 2.0GHz
 - on OPAL v3 with v3.18-rc4 patch kernel as Host
 
 Here are the values with the patch.
 
 Time in ns per iteration
 
                inc   add   read  add_return
 atomic_long     67    67    18    69
 irqsave/rest    39    39    23    39
 trivalue        39    39    29    49
 local_t         26    26    24    26
 
 Since CR5 is used as a flag, have added CFLAGS to avoid CR5 
 for the kernel compilation and CR5 is zeroed at the kernel
 entry.  
 
 Tested the patch in a 
  - pSeries LPAR, 
  - Host with patched/unmodified guest kernel 
 
 To check whether userspace see any CR5 corruption, ran a simple
 test which does,
  - set CR5 field,
  - while(1)
- sleep or gettimeofday
- chk bit set
 
 Signed-off-by: Madhavan Srinivasan ma...@linux.vnet.ibm.com
 ---
 - I really appreciate feedback on the patchset.
 - Kindly comment if I should try with any other benchmark or
 workload to check the numbers.
 - Also, kindly recommend any known stress test for CR
 
  Makefile |   6 ++
  arch/powerpc/include/asm/exception-64s.h |  21 +-
  arch/powerpc/kernel/entry_64.S   | 106 
 ++-
  arch/powerpc/kernel/exceptions-64s.S |   2 +-
  arch/powerpc/kernel/head_64.S|   8 +++
  5 files changed, 138 insertions(+), 5 deletions(-)
 
 diff --git a/Makefile b/Makefile
 index 00d618b..2e271ad 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -706,6 +706,12 @@ endif
  
  KBUILD_CFLAGS   += $(call cc-option, -fno-var-tracking-assignments)
  
 +ifdef CONFIG_PPC64
 +# We need this flag to force compiler not to use CR5, since
 +# local_t type code is based on this.
 +KBUILD_CFLAGS   += -ffixed-cr5
 +endif
 +
  ifdef CONFIG_DEBUG_INFO
  ifdef CONFIG_DEBUG_INFO_SPLIT
  KBUILD_CFLAGS   += $(call cc-option, -gsplit-dwarf, -g)
 diff --git a/arch/powerpc/include/asm/exception-64s.h 
 b/arch/powerpc/include/asm/exception-64s.h
 index 77f52b2..c42919a 100644
 --- a/arch/powerpc/include/asm/exception-64s.h
 +++ b/arch/powerpc/include/asm/exception-64s.h
 @@ -306,7 +306,26 @@ do_kvm_##n:  
 \
   std r10,0(r1);  /* make stack chain pointer */ \
   std r0,GPR0(r1);/* save r0 in stackframe*/ \
   std r10,GPR1(r1);   /* save r1 in stackframe*/ \
 - beq 4f; /* if from kernel mode  */ \
 +BEGIN_FTR_SECTION;  \
 + lis r9,4096;/* Create a mask with HV and PR */ \
 + rldicr  r9,r9,32,31;/* bits, AND with the MSR   */ \
 + mr  r10,r9; /* to check for Hyp state   */ \
 + ori r9,r9,16384;   \
 + and r9,r12,r9; \
 + cmpd    cr3,r10,r9; 
\
 + beq cr3,66f;/* Jump if we come from Hyp mode*/ \
 + mtcrf   0x04,r10;   /* Clear CR5 if coming from usr */ \

I think you can do better than this, powerpc has a fantastic set
of rotate and mask instructions. If I understand correctly your
code you can replace it with the following:

rldicl  r10,r12,4,63   /* Extract HV bit to LSB of r10*/
rlwinm  r9,r12,19,0x02 /* Extract PR bit to 2nd to last bit of r9 */
or      r9,r9,r10
cmplwi  cr3,r9,1   /* Check for HV=1 and PR=0 */
beq cr3,66f

Re: deb-pkg: Add support for powerpc little endian

2014-09-05 Thread Gabriel Paubert
On Fri, Sep 05, 2014 at 03:28:47PM +1000, Michael Neuling wrote:
 The Debian powerpc little endian architecture is called ppc64le.  This

Huh? ppc64le or ppc64el?

 is the default architecture used by Ubuntu for powerpc.
 
 The below checks the kernel config to see if we are compiling little
 endian and sets the Debian arch appropriately.
 
 Signed-off-by: Michael Neuling mi...@neuling.org
 
 diff --git a/scripts/package/builddeb b/scripts/package/builddeb
 index 35d5a58..6f4a1af 100644
 --- a/scripts/package/builddeb
 +++ b/scripts/package/builddeb
 @@ -37,7 +37,7 @@ create_package() {
   s390*)
  debarch=s390$(grep -q CONFIG_64BIT=y $KCONFIG_CONFIG && echo x 
 || true) ;;
   ppc*)
 - debarch=powerpc ;;
 + debarch=$(grep -q CPU_LITTLE_ENDIAN=y $KCONFIG_CONFIG && echo 
 ppc64el || echo powerpc) ;;
   parisc*)
   debarch=hppa ;;
   mips*)


Gabriel

Re: TTM placement caching issue/questions

2014-09-04 Thread Gabriel Paubert
On Wed, Sep 03, 2014 at 10:36:57PM -0400, Jerome Glisse wrote:
 On Wed, Sep 03, 2014 at 10:31:18PM -0400, Jerome Glisse wrote:
  On Thu, Sep 04, 2014 at 12:25:23PM +1000, Benjamin Herrenschmidt wrote:
   On Wed, 2014-09-03 at 22:07 -0400, Jerome Glisse wrote:
   
 So in the meantime the attached patch should work, it just silently
 ignores the caching attribute request on non x86 instead of pretending
 that things are set up as expected and then later the radeon or nouveau
 hw unsetting the snoop bit.

It's not tested but i think it should work.
   
   I'm still getting placements with !CACHED going from bo_memcpy in
   ttm_io_prot() though ... I'm looking at filtering the placement
   attributes instead.
   
   Ben.
  
  Ok so this one should do the trick.
 
 Ok final version ... famous last word.
[snipped older version]
 From 236038e18dc303bb9aa877922e01963d3fb0b7af Mon Sep 17 00:00:00 2001
 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= jgli...@redhat.com
 Date: Wed, 3 Sep 2014 22:04:34 -0400
 Subject: [PATCH] drm/ttm: force cached mapping on non x86 platform.
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 
 People interested in providing uncached or write combined mapping
 on there architecture need to do the ground work inside there arch

s/there/their/g

 specific code to allow to break the linear kernel mapping so that
 page mapping attributes can be updated, in the meantime force cached
 mapping for non x86 architecture.
 
 Signed-off-by: Jérôme Glisse jgli...@redhat.com
 ---
  drivers/gpu/drm/radeon/radeon_ttm.c |  2 +-
  drivers/gpu/drm/ttm/ttm_bo.c|  2 +-
  drivers/gpu/drm/ttm/ttm_tt.c| 32 +---
  include/drm/ttm/ttm_bo_driver.h |  2 +-
  4 files changed, 24 insertions(+), 14 deletions(-)
 
 diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
 b/drivers/gpu/drm/radeon/radeon_ttm.c
 index 72afe82..4dd5060 100644
 --- a/drivers/gpu/drm/radeon/radeon_ttm.c
 +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
 @@ -304,7 +304,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
 *bo,
   return r;
   }
  
  - r = ttm_tt_set_placement_caching(bo->ttm, tmp_mem.placement);
  + r = ttm_tt_set_placement_caching(bo->ttm, &tmp_mem.placement);
   if (unlikely(r)) {
   goto out_cleanup;
   }
 diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
 index 3da89d5..4dc21c3 100644
 --- a/drivers/gpu/drm/ttm/ttm_bo.c
 +++ b/drivers/gpu/drm/ttm/ttm_bo.c
 @@ -305,7 +305,7 @@ static int ttm_bo_handle_move_mem(struct 
 ttm_buffer_object *bo,
   goto out_err;
   }
  
  - ret = ttm_tt_set_placement_caching(bo->ttm, mem->placement);
  + ret = ttm_tt_set_placement_caching(bo->ttm, &mem->placement);
   if (ret)
   goto out_err;
  
 diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
 index bf080ab..a0df803 100644
 --- a/drivers/gpu/drm/ttm/ttm_tt.c
 +++ b/drivers/gpu/drm/ttm/ttm_tt.c
 @@ -89,14 +89,6 @@ static inline int ttm_tt_set_page_caching(struct page *p,
  
   return ret;
  }
 -#else /* CONFIG_X86 */
 -static inline int ttm_tt_set_page_caching(struct page *p,
 -   enum ttm_caching_state c_old,
 -   enum ttm_caching_state c_new)
 -{
 - return 0;
 -}
 -#endif /* CONFIG_X86 */
  
  /*
   * Change caching policy for the linear kernel map
 @@ -149,19 +141,37 @@ out_err:
   return ret;
  }
  
 -int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t placement)
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
  {
   enum ttm_caching_state state;
  
  - if (placement & TTM_PL_FLAG_WC)
  + if (*placement & TTM_PL_FLAG_WC)
    state = tt_wc;
  - else if (placement & TTM_PL_FLAG_UNCACHED)
  + else if (*placement & TTM_PL_FLAG_UNCACHED)
   state = tt_uncached;
   else
   state = tt_cached;
  
   return ttm_tt_set_caching(ttm, state);
  }
 +#else /* CONFIG_X86 */
 +int ttm_tt_set_placement_caching(struct ttm_tt *ttm, uint32_t *placement)
 +{
  + if (*placement & (TTM_PL_TT | TTM_PL_FLAG_SYSTEM)) {
  + ttm->caching_state = tt_cached;
  + *placement &= ~TTM_PL_MASK_CACHING;
  + *placement |= TTM_PL_FLAG_CACHED;
  + } else {
  + if (*placement & TTM_PL_FLAG_WC)
  + ttm->caching_state = tt_wc;
  + else if (placement & TTM_PL_FLAG_UNCACHED)
  + ttm->caching_state = tt_uncached;
  + else
  + ttm->caching_state = tt_cached;
 + }
 + return 0;
 +}
 +#endif /* CONFIG_X86 */
  EXPORT_SYMBOL(ttm_tt_set_placement_caching);
  
  void ttm_tt_destroy(struct ttm_tt *ttm)
 diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
 index 

Re: power and percpu: Could we move the paca into the percpu area?

2014-06-13 Thread Gabriel Paubert
On Thu, Jun 12, 2014 at 07:26:39AM -0500, Segher Boessenkool wrote:
  Actually, from gcc/config/rs6000.h:
  
  /* 1 for registers that have pervasive standard uses
 and are not available for the register allocator.
 
 [snip]
 
  So cr5, which is number 73, is never used by gcc. 
 
 "Not available for RA" is not the same thing at all as "not used by GCC".
 For example, GPR1 and XER[CA] are also fixed regs.

Indeed, I should have been more clear, it is never explicitly reserved
by any ABI like GPR1 for the stack pointer nor used implicitly by any pattern
like the carry (which is also never allocated, but used or clobbered
by many patterns). However no machine description pattern uses cr5.

The line "cr5 is not supposed to be used" has always been a mystery
to me, but gcc has always obeyed this rule. If memory serves, it has
been in rs6000.h since I got my first PPC board in 1997.

 
 But, indeed, it does look like GCC doesn't use it.  It seems to me that
 some ABI forbade userland (or non-libraries or whatever) from using it.
 I'll see what I can find out.
 

Take what I say with a grain of salt, but I always had the impression that
it was a remnant from the first port of GCC to Power, which could well have 
been AIX. This first port might have been done by Richard Kenner.

Actually, a long time ago (1999-2000?) I toyed with the idea of using cr5 for 
soft masking of interrupts on PPC32. But I got distracted by other things
and never came around to it.

Gabriel

Re: power and percpu: Could we move the paca into the percpu area?

2014-06-11 Thread Gabriel Paubert
On Thu, Jun 12, 2014 at 06:22:11AM +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2014-06-11 at 14:37 -0500, Christoph Lameter wrote:
  Looking at arch/powerpc/include/asm/percpu.h I see that the per cpu offset
  comes from a local_paca field and local_paca is in r13. That means that
  for all percpu operations we first have to determine the address through a
  memory access.
  
  Would it be possible to put the paca at the beginning of the percpu data
  area and then have r31 point to the percpu area?
  
  power has these nice instructions that fetch from an offset relative to a
  base register which could be used throughout for percpu operations in the
  kernel (similar to x86 segment registers).
  
  With that we may also be able to use the atomic ops for fast percpu access
  so that we can avoid the irq enable/disable sequence that is now required
  for percpu atomics. Would result in fast and reliable percpu
  counters for powerpc.
 
 So this is complicated :) And it's something I did want to tackle
 for a while but haven't had a chance.
 
 The issues off the top of my head are:
 
  - The PACA must be accessible in real mode, which means that when
 running under a hypervisor, it must be allocated in the RMA which is
 the low part of memory up to a limit that depends on the hypervisor, but
 can be as low as 128M on some older machines.
 
  - However, we use percpu more than paca in normal kernel C code, the
 PACA is mostly used during exception entry/exit, KVM, and for interrupt
 soft-enable/disable. So it might make sense to change things so that r13
 contains the per-cpu offset instead. However, doing that change and
 updating the asm to cope isn't a trivial undertaking.
 
  - Direct offset from r13 in asm ... works as long as the offset is
 within the signed 32k range. Otherwise we need at least one more addis
 instruction. Anton mentioned the linker may have some smarts however for
 removing that addis if the high part of the offset happens to be 0.
 
  - For atomics, the jury is still out as to whether it would be faster
 or not. The atomic ops (lwarx/stwcx.) are expensive. They flush the
 value out of the L1 (to L2) among others. On the other hand we have
 interrupts soft-disable so masking interrupts isn't very expensive.
 Unmasking, while cheap, is currently out of line however. I have been
 wondering if we could move some of the soft-irq state instead to a CR
 field and mark that -ffixed with gcc so we can make irq
 soft-disable/enable even faster and more in-line.

Actually, from gcc/config/rs6000.h:

/* 1 for registers that have pervasive standard uses
   and are not available for the register allocator.

   On RS/6000, r1 is used for the stack.  On Darwin, r2 is available
   as a local register; for all other OS's r2 is the TOC pointer.

   cr5 is not supposed to be used.

   On System V implementations, r13 is fixed and not available for use.  */

#define FIXED_REGISTERS  \
  {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,  \
   /* AltiVec registers.  */   \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   1, 1\
   , 1, 1, 1, 1, 1, 1  \
}

So cr5, which is number 73, is never used by gcc. 
Disassembling a few kernels seems to confirm this.
This gives you 4 booleans to play with...
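
For illustration, reading one of those cr5 bits from C might look like the
sketch below (hypothetical: the helper name is mine, and it only makes sense
if cr5 is genuinely left untouched by compiler-generated code and by every
asm path in the kernel):

    /* cr5 occupies CR bits 20-23 in big-endian numbering, so cr5.lt is
     * CR bit 20; mfcr copies the whole CR into a GPR */
    static inline int cr5_lt(void)
    {
            unsigned int cr;

            asm volatile("mfcr %0" : "=r" (cr));
            return (cr >> (31 - 20)) & 1;
    }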

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang

2014-05-09 Thread Gabriel Paubert
On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
 On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
  I am seeing an issue where a CPU running perf eventually hangs.
  Traces show timer interrupts happening every 4 seconds even
  when a userspace task is running on the CPU.
 
 Is this by chance every 4.2 seconds?  The reason I ask is that
 Paul Clarke and I are seeing an interrupt every 4.2 seconds when
 he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

Hmmm, it's close to 2^32 nanoseconds (2^32 ns ≈ 4.295 s), isn't it suspicious?

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 3/3] perf: Use 64-bit value when comparing sample_regs

2014-03-06 Thread Gabriel Paubert
On Thu, Mar 06, 2014 at 09:44:47AM +0000, David Laight wrote:
 From: Sukadev Bhattiprolu
  When checking whether a bit representing a register is set in
  sample_regs, a 64-bit mask, use 64-bit value (1LL).
  
  Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
  ---
   tools/perf/util/unwind.c |4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/tools/perf/util/unwind.c b/tools/perf/util/unwind.c
  index 742f23b..2b888c6 100644
  --- a/tools/perf/util/unwind.c
  +++ b/tools/perf/util/unwind.c
  @@ -396,11 +396,11 @@ static int reg_value(unw_word_t *valp, struct 
  regs_dump *regs, int id,
   {
  int i, idx = 0;
  
  -   if (!(sample_regs & (1 << id)))
  +   if (!(sample_regs & (1LL << id)))
  return -EINVAL;
  
  for (i = 0; i < id; i++) {
  -   if (sample_regs & (1 << i))
  +   if (sample_regs & (1LL << i))
  idx++;
  }
 
 There are much faster ways to count the number of set bits, especially
 if you might need to check a significant number of bits.
 There might even be a function defined somewhere to do it.

Indeed, look for Hamming weight (hweight family of functions)
in asm/hweight.h and what is included from there.

Besides that, many modern processors also have a machine instruction
to perform this task. In the processor manuals the instruction is 
described as population count and the mnemonic starts with popcnt
on x86 and ppc.
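
To make that concrete, the quoted loop collapses to one population count over
the bits below `id`; a user-space sketch (the helper name is mine, and kernel
code would use the hweight helpers rather than the GCC builtin):

    #include <errno.h>
    #include <stdint.h>

    /* index of register `id` among the registers present in sample_regs:
     * a single popcount of the bits strictly below `id` replaces the loop */
    static int reg_index(uint64_t sample_regs, int id)
    {
            if (!(sample_regs & (1ULL << id)))
                    return -EINVAL;
            return __builtin_popcountll(sample_regs & ((1ULL << id) - 1));
    }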

Gabriel

 Basically you just add up the bits, for 16 bit it would be:
  val = (val & 0x5555) + ((val >> 1) & 0x5555);
  val = (val & 0x3333) + ((val >> 2) & 0x3333);
  val = (val & 0x0f0f) + ((val >> 4) & 0x0f0f);
  val = (val & 0x00ff) + ((val >> 8) & 0x00ff);
 As the size of the work increases the improvement is more significant.
 (Some of the later masking can probably be proven unnecessary.)
 
   David
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
 On Fri, 7 Feb 2014, Gabriel Paubert wrote:
 
  Hi Stephen,
  
  On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
   Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM:
   
From: Gabriel Paubert paub...@iram.es
To: Stephen N Chivers schiv...@csc.com.au
Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au
Date: 02/06/2014 07:26 PM
Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
 
 
 mask = 0;
 if (FM & (1 << 0))
 mask |= 0x0000000f;
 if (FM & (1 << 1))
 mask |= 0x000000f0;
 if (FM & (1 << 2))
 mask |= 0x00000f00;
 if (FM & (1 << 3))
 mask |= 0x0000f000;
 if (FM & (1 << 4))
 mask |= 0x000f0000;
 if (FM & (1 << 5))
 mask |= 0x00f00000;
 if (FM & (1 << 6))
 mask |= 0x0f000000;
 if (FM & (1 << 7))
 mask |= 0x90000000;
 
 With the above mask computation I get consistent results for 
 both the MPC8548 and MPC7410 boards.
 
 Am I missing something subtle?

No I think you are correct. This said, this code may probably be 
   optimized 
to eliminate a lot of the conditional branches. I think that:
 
 
 If the compiler is enabled to generate isel instructions, it would not 
 use a conditional branch for this code. (ignore the andi's values, 
 this is an old compile)
 
 c0037c2c <mtfsf>:
 c0037c2c:   2c 03 00 00 cmpwi   r3,0
 c0037c30:   41 82 01 1c beq-    c0037d4c <mtfsf+0x120>
 c0037c34:   2f 83 00 ff cmpwi   cr7,r3,255
 c0037c38:   41 9e 01 28 beq-    cr7,c0037d60 <mtfsf+0x134>
 c0037c3c:   70 66 00 01 andi.   r6,r3,1
 c0037c40:   3d 00 90 00 lis     r8,-28672
 c0037c44:   7d 20 40 9e iseleq  r9,r0,r8
 c0037c48:   70 6a 00 02 andi.   r10,r3,2
 c0037c4c:   65 28 0f 00 oris    r8,r9,3840
 c0037c50:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c54:   70 66 00 04 andi.   r6,r3,4
 c0037c58:   65 28 00 f0 oris    r8,r9,240
 c0037c5c:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c60:   70 6a 00 08 andi.   r10,r3,8
 c0037c64:   65 28 00 0f oris    r8,r9,15
 c0037c68:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c6c:   70 66 00 10 andi.   r6,r3,16
 c0037c70:   61 28 f0 00 ori     r8,r9,61440
 c0037c74:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c78:   70 6a 00 20 andi.   r10,r3,32
 c0037c7c:   61 28 0f 00 ori     r8,r9,3840
 c0037c80:   54 6a cf fe rlwinm  r10,r3,25,31,31
 c0037c84:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c88:   2f 8a 00 00 cmpwi   cr7,r10,0
 c0037c8c:   70 66 00 40 andi.   r6,r3,64
 c0037c90:   61 28 00 f0 ori     r8,r9,240
 c0037c94:   7d 29 40 9e iseleq  r9,r9,r8
 c0037c98:   41 9e 00 08 beq-    cr7,c0037ca0 <mtfsf+0x74>
 c0037c9c:   61 29 00 0f ori     r9,r9,15
   ...
  
 However, your other solutions are better.
 
 

mask = (FM & 1);
mask |= (FM << 3) & 0x10;
mask |= (FM << 6) & 0x100;
mask |= (FM << 9) & 0x1000;
mask |= (FM << 12) & 0x10000;
mask |= (FM << 15) & 0x100000;
mask |= (FM << 18) & 0x1000000;
mask |= (FM << 21) & 0x10000000;
mask *= 15;

should do the job, in less code space and without a single branch.

Each one of the mask |= lines should be translated into an
rlwinm instruction followed by an or. Actually it should be possible
to transform each of these lines into a single rlwimi instruction
but I don't know how to coerce gcc to reach this level of optimization.

Another way of optimizing this could be:

mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
mask *= 15;

 
 
  Ok, I finally edited my sources and test compiled the suggestions
  I gave. I'd say that method2 is the best overall indeed. You can
  actually save one more instruction by setting mask to all ones in 
  the case FM=0xff, but that's about all in this area.
 
 
 My measurements show method1 to be smaller and faster than method2 due 
 to the number of instructions needed to generate the constant masks in 
 method2, but it may depend upon your compiler and hardware.  Both are 
 faster than the original with isel.

Ok, if you have measured that method1 is faster than method2, let us go for it.
I believe method2 would be faster if you had a large out-of-order execution
window, because more parallelism can be extracted from it, but this is
probably only true for high end cores, which do not need FPU emulation
in the first place.

Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Mon, Feb 10, 2014 at 11:17:38AM +0000, David Laight wrote:
   However, your other solutions are better.
  
  
 
  mask = (FM & 1);
  mask |= (FM << 3) & 0x10;
  mask |= (FM << 6) & 0x100;
  mask |= (FM << 9) & 0x1000;
  mask |= (FM << 12) & 0x10000;
  mask |= (FM << 15) & 0x100000;
  mask |= (FM << 18) & 0x1000000;
  mask |= (FM << 21) & 0x10000000;
  mask *= 15;
 
  should do the job, in less code space and without a single branch.
 ...
  Another way of optimizing this could be:
 
  mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
  mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
  mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
  mask *= 15;
 ...
  Ok, if you have measured that method1 is faster than method2, let us go for 
  it.
  I believe method2 would be faster if you had a large out-of-order execution
  window, because more parallelism can be extracted from it, but this is 
  probably
  only true for high end cores, which do not need FPU emulation in the first 
  place.
 
 FWIW the second has a long dependency chain on 'mask', whereas the first can 
 execute
 the shift/and in any order and then merge the results.
 So on most superscalar cpu, or one with result delays for arithmetic, the 
 first
 is likely to be faster.

I disagree, perhaps mostly because the compiler is not clever enough. Right
now the code for solution 1 is as follows (actually I have rewritten the code
and it reads:

mask = (FM & 1)
| ((FM << 3) & 0x10)
| ((FM << 6) & 0x100)
| ((FM << 9) & 0x1000)
| ((FM << 12) & 0x10000)
| ((FM << 15) & 0x100000)
| ((FM << 18) & 0x1000000)
| ((FM << 21) & 0x10000000);
to avoid sequence points in case they hamper the compiler)

and the output is:

rlwinm 10,3,3,27,27  # D.11621, FM,,
rlwinm 9,3,6,23,23   # D.11621, FM,,
or 9,10,9            #, D.11621, D.11621, D.11621
rlwinm 10,3,0,31,31  # D.11621, FM,
or 9,9,10            #, D.11621, D.11621, D.11621
rlwinm 10,3,9,19,19  # D.11621, FM,,
or 9,9,10            #, D.11621, D.11621, D.11621
rlwinm 10,3,12,15,15 # D.11621, FM,,
or 9,9,10            #, D.11621, D.11621, D.11621
rlwinm 10,3,15,11,11 # D.11621, FM,,
or 9,9,10            #, D.11621, D.11621, D.11621
rlwinm 10,3,18,7,7   # D.11621, FM,,
or 9,9,10            #, D.11621, D.11621, D.11621
rlwinm 3,3,21,3,3    # D.11621, FM,,
or 9,9,3             #, mask, D.11621, D.11621
mulli 9,9,15         # mask, mask,

see that r9 is used 7 times as both input and output operand, plus
once for rlwinm. This gives a dependency length of 8 at least.

In the other case (I've deleted the code) the dependency length
was significantly shorter. In any case that one is fewer instructions, 
which is good for occasional use. 

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
On Mon, Feb 10, 2014 at 12:32:18PM +0000, David Laight wrote:
  I disagree, perhaps mostly because the compiler is not clever enough, but 
  right
  now the code for solution 1 is (actually I have rewritten the code
  and it reads:
  
  mask = (FM & 1)
  | ((FM << 3) & 0x10)
  | ((FM << 6) & 0x100)
  | ((FM << 9) & 0x1000)
  | ((FM << 12) & 0x10000)
  | ((FM << 15) & 0x100000)
  | ((FM << 18) & 0x1000000)
  | ((FM << 21) & 0x10000000);
  to avoid sequence points in case they hamper the compiler)
  
  and the output is:
  
  rlwinm 10,3,3,27,27  # D.11621, FM,,
  rlwinm 9,3,6,23,23   # D.11621, FM,,
  or 9,10,9            #, D.11621, D.11621, D.11621
  rlwinm 10,3,0,31,31  # D.11621, FM,
  or 9,9,10            #, D.11621, D.11621, D.11621
  rlwinm 10,3,9,19,19  # D.11621, FM,,
  or 9,9,10            #, D.11621, D.11621, D.11621
  rlwinm 10,3,12,15,15 # D.11621, FM,,
  or 9,9,10            #, D.11621, D.11621, D.11621
  rlwinm 10,3,15,11,11 # D.11621, FM,,
  or 9,9,10            #, D.11621, D.11621, D.11621
  rlwinm 10,3,18,7,7   # D.11621, FM,,
  or 9,9,10            #, D.11621, D.11621, D.11621
  rlwinm 3,3,21,3,3    # D.11621, FM,,
  or 9,9,3             #, mask, D.11621, D.11621
  mulli 9,9,15         # mask, mask,
  
  see that r9 is used 7 times as both input and output operand, plus
  once for rlwinm. This gives a dependency length of 8 at least.
  
  In the other case (I've deleted the code) the dependency length
  was significantly shorter. In any case that one is fewer instructions,
  which is good for occasional use.
 
 Hmmm... I hand-counted a dependency length of 8 for the other version.
 Maybe there are some ppc instructions that reduce it.

Either I misread the generated code or I got somewhat less.

What helps for method1 is the rotate and mask instructions of PPC. Each
left-shift-and-mask pair becomes a single rlwinm. 
 
 Stupid compiler :-)

Indeed. I've been trying to coerce it into generating rlwimi instructions
(in which case the whole building of the mask reduces to 8 assembly
instructions) and failed. It seems that the compiler lacks some patterns
that would directly map to rlwimi.

 Trouble is, I bet that even if you code it as:
   mask1 = (FM & 1) | ((FM << 3) & 0x10);
   mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
   mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
   mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
   mask1 |= mask2;
   mask3 |= mask4;
   mask = mask1 | mask3;
 the compiler will 'optimise' it to the above before code generation.

Indeed it's what it does :-(

I believe that the current suggestion is good enough.

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-10 Thread Gabriel Paubert
Hi James,

On Mon, Feb 10, 2014 at 11:03:07AM -0600, James Yang wrote:

[snipped]
  Ok, if you have measured that method1 is faster than method2, let us go for 
  it.
  I believe method2 would be faster if you had a large out-of-order execution
  window, because more parallelism can be extracted from it, but this is 
  probably
  only true for high end cores, which do not need FPU emulation in the first 
  place.
 
 Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
 compiler generated.   
 

Then it is method1.

  
  The other place where we can optimize is the generation of FEX. Here is 
  my current patch:
  
  
  diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
  index dbce92e..b57b3fa8 100644
  --- a/arch/powerpc/math-emu/mtfsf.c
  +++ b/arch/powerpc/math-emu/mtfsf.c
  @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
  u32 mask;
  u32 fpscr;
   
  -   if (FM == 0)
  +   if (likely(FM == 0xff))
  +           mask = 0xffffffff;
  +   else if (unlikely(FM == 0))
              return 0;
  -
  -   if (FM == 0xff)
  -           mask = 0x9fffffff;
      else {
  -           mask = 0;
  -           if (FM & (1 << 0))
  -                   mask |= 0x90000000;
  -           if (FM & (1 << 1))
  -                   mask |= 0x0f000000;
  -           if (FM & (1 << 2))
  -                   mask |= 0x00f00000;
  -           if (FM & (1 << 3))
  -                   mask |= 0x000f0000;
  -           if (FM & (1 << 4))
  -                   mask |= 0x0000f000;
  -           if (FM & (1 << 5))
  -                   mask |= 0x00000f00;
  -           if (FM & (1 << 6))
  -                   mask |= 0x000000f0;
  -           if (FM & (1 << 7))
  -                   mask |= 0x0000000f;
  +           mask = (FM & 1);
  +           mask |= (FM << 3) & 0x10;
  +           mask |= (FM << 6) & 0x100;
  +           mask |= (FM << 9) & 0x1000;
  +           mask |= (FM << 12) & 0x10000;
  +           mask |= (FM << 15) & 0x100000;
  +           mask |= (FM << 18) & 0x1000000;
  +           mask |= (FM << 21) & 0x10000000;
  +           mask *= 15;
 
 
 Needs to also mask out bits 1 and 2, they aren't to be set from frB.
 
   mask &= 0x9FFFFFFF;
 
 

Look at the following lines:

 
 
  }
   
  -   __FPU_FPSCR &= ~(mask);
  -   __FPU_FPSCR |= (frB[1] & mask);
  +   fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
  +           ~(FPSCR_VX | FPSCR_FEX);
 
It's here (masking FPSCR_VX and FPSCR_FEX).

Actually the previous code was redundant, it cleared FEX and VX in the
mask computation and later again when recomputing them. Clearing them
once should be enough.

   
  -   __FPU_FPSCR &= ~(FPSCR_VX);
  -   if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
  +   if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
               FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
               FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
  -           __FPU_FPSCR |= FPSCR_VX;
  -
  -   fpscr = __FPU_FPSCR;
  -   fpscr &= ~(FPSCR_FEX);
  -   if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
  -       ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
  -       ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
  -       ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
  -       ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
  -           fpscr |= FPSCR_FEX;
  +   fpscr |= FPSCR_VX;
  +
  +   /* The bit order of exception enables and exception status
  +    * is the same. Simply shift and mask to check for enabled
  +    * exceptions.
  +    */
  +   if (fpscr & (fpscr >> 22) & 0xf8) fpscr |= FPSCR_FEX;
  __FPU_FPSCR = fpscr;
   
   #ifdef DEBUG
   mtfsf.c |   57 ++---
   1 file changed, 22 insertions(+), 35 deletions(-)
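
For a quick sanity check of the shift relation the new comment relies on, the
constants (as in the kernel's asm/reg.h) can be verified on any host; a
standalone sketch, not part of the patch:

    #include <assert.h>

    #define FPSCR_VX 0x20000000
    #define FPSCR_OX 0x10000000
    #define FPSCR_UX 0x08000000
    #define FPSCR_ZX 0x04000000
    #define FPSCR_XX 0x02000000
    #define FPSCR_VE 0x00000080
    #define FPSCR_OE 0x00000040
    #define FPSCR_UE 0x00000020
    #define FPSCR_ZE 0x00000010
    #define FPSCR_XE 0x00000008

    int main(void)
    {
            /* each status bit, shifted right by 22, lands on its enable
             * bit, so fpscr & (fpscr >> 22) & 0xf8 is nonzero iff some
             * enabled exception has its status bit set */
            assert(FPSCR_VX >> 22 == FPSCR_VE);
            assert(FPSCR_OX >> 22 == FPSCR_OE);
            assert(FPSCR_UX >> 22 == FPSCR_UE);
            assert(FPSCR_ZX >> 22 == FPSCR_ZE);
            assert(FPSCR_XX >> 22 == FPSCR_XE);
            return 0;
    }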
  
  
  Notes: 
  
1) I'm really unsure on whether 0xff is frequent or not. So the likely()
statement at the beginning may be wrong. Actually, if it is not very likely,
it might be better to remove the special case for FM==0xff. A look at 
GCC sources shows that it never generates a mask of 0xff. From glibc
sources, the vast majority of cases uses 0x1; only isnan() uses 0xff.
 
 Can't handle all cases here.  

That's why I would go for the simplest possible code. Conditionals are
expensive and minimizing cache footprint is often the best measure of 
performance for infrequently used code. With this in mind, I would get 
rid of all the tests for special FM values and rely on the optimized
generic case.


  
  2) it may be better to remove the check for FM==0, after all, the 
  instruction
  effectively becomes a nop, and generating the instruction in the first place
  would be too stupid for words.
 
 Hmm a heavy no-op.  I wonder if it is heavier than a sync.

In theory not. It contains the equivalent of several isync (taking an exception
and returning from it), but not any synchronization wrt the memory accesses.

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org

Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-07 Thread Gabriel Paubert
Hi Stephen,

On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
 Gabriel Paubert paub...@iram.es wrote on 02/06/2014 07:26:37 PM:
 
  From: Gabriel Paubert paub...@iram.es
  To: Stephen N Chivers schiv...@csc.com.au
  Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor cproc...@csc.com.au
  Date: 02/06/2014 07:26 PM
  Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
  
  On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
   I have a MPC8548e based board and an application that makes
   extensive use of floating point including numerous calls to cos.
   In the same program there is the use of an sqlite database.
   
   The kernel is derived from 2.6.31 and is compiled with math emulation.
   
   At some point after the reading of the SQLITE database, the
   return result from cos goes from an in range value to an out
   of range value.
   
   This is as a result of the FP rounding mode mutating from round to 
   nearest
   to round toward zero.
   
   The cos in the glibc version being used is known to be sensitive to 
   rounding
   direction and Joseph Myers has previously fixed glibc.
   
   The failure does not occur on a machine that has a hardware floating
   point unit (a MPC7410 processor).
   
   I have traced the mutation to the following series of instructions:
   
    mffs    f0
    mtfsb1  4*cr7+so
    mtfsb0  4*cr7+eq
    fadd    f13,f1,f2
    mtfsf   1, f0
   
   The instructions are part of the stream emitted by gcc for the 
 conversion
   of a 128 bit floating point value into an integer in the sqlite 
 database 
   read.
   
   Immediately before the execution of the mffs instruction the rounding
   mode is round to nearest.
   
   On the MPC8548 board, the execution of the mtfsf instruction does not
   restore the rounding mode to round to nearest.
   
   I believe that the mask computation in mtfsf.c is incorrect and is 
   reversed.
   
   In the latest version of the file (linux-3.14-rc1), the mask is 
 computed 
   by:
   
  mask = 0;
  if (FM & (1 << 0))
     mask |= 0x90000000;
  if (FM & (1 << 1))
     mask |= 0x0f000000;
  if (FM & (1 << 2))
     mask |= 0x00f00000;
  if (FM & (1 << 3))
     mask |= 0x000f0000;
  if (FM & (1 << 4))
     mask |= 0x0000f000;
  if (FM & (1 << 5))
     mask |= 0x00000f00;
  if (FM & (1 << 6))
     mask |= 0x000000f0;
  if (FM & (1 << 7))
     mask |= 0x0000000f;
   
   I think it should be:
   
   mask = 0;
   if (FM & (1 << 0))
   mask |= 0x0000000f;
   if (FM & (1 << 1))
   mask |= 0x000000f0;
   if (FM & (1 << 2))
   mask |= 0x00000f00;
   if (FM & (1 << 3))
   mask |= 0x0000f000;
   if (FM & (1 << 4))
   mask |= 0x000f0000;
   if (FM & (1 << 5))
   mask |= 0x00f00000;
   if (FM & (1 << 6))
   mask |= 0x0f000000;
   if (FM & (1 << 7))
   mask |= 0x90000000;
   
   With the above mask computation I get consistent results for both the 
   MPC8548
   and MPC7410 boards.
   
   Am I missing something subtle?
  
  No I think you are correct. This said, this code may probably be 
 optimized 
  to eliminate a lot of the conditional branches. I think that:
  
  mask = (FM & 1);
  mask |= (FM << 3) & 0x10;
  mask |= (FM << 6) & 0x100;
  mask |= (FM << 9) & 0x1000;
  mask |= (FM << 12) & 0x10000;
  mask |= (FM << 15) & 0x100000;
  mask |= (FM << 18) & 0x1000000;
  mask |= (FM << 21) & 0x10000000;
  mask *= 15;
  
  should do the job, in less code space and without a single branch.
  
  Each one of the mask |= lines should be translated into an
  rlwinm instruction followed by an or. Actually it should be possible
  to transform each of these lines into a single rlwimi instruction
  but I don't know how to coerce gcc to reach this level of optimization.
  
  Another way of optimizing this could be:
  
  mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
  mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
  mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
  mask *= 15;
  
  It's not easy to see which of the solutions is faster, the second one
  needs to create quite a few constants, but its dependency length is 
  lower. It is very likely that the first solution is faster in cache-cold
  case and the second in cache-hot. 
  
  Regardless, the original code is rather naïve, larger and slower in all 
 cases,
  with timing variation depending on branch mispredictions.

Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?

2014-02-06 Thread Gabriel Paubert
On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
 I have a MPC8548e based board and an application that makes
 extensive use of floating point including numerous calls to cos.
 In the same program there is the use of an sqlite database.
 
 The kernel is derived from 2.6.31 and is compiled with math emulation.
 
 At some point after the reading of the SQLITE database, the
 return result from cos goes from an in range value to an out
 of range value.
 
 This is as a result of the FP rounding mode mutating from round to 
 nearest
 to round toward zero.
 
 The cos in the glibc version being used is known to be sensitive to 
 rounding
 direction and Joseph Myers has previously fixed glibc.
 
 The failure does not occur on a machine that has a hardware floating
 point unit (a MPC7410 processor).
 
 I have traced the mutation to the following series of instructions:
 
  mffs    f0
  mtfsb1  4*cr7+so
  mtfsb0  4*cr7+eq
  fadd    f13,f1,f2
  mtfsf   1, f0
 
 The instructions are part of the stream emitted by gcc for the conversion
 of a 128 bit floating point value into an integer in the sqlite database 
 read.
 
 Immediately before the execution of the mffs instruction the rounding
 mode is round to nearest.
 
 On the MPC8548 board, the execution of the mtfsf instruction does not
 restore the rounding mode to round to nearest.
 
 I believe that the mask computation in mtfsf.c is incorrect and is 
 reversed.
 
 In the latest version of the file (linux-3.14-rc1), the mask is computed 
 by:
 
  mask = 0;
  if (FM & (1 << 0))
     mask |= 0x90000000;
  if (FM & (1 << 1))
     mask |= 0x0f000000;
  if (FM & (1 << 2))
     mask |= 0x00f00000;
  if (FM & (1 << 3))
     mask |= 0x000f0000;
  if (FM & (1 << 4))
     mask |= 0x0000f000;
  if (FM & (1 << 5))
     mask |= 0x00000f00;
  if (FM & (1 << 6))
     mask |= 0x000000f0;
  if (FM & (1 << 7))
     mask |= 0x0000000f;
 
 I think it should be:
 
  mask = 0;
  if (FM & (1 << 0))
  mask |= 0x0000000f;
  if (FM & (1 << 1))
  mask |= 0x000000f0;
  if (FM & (1 << 2))
  mask |= 0x00000f00;
  if (FM & (1 << 3))
  mask |= 0x0000f000;
  if (FM & (1 << 4))
  mask |= 0x000f0000;
  if (FM & (1 << 5))
  mask |= 0x00f00000;
  if (FM & (1 << 6))
  mask |= 0x0f000000;
  if (FM & (1 << 7))
  mask |= 0x90000000;
 
 With the above mask computation I get consistent results for both the 
 MPC8548
 and MPC7410 boards.
 
 Am I missing something subtle?

No I think you are correct. This said, this code may probably be optimized 
to eliminate a lot of the conditional branches. I think that:

mask = (FM & 1);
mask |= (FM << 3) & 0x10;
mask |= (FM << 6) & 0x100;
mask |= (FM << 9) & 0x1000;
mask |= (FM << 12) & 0x10000;
mask |= (FM << 15) & 0x100000;
mask |= (FM << 18) & 0x1000000;
mask |= (FM << 21) & 0x10000000;
mask *= 15;

should do the job, in less code space and without a single branch.

Each one of the mask |= lines should be translated into an
rlwinm instruction followed by an or. Actually it should be possible
to transform each of these lines into a single rlwimi instruction
but I don't know how to coerce gcc to reach this level of optimization.

Another way of optimizing this could be:

mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
mask *= 15;

It's not easy to see which of the solutions is faster, the second one
needs to create quite a few constants, but its dependency length is 
lower. It is very likely that the first solution is faster in cache-cold
case and the second in cache-hot. 

Regardless, the original code is rather naïve, larger and slower in all cases,
with timing variation depending on branch mispredictions.
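
As a cross-check, both branchless variants can be compared against the naive
loop over all 256 FM values; a standalone sketch (the function names are
mine, and the top nibble is taken as 0xf here since FEX/VX get masked off
separately):

    #include <assert.h>
    #include <stdint.h>

    /* reference: one 0xf nibble per FM bit, bit 0 -> nibble 0 */
    static uint32_t mask_ref(unsigned int fm)
    {
            uint32_t mask = 0;
            int i;

            for (i = 0; i < 8; i++)
                    if (fm & (1u << i))
                            mask |= 0xfu << (4 * i);
            return mask;
    }

    /* method 1: spread the bits to nibble boundaries, multiply by 15 */
    static uint32_t mask_m1(unsigned int fm)
    {
            uint32_t mask = (fm & 1);

            mask |= (fm << 3) & 0x10;
            mask |= (fm << 6) & 0x100;
            mask |= (fm << 9) & 0x1000;
            mask |= (fm << 12) & 0x10000;
            mask |= (fm << 15) & 0x100000;
            mask |= (fm << 18) & 0x1000000;
            mask |= (fm << 21) & 0x10000000;
            return mask * 15;
    }

    /* method 2: logarithmic spreading, shorter dependency chain */
    static uint32_t mask_m2(unsigned int fm)
    {
            uint32_t mask;

            mask = (fm & 0x0f) | ((fm << 12) & 0x000f0000);
            mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
            mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
            return mask * 15;
    }

    int main(void)
    {
            unsigned int fm;

            for (fm = 0; fm < 256; fm++) {
                    assert(mask_m1(fm) == mask_ref(fm));
                    assert(mask_m2(fm) == mask_ref(fm));
            }
            return 0;
    }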

Regards,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC PATCH] powerpc: add ioremap_wt

2014-02-03 Thread Gabriel Paubert
On Mon, Feb 03, 2014 at 08:16:49AM +0100, Michael Moese wrote:
 Allow for IO memory to be mapped cacheable for performing
 PCI read bursts.
 
 Signed-off-by: Michael Moese michael.mo...@men.de
 ---
  arch/powerpc/include/asm/io.h | 3 +++
  arch/powerpc/mm/pgtable_32.c  | 8 
  2 files changed, 11 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
 index 45698d5..9591fff 100644
 --- a/arch/powerpc/include/asm/io.h
 +++ b/arch/powerpc/include/asm/io.h
 @@ -631,6 +631,8 @@ static inline void iosync(void)
   *
   * * ioremap_wc enables write combining
   *
 + * * ioremap_wc enables write thru

Typo: _wc -> _wt

Looks fine in principle, but there is a significant difference with wc
on x86, where read accesses always go to the bus (no read caching).

Gabriel

 + *
   * * iounmap undoes such a mapping and can be hooked
   *
   * * __ioremap_at (and the pending __iounmap_at) are low level functions to
 @@ -652,6 +654,7 @@ extern void __iomem *ioremap(phys_addr_t address, 
 unsigned long size);
  extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
 unsigned long flags);
  extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
 +extern void __iomem *ioremap_wt(phys_addr_t address, unsigned long size);
  #define ioremap_nocache(addr, size)  ioremap((addr), (size))
  
  extern void iounmap(volatile void __iomem *addr);
 diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
 index 51f8795..9ab0a54 100644
 --- a/arch/powerpc/mm/pgtable_32.c
 +++ b/arch/powerpc/mm/pgtable_32.c
 @@ -141,6 +141,14 @@ ioremap_wc(phys_addr_t addr, unsigned long size)
  EXPORT_SYMBOL(ioremap_wc);
  
  void __iomem *
 +ioremap_wt(phys_addr_t addr, unsigned long size)
 +{
 + return __ioremap_caller(addr, size, _PAGE_WRITETHRU,
 + __builtin_return_address(0));
 +}
 +EXPORT_SYMBOL(ioremap_wt);
 +
 +void __iomem *
  ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
  {
   /* writeable implies dirty for kernel addresses */
 -- 
 1.8.5.3
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: PCIe Access - achieve bursts without DMA

2014-01-31 Thread Gabriel Paubert
On Thu, Jan 30, 2014 at 12:20:21PM +0000, Moese, Michael wrote:
 Hello PPC-developers,
 I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
 located inside our FPGA. On x86-based systems I was able to achieve bursts for
 both read and write access. On PPC32, using an e500v2, I had no success at 
 all 
 so far. 
 I tried using ioremap_wc(), like I did on x86, for writing, and it only 
 results in my
 writes just being single requests, one after another.

I believe that on PPC, write-combine is directly mapped to nocache. I can't
remember if there is a writethrough option for ioremap (but adding it would
probably be relatively easy).

 For reads, I noticed I could not ioremap_cache() on PPC, so I used simple 
 ioremap()
 here. 

You might be able to use ioremap_cache and direct cache control instructions
(dcbf/dcbi) to achieve your goals. This becomes similar to handling machines
with no hardware cache coherency. You have to know the hardware cache line
size to make this work.
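
For illustration, such a flush loop might look like the sketch below (the
helper name is mine, the 32-byte line size is an assumption for e500v2, and
in-tree code would use the kernel's flush_dcache_range() rather than
open-coding it):

    #define L1_LINE 32UL    /* assumed e500v2 line size; check your core */

    /* flush a range out of the data cache so the next loads really go
     * to the device; dcbf both writes back and invalidates the line */
    static void flush_dev_range(void *p, unsigned long len)
    {
            unsigned long addr = (unsigned long)p & ~(L1_LINE - 1);
            unsigned long end  = (unsigned long)p + len;

            for (; addr < end; addr += L1_LINE)
                    asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
            asm volatile("sync" : : : "memory");
    }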

This said, it might be better to mark the memory as guarded and non-coherent 
(WIMG=), I don't know what ioremap_cache does for the MG bits and don't
have the time to look it up right now.

 I used several ways to read from the device, from simple 
 readl(),memcpy_from_io(), 
 memcpy()  to cacheable_memcpy() - with no improvements.  Even when just 
 issuing
 a batch of prefetch()-calls for all the memory to read did not result in read 
 bursts.

If the device data you want to read is supposed to be cacheable (which means 
basically
that the data does not change unexpectedly under you, i.e., is not as volatile 
as
a typical device I/O register), you don't want to use readl() which adds some
synchronization to the read.

Prefetch only works on writeback memory, maybe writethrough; expecting it to 
work on cache-inhibited memory is contradictory.

Regards,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: provide __bswapdi2

2013-05-13 Thread Gabriel Paubert
On Mon, May 13, 2013 at 05:09:59PM +1000, Michael Neuling wrote:
 David Woodhouse dw...@infradead.org wrote:
 
  From: David Woodhouse david.woodho...@intel.com
  
  Some versions of GCC apparently expect this to be provided by libgcc.
  
  Signed-off-by: David Woodhouse david.woodho...@intel.com
  ---
  Untested.
  
  diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
  index 19e096b..f077dc2 100644
  --- a/arch/powerpc/kernel/misc_32.S
  +++ b/arch/powerpc/kernel/misc_32.S
  @@ -657,6 +657,17 @@ _GLOBAL(__ucmpdi2)
  li  r3,2
  blr
   
  +_GLOBAL(__bswapdi2)
  +   rlwinm  10,4,8,0xffffffff
  +   rlwinm  11,3,8,0xffffffff
  +   rlwimi  10,4,24,0,7
  +   rlwimi  11,3,24,0,7
  +   rlwimi  10,4,24,16,23
  +   rlwimi  11,3,24,16,23
  +   mr  4,11
  +   mr  3,10
  +   blr
  +
 
 This doesn't work for me but the below does:
 
 _GLOBAL(__bswapdi2)
   rotlwi  r9,r4,8
   rotlwi  r10,r3,8
   rlwimi  r9,r4,24,0,7
   rlwimi  r10,r3,24,0,7
   rlwimi  r9,r4,24,16,23
   rlwimi  r10,r3,24,16,23
   mr  r4,r10
   mr  r3,r9
   blr
 

Actually, I'd swap the two mr instructions to never
have an instruction that uses the result from the
previous one. 


 stolen from GCC -02 output of:
   unsigned long long __bswapdi2(unsigned long long x)
   {
    return ((x & 0x00000000000000ffULL) << 56) |
       ((x & 0x000000000000ff00ULL) << 40) |
       ((x & 0x0000000000ff0000ULL) << 24) |
       ((x & 0x00000000ff000000ULL) <<  8) |
       ((x & 0x000000ff00000000ULL) >>  8) |
       ((x & 0x0000ff0000000000ULL) >> 24) |
       ((x & 0x00ff000000000000ULL) >> 40) |
       ((x & 0xff00000000000000ULL) >> 56);
   }
 
   _GLOBAL(abs)
  srawi   r4,r3,31
  xor r3,r3,r4
  diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
  index 5cfa800..3b2e6e8 100644
  --- a/arch/powerpc/kernel/misc_64.S
  +++ b/arch/powerpc/kernel/misc_64.S
  @@ -234,6 +234,18 @@ _GLOBAL(__flush_dcache_icache)
  isync
  blr
   
  +_GLOBAL(__bswapdi2)
  +   srdi8,3,32
  +   rlwinm  7,3,8,0xffffffff
  +   rlwimi  7,3,24,0,7
  +   rlwinm  9,8,8,0xffffffff
  +   rlwimi  7,3,24,16,23
  +   rlwimi  9,8,24,0,7
  +   rlwimi  9,8,24,16,23
  +   sldi7,7,32
  +   or  7,7,9
  +   mr  3,7
  +   blr
 
 This works but we should add r to the register names.
 

And merge the last two instructions as a single or r3,r7,r9.
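
For completeness, the C reference quoted above can be checked on any host
against GCC's builtin; a standalone sketch, not part of the patch (it assumes
__builtin_bswap64 is available, i.e. GCC 4.3 or later):

    #include <assert.h>
    #include <stdint.h>

    /* the C fallback quoted above, for checking against the builtin */
    static uint64_t bswapdi2_ref(uint64_t x)
    {
            return ((x & 0x00000000000000ffULL) << 56) |
                   ((x & 0x000000000000ff00ULL) << 40) |
                   ((x & 0x0000000000ff0000ULL) << 24) |
                   ((x & 0x00000000ff000000ULL) <<  8) |
                   ((x & 0x000000ff00000000ULL) >>  8) |
                   ((x & 0x0000ff0000000000ULL) >> 24) |
                   ((x & 0x00ff000000000000ULL) >> 40) |
                   ((x & 0xff00000000000000ULL) >> 56);
    }

    int main(void)
    {
            uint64_t v = 0x0123456789abcdefULL;

            assert(bswapdi2_ref(v) == 0xefcdab8967452301ULL);
            assert(__builtin_bswap64(v) == bswapdi2_ref(v));
            return 0;
    }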


Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: provide __bswapdi2

2013-05-13 Thread Gabriel Paubert
On Mon, May 13, 2013 at 11:38:13AM +0100, David Woodhouse wrote:
 On Mon, 2013-05-13 at 11:33 +0100, David Woodhouse wrote:
  
  On Mon, 2013-05-13 at 09:33 +0200, Gabriel Paubert wrote:
   Actually, I'd swap the two mr instructions to never
   have an instruction that uses the result from the
   previous one. 
  
  Bad GCC. No biscuit.
  
  Should we file a PR? 
 
 Maybe not. If you tell it to tune for an in-order machine like Cell, it
 swaps them round. Although now I'm confused about which of POWER[567]
 were in-order:

It was Power6 IIRC. On this kind of fine point, don't rely too much
on what GCC produces.

 
 [dwmw2@i7 ~]$ powerpc64-linux-gnu-gcc -O2 -S -o- bswapdi2.c -m32  | grep -B1 
 mr
   rlwimi 11,3,24,16,23
   mr 4,11
   mr 3,10
 [dwmw2@i7 ~]$ powerpc64-linux-gnu-gcc -O2 -S -o- bswapdi2.c -m32 -mtune=cell 
 | grep -B1 mr
   rlwimi 11,3,24,16,23
   mr 3,10
   mr 4,11
 [dwmw2@i7 ~]$ powerpc64-linux-gnu-gcc -O2 -S -o- bswapdi2.c -m32 
 -mtune=power5 | grep -B1 mr
   rlwimi 11,3,24,16,23
   mr 3,10
   mr 4,11
 [dwmw2@i7 ~]$ powerpc64-linux-gnu-gcc -O2 -S -o- bswapdi2.c -m32 
 -mtune=power6 | grep -B1 mr
   rlwimi 11,3,24,16,23
   mr 4,11
   mr 3,10
 [dwmw2@i7 ~]$ powerpc64-linux-gnu-gcc -O2 -S -o- bswapdi2.c -m32 
 -mtune=power7 | grep -B1 mr
   rlwimi 11,3,24,16,23
   mr 4,11
   mr 3,10

I don't know of any processor in which putting the mr 3,10 first can cause 
stalls, so
even a generic tuning should put it first.

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 00/14] powerpc: Add support for POWER8 relocation on exceptions

2012-11-09 Thread Gabriel Paubert
On Fri, Nov 09, 2012 at 05:18:58PM +1100, Michael Neuling wrote:
 This set of patches adds support for taking exceptions with the MMU on which 
 is
 supported by POWER8.
 
 A new set of exception vectors is added at 0xc000___4xxx.  When the HW
 takes us here, MSR IR/DR will be set already and we no longer need a costly
 RFID to turn the MMU back on again.
 
 The original 0x0 based exception vectors remain for when the HW can't leave 
 the
 MMU on.  Examples of this are when we can't trust the current the MMU 
 mappings,
^^^
Extra "the" (here and in a couple of patches). 

Can't do much more since I don't have any hardware past G5, but the series
looks nice and avoiding transient excursions to real mode is a good thing.

Gabriel


 like when we are changing from guest to hypervisor (HV 0 -> 1) or when the MMU
 was off already.  In these cases the HW will take us to the original 0x0 based
 exception vectors with the MMU off as before.
 
 The core of these patches were originally written by Matt Evans.  
 
 Ian Munsie (5):
   powerpc: Add set_mode hcall
   powerpc: Add wrappers to enable/disable relocation on exceptions
   powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate
 function
   powerpc: Enable relocation on during exceptions at boot
   powerpc: Disable relocation on exceptions when kexecing
 
 Michael Neuling (9):
   powerpc: Add POWER8 architected mode to cputable
   powerpc: Whitespace changes in exception64s.S
   powerpc: Remove unessessary 0x3000 location enforcement
   powerpc: Make load_hander handle upto 64k offset
   powerpc: Turn syscall handler into macros
   powerpc: Add new macros needed for relocation on exceptions
   powerpc: Add relocation on exception vector handlers
   powerpc: Move initial mfspr LPCR out of __init_LPCR
   powerpc: Setup relocation on exceptions for bare metal systems
 
  arch/powerpc/include/asm/exception-64s.h|   97 ++-
  arch/powerpc/include/asm/firmware.h |4 +-
  arch/powerpc/include/asm/hvcall.h   |   23 +-
  arch/powerpc/include/asm/reg.h  |2 +
  arch/powerpc/kernel/cpu_setup_power.S   |8 +-
  arch/powerpc/kernel/cputable.c  |   15 ++
  arch/powerpc/kernel/exceptions-64s.S|  306 
 +++
  arch/powerpc/kernel/head_64.S   |3 +-
  arch/powerpc/kernel/setup_64.c  |5 +
  arch/powerpc/platforms/pseries/firmware.c   |1 +
  arch/powerpc/platforms/pseries/plpar_wrappers.h |   36 +++
  arch/powerpc/platforms/pseries/setup.c  |   71 ++
  drivers/infiniband/hw/ehca/hcp_if.c |   20 --
  drivers/net/ethernet/ibm/ehea/ehea_phyp.h   |   20 --
  14 files changed, 516 insertions(+), 95 deletions(-)
 
 -- 
 1.7.9.5
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: Add POWER8 architected mode to cputable

2012-11-09 Thread Gabriel Paubert
On Fri, Nov 09, 2012 at 05:26:42PM +1100, Michael Neuling wrote:
 A PVR of 0x0F04 means we are arch v2.07 complicate ie, POWER8.

Huh? 

s/complicate/compliant/ ?

Also ie has to be written with dots (i.e.).

Gabriel

 
 Signed-off-by: Michael Neuling mi...@neuling.org
 
 diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
 index 216ff84..75a3d71 100644
 --- a/arch/powerpc/kernel/cputable.c
 +++ b/arch/powerpc/kernel/cputable.c
 @@ -435,6 +435,21 @@ static struct cpu_spec __initdata cpu_specs[] = {
   .cpu_restore= __restore_cpu_power7,
   .platform   = "power7",
   },
 + {   /* 2.07-compliant processor, i.e. Power8 architected mode */
 + .pvr_mask   = 0x,
 + .pvr_value  = 0x0f04,
 + .cpu_name   = "POWER8 (architected)",
 + .cpu_features   = CPU_FTRS_POWER8,
 + .cpu_user_features  = COMMON_USER_POWER8,
 + .mmu_features   = MMU_FTRS_POWER8,
 + .icache_bsize   = 128,
 + .dcache_bsize   = 128,
 + .oprofile_type  = PPC_OPROFILE_POWER4,
 + .oprofile_cpu_type  = "ppc64/ibm-compat-v1",
 + .cpu_setup  = __setup_cpu_power8,
 + .cpu_restore= __restore_cpu_power8,
 + .platform   = "power8",
 + },
   {   /* Power7 */
   .pvr_mask   = 0x,
   .pvr_value  = 0x003f,
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Probing for native availability of isel from userspace

2012-09-24 Thread Gabriel Paubert
On Sun, Sep 23, 2012 at 03:46:06AM +0200, Segher Boessenkool wrote:
 Why does the kernel emulate this, btw?  I can see emulation is useful
 for running older binaries, for instructions that have been removed
 from the architecture; but for newly added instructions, or optional
 instructions, it hurts more than it helps?

Indeed. I also don't understand why mfpvr is emulated. That's the kind
of information that should be passed to the executables through auxiliary
vectors. After all, you can (or could at least) compile a kernel without
Altivec support and run it on a processor with Altivec.

Therefore, whether Altivec is supported or not, is a matter of
processor and kernel options. Provide this information through
the auxiliary vector and the problem is solved.

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Probing for native availability of isel from userspace

2012-09-24 Thread Gabriel Paubert
On Mon, Sep 24, 2012 at 05:58:37PM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2012-09-24 at 09:55 +0200, Gabriel Paubert wrote:
  On Sun, Sep 23, 2012 at 03:46:06AM +0200, Segher Boessenkool wrote:
   Why does the kernel emulate this, btw?  I can see emulation is useful
   for running older binaries, for instructions that have been removed
   from the architecture; but for newly added instructions, or optional
   instructions, it hurts more than it helps?
  
  Indeed. I also don't understand why mfpvr is emulated. That's the kind
  of information that should be passed to the executables through auxiliary
  vectors. After all, you can (or could at least) compile a kernel without
  Altivec support and run it on a processor with Altivec.
  
  Therefore, whether Altivec is supported or not, is a matter of
  processor and kernel options. Provide this information through
  the auxiliary vector and the problem is solved.
 
 Which we do. mfpvr is available as a fallback (essentially because if we
 don't do it somebody's going to parse /proc/cpuinfo which is arguably
 worse :-)

Fine. But I believe that mfpvr emulation came first, which is the point
I object to (see the mess that making CPUID available to applications
created on x86 when SSE registers were added).

Bottom line, the mapping between PVR and capabilities offered to 
applications should happen in one place, and this place is the kernel. 

 
 We should definitely advertise the availability of isel.

Agreed.
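
For reference, a user-space probe along those lines might look like the
sketch below (assumptions: a glibc with getauxval(), and a kernel that
advertises isel in AT_HWCAP2, as mainline later did with PPC_FEATURE2_ISEL):

    #include <stdio.h>
    #include <sys/auxv.h>

    #ifndef AT_HWCAP2
    #define AT_HWCAP2 26
    #endif
    #ifndef PPC_FEATURE2_ISEL
    #define PPC_FEATURE2_ISEL 0x08000000    /* value from asm/cputable.h */
    #endif

    int main(void)
    {
            unsigned long hwcap2 = getauxval(AT_HWCAP2);

            printf("native isel: %s\n",
                   (hwcap2 & PPC_FEATURE2_ISEL) ? "yes" : "no");
            return 0;
    }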

Cheers,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Probing for native availability of isel from userspace

2012-09-22 Thread Gabriel Paubert
On Sat, Sep 22, 2012 at 02:12:42PM +0400, malc wrote:
 On Sat, 22 Sep 2012, Segher Boessenkool wrote:
 
   Is it possible to determine if _native_ isel is available from userspace
   somehow?
  
  Just try to execute one and catch the SIGILL?
  
 
 Unfortunately my kernel emulates ISEL for me in this case, so i don't
 get any SIGILLs.

Perform a few isels in a loop between a couple of mftb and measure the 
shortest time it takes. Any emulation will take tens of timebase ticks,
hardware implementation will return very small values, perhaps even
0 depending on the relationship between core and timebase frequencies.

I don't remember whether it is necessary to insert an isync between the
two mftb. I believe that even on the most OOO machines, two mftb in
a row will not be reordered to the point that the second executes before 
the first.
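
A concrete version of this probe (a sketch only: the loop count and the
printed threshold interpretation are illustrative, not calibrated):

    #include <stdio.h>

    static inline unsigned long mftb(void)
    {
            unsigned long t;

            asm volatile("mftb %0" : "=r" (t));
            return t;
    }

    int main(void)
    {
            unsigned long best = ~0UL;
            int i;

            for (i = 0; i < 10000; i++) {
                    unsigned long t0, t1;
                    int r;

                    t0 = mftb();
                    /* cr0.eq is CR bit 2; emulation via the program-check
                     * path would cost tens of timebase ticks here */
                    asm volatile("cmpwi 0,%1,0\n\t"
                                 "isel %0,%1,%2,2"
                                 : "=r" (r)
                                 : "b" (i), "r" (42)
                                 : "cr0");
                    t1 = mftb();
                    if (t1 - t0 < best)
                            best = t1 - t0;
            }
            printf("min mftb delta: %lu (small => native isel)\n", best);
            return 0;
    }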

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/2] power: Define PV_POWER7P

2012-07-13 Thread Gabriel Paubert
On Thu, Jul 12, 2012 at 04:59:12PM -0700, Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Tue, 3 Jul 2012 13:32:46 -0700
 Subject: [PATCH 1/2] power: Define PV_POWER7P
 
 This change is based on the patch that Carl Love posted to LKML
 
   https://lkml.org/lkml/2012/6/22/309
 
 It is included here for completeness and to enable building. When
 the above patch is merged, this patch can be ignored.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/reg.h |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
 index f0cb7f4..b3fc2c1 100644
 --- a/arch/powerpc/include/asm/reg.h
 +++ b/arch/powerpc/include/asm/reg.h
 @@ -1014,6 +1014,7 @@
  #define PV_970FX 0x003C
 #define PV_POWER6    0x003E
 #define PV_POWER7    0x003F
 +#define PV_POWER7P   0x004A
  #define PV_630   0x0040
  #define PV_630p  0x0041
  #define PV_970MP 0x0044

Hmm, before this patch the PVR definitions were sorted in ascending
numerical order, at least for the list of 64 bit processors. Your
patch breaks this, which is not a good idea IMHO. 

For example, the 970* processors are already interspersed with other
processors to maintain numerical order, therefore I don't see why the 
POWER7P could not be between 970GX and BE.

Another inconsistency is that all other plus variants seem to 
use a lower case p suffix. So it would be better to use POWER7p.

Regards,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: linux-next: build failure after merge of the final tree (powerpc related)

2012-06-21 Thread Gabriel Paubert
On Thu, Jun 21, 2012 at 03:36:01PM +1000, Michael Ellerman wrote:
 On Wed, 2012-06-20 at 17:50 +1000, Stephen Rothwell wrote:
  Hi all,
  
  After merging the final tree, today's linux-next build (powerpc
  allyesconfig) failed like this:
  
  powerpc64-linux-ld: arch/powerpc/net/built-in.o: In function 
  `bpf_slow_path_word':
  (.text+0x90): sibling call optimization to `skb_copy_bits' does not allow 
  automatic multiple TOCs; recompile with -mminimal-toc or 
  -fno-optimize-sibling-calls, or make `skb_copy_bits' extern
 
 
 Those seem to be caused because we don't have a nop after the call,
 meaning we can't patch the TOC pointer on the way back. Adding a nop
 fixes those.
 
 But, then I get 32,410 variants of this:
 
 powerpc64-linux-ld: 
 /src/next/net/openvswitch/vport-netdev.c:189:(.text+0x89b990): 
   sibling call optimization to `_restgpr0_28' does not allow automatic 
 multiple TOCs;
   recompile with -mminimal-toc or -fno-optimize-sibling-calls, or make 
 `_restgpr0_28' extern
 
 

These functions should not need a TOC in the first place. There is
code in the linker (for 64 bit only: bfd/elf64-ppc.c) to automatically 
generate them whenever they are needed.

I suspect you compile with -Os. But I don't think you can use
these functions when doing a sibling call since restgpr0_nn
implies a return to the caller. restgpr1_nn would be different...

 And those are generated calls so I don't see how we can fix them.
 
  I started building with gcc 4.6.3/binutils 2.22 today.  gcc
  4.6.0/binutils 2.21 do not produce this error, it produces this instead
  (which has been happening for a long time):
  
  powerpc64-linux-ld: TOC section size exceeds 64k
 
 
 So presumably there's some new error checking that we're hitting, I
 imagine it was always broken, but now it's being more explicit.

I'm not so sure. I suspect gcc, but upgrading gcc and binutils at the
same time may not be the wisest...

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: kernel panic during kernel module load (powerpc specific part)

2012-06-05 Thread Gabriel Paubert
On Tue, Jun 05, 2012 at 08:00:42AM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2012-06-04 at 13:03 +0200, Gabriel Paubert wrote:
  There is no conflict to the ABI. These functions are supposed to be
  directly reachable from whatever code
  section may need them.
  
  Now I have a question: how did you get the need for this?
  
  None of my kernels uses them:
  - if I compile with -O2, the compiler simply expands epilogue and
  prologue to series of lwz and stw
  - if I compile with -Os, the compiler generates lmw/stmw which give
  the smallest possible cache footprint
  
  Neither did I find a single reference to these functions in several
  systems that I grepped for.
 
 Newer gcc's ... even worse, with -Os, it does it for a single register
 spill afaik. At least that's how I hit it with grub2 using whatever gcc
 is in fc17.

Ok, it's beyond stupid, and at this point must be considered a gcc bug.

Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: kernel panic during kernel module load (powerpc specific part)

2012-06-05 Thread Gabriel Paubert
On Tue, Jun 05, 2012 at 08:00:42AM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2012-06-04 at 13:03 +0200, Gabriel Paubert wrote:
  There is no conflict to the ABI. These functions are supposed to be
  directly reachable from whatever code
  section may need them.
  
  Now I have a question: how did you get the need for this?
  
  None of my kernels uses them:
  - if I compile with -O2, the compiler simply expands epilogue and
  prologue to series of lwz and stw
  - if I compile with -Os, the compiler generates lmw/stmw which give
  the smallest possible cache footprint
  
  Neither did I find a single reference to these functions in several
  systems that I grepped for.
 
 Newer gcc's ... even worse, with -Os, it does it for a single register
 spill afaik. At least that's how I hit it with grub2 using whatever gcc
 is in fc17.

Well, I've not yet been able to make it call the save/restore routines,
but here on a machine with Debian testing and several gcc installed:

- gcc-4.4 never generates lmw/stmw, it always uses individual
 lwz/stw whatever options are set (-Os -mmultiple).

- gcc-4.6 and gcc-4.7 behave identically, if -Os is set, they
  generate by default lmw/stmw. But if I combine -Os with 
  -mno-multiple, they call the helper functions.

In other words, on this system, gcc-4.4 is broken but should not
cause any harm. gcc-4.6 and gcc-4.7 look correct, but are there
any processors on which -mno-multiple is worth setting?
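
A minimal test case to reproduce this (a sketch: which of stmw/lmw, inline
stw/lwz, or the _savegpr_NN/_restgpr_NN helpers you get depends on the gcc
version and flags, as described above):

    /* spill.c -- enough values live across calls to need several
     * non-volatile registers, so the prologue/epilogue must save and
     * restore them:
     *
     *   powerpc-linux-gcc -Os -S spill.c
     *   powerpc-linux-gcc -Os -mno-multiple -S spill.c
     */
    long f(long, long, long, long);

    long spill(long x)
    {
            long a = f(x, 1, 2, 3);
            long b = f(x, 4, 5, 6);
            long c = f(a, b, x, 7);
            long d = f(b, a, x, 8);

            return f(a, b, c, d) + a + b + c + d;
    }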

Regards,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: kernel panic during kernel module load (powerpc specific part)

2012-06-04 Thread Gabriel Paubert
On Fri, Jun 01, 2012 at 11:33:37AM +0000, Wrobel Heinz-R39252 wrote:
   I believe that the basic premise is that you should provide a directly
   reachable copy of the save/rstore functions, even if this means that
  you need several copies of the functions.
  
  I just fixed a very similar problem with grub2 in fact. It was using r0
  and trashing the saved LR that way.
  
  The real fix is indeed to statically link those gcc helpers, we
  shouldn't generate things like cross-module calls inside function prologs
  and epilogues, when stackframes aren't even guaranteed to be reliable.
  
  However, in the grub2 case, it was easier to just use r12 :-)
 
 For not just the module loading case, I believe r12 is the only real solution 
 now. I checked one debugger capable of doing ELF download. It also uses r12 
 for trampoline code. I am guessing for the reason that prompted this 
 discussion.
 

I disagree. Look carefully at Ben's answer: cross-module calls
are intrinsically dangerous when stack frames are in a transient
state.

 Without r12 we'd have to change standard libraries to automagically link in 
 gcc helpers for any conceivable non-.text section, which I am not sure is 
 feasible. How would you write section independent helper functions which link 
 to any section needing them?!

I don't think that it is that bad: the helpers should be linked to the
default .text section when needed; typically the init code and so on are
mapped within the reach of that section (otherwise you'll end up with the
linker complaining that it finds overflowing branch offsets between .text
and .init.text).

 Asking users to create their own section specific copy of helper functions is 
 definitely not portable if the module or other code is not architecture 
 dependent.

Well, it automagically works on 64 bit. There it is performed by magic built 
into the linker.

 It is a normal gcc feature that you can assign specific code to non-.text 
 sections and it is not documented that it may crash depending on the OS arch 
 the ELF is built for, so asking for a Power Architecture specific change on 
 tool libs to make Power Architecture Linux happy seems a bit much to ask.
 

Once again I disagree.

 Using r12 in any Linux related trampoline code seems a reachable goal, and it 
 would eliminate the conflict to the ABI.
 

There is no conflict to the ABI. These functions are supposed to be directly 
reachable from whatever code
section may need them.

Now I have a question: how did you get the need for this?

None of my kernels uses them:
- if I compile with -O2, the compiler simply expands epilogue and prologue to 
series of lwz and stw
- if I compile with -Os, the compiler generates lmw/stmw which give the 
smallest possible cache footprint

Neither did I find a single reference to these functions in several systems 
that I grepped for.

Regards,
Gabriel
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: kernel panic during kernel module load (powerpc specific part)

2012-05-31 Thread Gabriel Paubert
On Thu, May 31, 2012 at 07:04:42AM +0000, Wrobel Heinz-R39252 wrote:
 Michael,
 
  On Wed, 2012-05-30 at 16:33 +0200, Steffen Rumler wrote:
   I've found the following root cause:
  
(6) Unfortunately, the trampoline code (do_plt_call()) is using
  register r11 to setup the jump.
  It looks like the prologue and epilogue are using also the
  register r11, in order to point to the previous stack frame.
  This is a conflict !!! The trampoline code is damaging the
  content of r11.
  
  Hi Steffen,
  
  Great bug report!
  
  I can't quite work out what the standards say, the versions I'm looking
  at are probably old anyway.
 
 The ABI supplement from https://www.power.org/resources/downloads/ is 
 explicit about r11 being a requirement for the statically linked save/restore 
 functions in section 3.3.4 on page 59.
 
 This means that any trampoline code must not ever use r11 or it can't be used 
 to get to such save/restore functions safely from far away.

I believe that the basic premise is that you should provide a directly
reachable copy of the save/restore functions, even if this means that you
need several copies of the functions.

 
 Unfortunately the same doc and predecessors show r11 in all basic examples 
 for PLT/trampoline code AFAICS, which is likely why all trampoline code uses 
 r11 in any known case.
 
 I would guess that it was never envisioned that compiler generated code would 
 be in a different section than save/restore functions, i.e., the Linux module 
 __init assumptions for Power break the ABI. Or does the ABI break the 
 __init concept?!
 
 Using r12 in the trampoline seems to be the obvious solution for module 
 loading.
 
 But what about other code loading done? If, e.g., a user runs any app from 
 bash it gets loaded and relocated and trampolines might get set up somehow.

I don't think so. The linker/whatever should generate a copy of the 
save/restore functions for every
executable code area (shared library), and probably more than one copy if the 
text becomes too large.

For 64 bit code, these functions are actually inserted by the linker.

[Ok, I just recompiled my 64 bit kernel with -Os and I see that vmlinux gets 
one copy
of the save/restore functions and every module also gets its copy.]

This makes sense: these functions exist as a compromise between locality 
and performance, so there should be one copy per code section; otherwise the 
cache line used by the trampoline negates a large part of their advantage.
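
For illustration only, here is a sketch of how a 32 bit module PLT stub could 
be built around r12 instead of r11, in the style of the existing module 
loader; this is my own sketch with assumed encodings, not the actual fix:

/* Hypothetical r12-based trampoline. r11 must be left untouched because
 * the ABI's save/restore helpers expect their pointer in it, while r12
 * carries no such contract at this point. 'addr' is the jump target. */
static void build_plt_stub(unsigned int *stub, unsigned long addr)
{
	stub[0] = 0x3d800000 | (((addr + 0x8000) >> 16) & 0xffff); /* lis   r12,addr@ha */
	stub[1] = 0x398c0000 | (addr & 0xffff);                    /* addi  r12,r12,addr@l */
	stub[2] = 0x7d8903a6;                                      /* mtctr r12 */
	stub[3] = 0x4e800420;                                      /* bctr */
}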

 
 Wouldn't we have to find and fix ANY trampoline code generator remotely 
 related to a basic Power Architecture Linux?
 Or is it a basic assumption for anything but modules that compiler generated 
 code may never ever be outside the .text section? I am not sure that would be 
 a safe assumption.
 
 Isn't this problem going beyond just module loading for Power Architecture 
 Linux?

I don't think so. It really seems to be a 32 bit kernel problem.

Regards,
Gabriel


Re: pread() and pwrite() system calls

2012-05-25 Thread Gabriel Paubert
On Fri, May 25, 2012 at 02:29:06PM +0100, David Laight wrote:
 We have a system with linux 2.6.32 and the somewhat archaic
 uClibc 0.9.27 (but I'm not sure the current version is
 any better, and I think there are binary compatibility
 issues if we update).
 
 I've just discovered that pread() is 'implemented'
 by using 3 lseek() system calls and read().
 (the same is true for the 64bit versions).
 
 I thought that pread() was supposed to be atomic
 (so usable concurrently by multiple threads), which
 means that this implementation is completely broken.
 
 I've not looked to see what glibc does.
 
 I can see that part of the problem is the alignment
 of the 64bit value on the argument list of syscall()
 (when the register save area is cast to a syscall
 argument structure).
 But it also looks as though the uClibc syscall()
 stub can only pass 5 arguments in registers, and
 pread() (with an aligned 64bit offset) requires 6.
 
 The uClibc source seems to be predicated on __NR_pread,
 but if that were defined it would try to call
 __syscall_pread() and I can't find that anywhere.
 
 A special pread/pwrite asm stub that just copies
 r7 to r0 could be used.
 
 Would it be enough to do:
 syscall_pread_pwrite:
   mr 0,7
   sc
   blr
 and handle the -ve -> errno mapping in C?

Huh? Won't fly, r0 is used for the system call number!

On the other hand, I believed PPC had no problems passing
up to 8 32 bit arguments in registers (r3 to r10), but
I may be confusing it with the standard ABI for function calls.

Hmm, a quick look at kernel/entry_32.s shows that it should 
be able to use at least r3 to r8, which should be sufficient.

I think that it is an uClibc problem.
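
For what it's worth, here is a sketch of how the raw call can be issued from 
userspace on 32 bit PPC, assuming a syscall() stub that forwards six 
arguments (precisely what uClibc's apparently does not do) and the kernel's 
pread64 entry point; after the stub shifts the arguments, the kernel sees 
r3=fd, r4=buf, r5=count, r6=padding, r7:r8=offset:

#define _GNU_SOURCE
#include <unistd.h>
#include <stdint.h>
#include <sys/syscall.h>

/* Atomic pread sketch for 32-bit powerpc: the 64-bit offset must start
 * on an aligned register pair, so a dummy argument pads the odd slot and
 * the high/low halves of the offset land in r7/r8 (big-endian order). */
static ssize_t pread_raw(int fd, void *buf, size_t count, uint64_t off)
{
	return syscall(__NR_pread64, fd, buf, count,
		       0 /* padding for the aligned pair */,
		       (uint32_t)(off >> 32), (uint32_t)off);
}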

Gabriel


Re: [PATCH] tty/serial/pmac_zilog: Fix nobody cared IRQ message

2012-04-29 Thread Gabriel Paubert
On Sat, Apr 28, 2012 at 06:53:49PM -0500, Larry Finger wrote:
 Following commit a79dd5a titled "tty/serial/pmac_zilog: Fix suspend & resume",
 my Powerbook G4 Titanium showed the following stack dump:
 
 [   36.878225] irq 23: nobody cared (try booting with the "irqpoll" option)
 [   36.878251] Call Trace:
 [   36.878291] [dfff3f00] [c000984c] show_stack+0x7c/0x194 (unreliable)
 [   36.878322] [dfff3f40] [c00a6868] __report_bad_irq+0x44/0xf4
 [   36.878339] [dfff3f60] [c00a6b04] note_interrupt+0x1ec/0x2ac
 [   36.878356] [dfff3f80] [c00a48d0] handle_irq_event_percpu+0x250/0x2b8
 [   36.878372] [dfff3fd0] [c00a496c] handle_irq_event+0x34/0x54
 [   36.878389] [dfff3fe0] [c00a753c] handle_fasteoi_irq+0xb4/0x124
 [   36.878412] [dfff3ff0] [c000f5bc] call_handle_irq+0x18/0x28
 [   36.878428] [deef1f10] [c000719c] do_IRQ+0x114/0x1cc
 [   36.878446] [deef1f40] [c0015868] ret_from_except+0x0/0x1c
 [   36.878484] --- Exception: 501 at 0xf497610
 [   36.878489] LR = 0xfdc3dd0
 [   36.878497] handlers:
 [   36.878510] [c02b7424] pmz_interrupt
 [   36.878520] Disabling IRQ #23
 
 From an E-mail exchange about this problem, Andreas Schwab noticed a typo
 that resulted in the wrong condition being tested.
 
 The patch also corrects 2 typos that incorrectly report why an error branch
 is being taken.
 
 Signed-off-by: Larry Finger larry.fin...@lwfinger.net
 ---
 
 Ben,
 
 Any changes you wish to make in the commit message are OK with me.
 
 Larry
 ---
 
 Index: wireless-testing/drivers/tty/serial/pmac_zilog.c
 ===
 --- wireless-testing.orig/drivers/tty/serial/pmac_zilog.c 2012-04-28
 15:51:38.843723074 -0500
 +++ wireless-testing/drivers/tty/serial/pmac_zilog.c  2012-04-28
 18:34:34.053900600 -0500
 @@ -469,7 +469,7 @@
   tty = NULL;
   if (r3 & (CHAEXT | CHATxIP | CHARxIP)) {
   if (!ZS_IS_OPEN(uap_a)) {
 - pmz_debug("ChanA interrupt while open !\n");
 + pmz_debug("ChanA interrupt while not open !\n");

Hmm, I'm not a native English speaker, but I have the feeling that
it would be more grammatically correct to use "opened" instead of "open".

Of course if the message never triggers, it's less of a concern :-)


   goto skip_a;
   }
   write_zsreg(uap_a, R0, RES_H_IUS);
 @@ -493,8 +493,8 @@
   spin_lock(&uap_b->port.lock);
   tty = NULL;
   if (r3 & (CHBEXT | CHBTxIP | CHBRxIP)) {
 - if (!ZS_IS_OPEN(uap_a)) {
 - pmz_debug("ChanB interrupt while open !\n");
 + if (!ZS_IS_OPEN(uap_b)) {
 + pmz_debug("ChanB interrupt while not open !\n");

Ditto.

   goto skip_b;
   }
   write_zsreg(uap_b, R0, RES_H_IUS);

Regards,
Gabriel


Re: linux-3.0.4, mv643xx_eth troubles on Pegasos2 G4

2011-10-18 Thread Gabriel Paubert
On Mon, Oct 17, 2011 at 11:40:54PM +0200, nello martuscielli wrote:
 I'm trying to enable Marvell gigabit Ethernet support but it doesn't work.
 Here is my dmesg; my config is attached.
[snipped]
 via_rhine: v1.10-LK1.5.0 2010-10-09 Written by Donald Becker
 mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
 uhci_hcd :00:0c.2: irq 9, io base 0x1040
 sysfs: cannot create duplicate filename '/class/mdio_bus/0'

I have 2 Pegasos running 3.0, but in my case mv643xx_eth is non-modular
and /sys/class/mdio_bus/0 exists and points to

../../devices/platform/mv643xx_eth.0/mdio_bus/0

which is correct as far as I can tell.

Is it a regression from 3.0 or not? Try to make it non-modular and see
what happens. If it is a regression, could you try to bisect it?

I won't be close enough to the machines to do a regression
hunt myself for a week or three (really, maybe next week,
I don't yet know, but for sure starting on Nov 8th).



 usb usb2: New USB device found, idVendor=1d6b, idProduct=0001
 [ cut here ]
 WARNING: at fs/sysfs/dir.c:455
 Modules linked in: snd_via82xx(+) snd_ac97_codec mv643xx_eth(+)
 via_rhine(+) i2c_viapro(+) ac97_bus ohci_hcd(+) snd_mpu401_uart
 uhci_hcd(+) snd_rawmidi
 NIP: c00fa718 LR: c00fa718 CTR: 
 REGS: ef271c00 TRAP: 0700   Not tainted  (3.0.4)
 MSR: 00029032 EE,ME,CE,IR,DR  CR: 22004428  XER: 
 TASK = ef294c60[94] 'modprobe' THREAD: ef27
 GPR00: c00fa718 ef271cb0 ef294c60 0042 c0008904 0001  
 GPR08: c06b6bd8  22004482 ef271c70 22004422 10024440 1000ba68 
 GPR16: 1000ba44 bf83e324  1000ba58  104410ec 0a30 
 GPR24: c0059210 0124  0001 ef271cd8 ef2ba480 ffef ef344000
 NIP [c00fa718] sysfs_add_one+0x88/0xa0
 LR [c00fa718] sysfs_add_one+0x88/0xa0
 Call Trace:
 [ef271cb0] [c00fa718] sysfs_add_one+0x88/0xa0 (unreliable)
 [ef271cd0] [c00faff4] sysfs_do_create_link+0x134/0x1e0
 [ef271d00] [c0392cf8] device_add+0x204/0x544
 [ef271d40] [c03d67e4] mdiobus_register+0xa4/0x198
 [ef271d60] [f26785a4] mv643xx_eth_shared_probe+0x144/0x428 [mv643xx_eth]
 [ef271d80] [c039685c] platform_drv_probe+0x20/0x30
 [ef271d90] [c0395578] driver_probe_device+0xe4/0x198
 [ef271db0] [c039569c] __driver_attach+0x70/0x98
 [ef271dd0] [c0394614] bus_for_each_dev+0x60/0x90
 [ef271e00] [c03951d0] driver_attach+0x24/0x34
 [ef271e10] [c0394d9c] bus_add_driver+0xbc/0x23c
 [ef271e30] [c0395ac8] driver_register+0xb8/0x144
 [ef271e50] [c0396bb4] platform_driver_register+0x68/0x78
 [ef271e60] [f2680024] mv643xx_eth_init_module+0x24/0x80 [mv643xx_eth]
 [ef271e80] [c000402c] do_one_initcall+0xe0/0x1c0
 [ef271eb0] [c005b438] sys_init_module+0x1600/0x17f4
 [ef271f40] [c0012df8] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0xff62ac0
LR = 0x10003f2c
 Instruction dump:
 807c 7fe4fb78 4bfff469 3c80c060 3884f131 4bf2051d 809d0010 4bf20515
 7c641b78 3c60c060 3863f0fe 484650f9 0fe0 7fe3fb78 4bfa8009 39610020
 ---[ end trace cebed1f190337b77 ]---
 usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
 usb usb2: Product: UHCI Host Controller
 usb usb2: Manufacturer: Linux 3.0.4 uhci_hcd
 usb usb2: SerialNumber: :00:0c.2
 hub 2-0:1.0: USB hub found
 mii_bus 0 failed to register
 mv643xx_eth: probe of mv643xx_eth.0 failed with error -12
 hub 2-0:1.0: 2 ports detected
 ohci_hcd :00:05.0: OHCI Host Controller
 ohci_hcd :00:05.0: new USB bus registered, assigned bus number 3
 Unable to handle kernel paging request for data at address 0x
 ohci_hcd :00:05.0: irq 9, io mem 0x8000
 Faulting instruction address: 0xf267b3a8
 Oops: Kernel access of bad area, sig: 11 [#1]
 PREEMPT CHRP
 Modules linked in: snd_via82xx(+) snd_ac97_codec mv643xx_eth(+)
 via_rhine(+) i2c_viapro(+) ac97_bus ohci_hcd(+) snd_mpu401_uart
 uhci_hcd(+) snd_rawmidi
 NIP: f267b3a8 LR: f267b3a0 CTR: c0394ff4
 REGS: ef271c90 TRAP: 0300   Tainted: GW(3.0.4)
 MSR: 9032 EE,ME,IR,DR  CR: 84004448  XER: 
 DAR: , DSISR: 4000
 TASK = ef294c60[94] 'modprobe' THREAD: ef27
 GPR00:  ef271d40 ef294c60  eec003c0 eec5 ef24bb3c 
 GPR08: ef24bb28 ef8a7600  0001 44004442 10024440 1000ba68 
 GPR16: 1000ba44 bf83e324  1000ba58  104410ec 0a30 
 GPR24: c0059210 c06b68c0 0020 c06b68b8 fff4 eec0 c06b6740 eec003c0
 NIP [f267b3a8] mv643xx_eth_probe+0x98/0x604 [mv643xx_eth]
 LR [f267b3a0] mv643xx_eth_probe+0x90/0x604 [mv643xx_eth]
 Call Trace:
 [ef271d40] [f267b394] mv643xx_eth_probe+0x84/0x604 [mv643xx_eth] (unreliable)
 [ef271d80] [c039685c] platform_drv_probe+0x20/0x30
 [ef271d90] [c0395578] driver_probe_device+0xe4/0x198
 [ef271db0] [c039569c] __driver_attach+0x70/0x98
 [ef271dd0] [c0394614] bus_for_each_dev+0x60/0x90
 [ef271e00] [c03951d0] driver_attach+0x24/0x34
 [ef271e10] [c0394d9c] bus_add_driver+0xbc/0x23c
 [ef271e30] [c0395ac8] driver_register+0xb8/0x144
 [ef271e50] 

Re: [patch 0/4] powerpc: Mark various interrupts IRQF_NO_THREAD

2011-10-06 Thread Gabriel Paubert
On Wed, Oct 05, 2011 at 05:31:48PM +0200, Thomas Gleixner wrote:
 On Wed, 5 Oct 2011, Gabriel Paubert wrote:
 
  On Wed, Oct 05, 2011 at 12:30:49PM -, Thomas Gleixner wrote:
   The following series marks the obvious interrupts IRQF_NO_THREAD to
   prevent forced interrupt threading - no guarantee of completeness :)
   
   The last patch enables the forced threading mechanism in the core
   code, which in turn enables the irqthreaded commandline option.
  
  Is there any description of what interrupt threading means?
 
 That means that the interrupt handler is running in a thread. The
 interrupt itself merely wakes the thread. That's a debugging option in
 mainline - it comes in handy when interrupt handlers crash, as the system
 just kills the thread but stays otherwise alive. If the same
 situation happens in the normal hard interrupt context, then it takes
 the machine down completely.
 
 Aside from that, it's a prerequisite to support PREEMPT_RT on your
 arch/machine.
 
  I'm only looking for a pointer to a web page, a mailing list thread
 
 https://lwn.net/Articles/429690/

Thanks.

 
  (I am no longer subscribed to lkml, too many things to do, so maybe
  it has been discussed but it comes out of the blue on linuxppc-dev), 
  or a well commented git commit?
  
  From followups, I see that cascaded interrupt controllers should
  not be threaded. I suspect that the private VME bridge driver
  (Universe chip) I maintain here will need it: clients request
  a given VME interrupt (level/vector) and specify an interrupt
  handler which is called by the handler for the PCI interrupt.
 
 Can't tell as I don't know what your code looks like. If your
 subinterrupts are registered with the core interrupt system and the
 drivers install their handlers via request_irq(), then yes. If you
 just have your own registration and handling stuff (which you
 shouldn't) then you might be better off to let the dispatch interrupt
 handler be threaded so that all the demuxed ones run in that same
 thread.


Well, for now I have my own registration, because when I originally wrote
the code (over a decade ago), there was no simple way to add potentially
~1800 additional interrupt sources (and the size of the interrupt descriptor
array on a machine with 16MB of RAM was not negligible: it became several
percent of the minimalistic kernel).

This might be feasible now, but I still have boards with the first release 
of the bridge, which has very nasty interrupt related bugs (sometimes it 
incorrectly reports the level, only the vector is correct).
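
For reference, a sketch of what registering the bridge's demux handler 
through the core code could look like, with the cascade itself marked 
IRQF_NO_THREAD so forced threading never delays the demultiplexing step 
(all names below are hypothetical, not taken from my driver):

#include <linux/interrupt.h>

/* Hypothetical demux handler for the VME bridge's PCI interrupt: it
 * would read the bridge status and dispatch to the per-vector handlers
 * that clients registered. */
static irqreturn_t vme_bridge_cascade(int irq, void *dev_id)
{
	/* ... read level/vector from the bridge, call the sub-handlers ... */
	return IRQ_HANDLED;
}

static int vme_bridge_setup_irq(int pci_irq, void *bridge)
{
	return request_irq(pci_irq, vme_bridge_cascade, IRQF_NO_THREAD,
			   "vme-bridge", bridge);
}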

Regards,
Gabriel


Re: [patch 0/4] powerpc: Mark various interrupts IRQF_NO_THREAD

2011-10-05 Thread Gabriel Paubert
On Wed, Oct 05, 2011 at 12:30:49PM -, Thomas Gleixner wrote:
 The following series marks the obvious interrupts IRQF_NO_THREAD to
 prevent forced interrupt threading - no guarantee of completeness :)
 
 The last patch enables the forced threading mechanism in the core
 code, which in turn enables the irqthreaded commandline option.

Is there any description of what interrupt threading means?

I'm only looking for a pointer to a web page, a mailing list thread
(I am no longer subscribed to lkml, too many things to do, so maybe
it has been discussed but it comes out of the blue on linuxppc-dev), 
or a well commented git commit?

From followups, I see that cascaded interrupt controllers should
not be threaded. I suspect that the private VME bridge driver
(Universe chip) I maintain here will need it: clients request
a given VME interrupt (level/vector) and specify an interrupt
handler which is called by the handler for the PCI interrupt.

Regards,
Gabriel


Re: Bug#630845: linux-image-2.6.39-2-powerpc: CHRP Pegasos2 boot failure

2011-06-28 Thread Gabriel Paubert
On Sun, Jun 26, 2011 at 11:14:13PM +0100, Ben Hutchings wrote:
 On Thu, 2011-06-23 at 20:36 +0800, Andrew Buckeridge wrote:
  Package: linux-image-3.0.0-rc3-powerpc
  Version: 3.0.0~rc3-1~experimental.1
  
  On Wed, 22 Jun 2011 04:01:38 +0100
  Ben Hutchings b...@decadent.org.uk wrote:
  
linux-image-2.6.36-trunk-powerpc_2.6.36-1~experimental.1_powerpc.deb
linux-image-2.6.37-1-powerpc_2.6.37-1_powerpc.deb
linux-image-2.6.37-2-powerpc_2.6.37-2_powerpc.deb
These failed to boot. In all cases stuck at the spinner.
   
   At a guess, this may be fixed by a change in Linux 3.0-rc1:
  
   Please can you test Linux 3.0-rc3, currently available in experimental?
  
  linux-image-3.0.0-rc3-powerpc_3.0.0~rc3-1~experimental.1_powerpc.deb
  Also failed to boot and got stuck at spinner.
 
 Gabriel, Michael, do you recognise this bug?  Are there any fixes for
 Pegasos that are missing from 3.0-rc3?

What do you mean by "the spinner"? I've had very long boot times with 
an apparently dead machine depending on graphics options.

For now I'm running 2.6.39 with one patch for keyboard/mouse handling
which is now upstream.

I can try a more recent kernel on Thursday.

Gabriel


Re: [PATCH 2/3] powerpc: POWER7 optimised memcpy using VMX

2011-06-17 Thread Gabriel Paubert
On Fri, Jun 17, 2011 at 02:54:00PM +1000, Anton Blanchard wrote:
 Implement a POWER7 optimised memcpy using VMX. For large aligned
 copies this new loop is over 10% faster and for large unaligned
 copies it is over 200% faster.
 
 On POWER7 unaligned stores rarely slow down - they only flush when
 a store crosses a 4KB page boundary. Furthermore this flush is
 handled completely in hardware and should be 20-30 cycles.
 
 Unaligned loads on the other hand flush much more often - whenever
 crossing a 128 byte cache line, or a 32 byte sector if either sector
 is an L1 miss.
 
 Considering this information we really want to get the loads aligned
 and not worry about the alignment of the stores. Microbenchmarks
 confirm that this approach is much faster than the current unaligned
 copy loop that uses shifts and rotates to ensure both loads and
 stores are aligned.
 
 We also want to try and do the stores in cacheline aligned, cacheline
 sized chunks. If the store queue is unable to merge an entire
 cacheline of stores then the L2 cache will have to do a
 read/modify/write. Even worse, we will serialise this with the stores
 in the next iteration of the copy loop since both iterations hit
 the same cacheline.
 
 Based on this, the new loop does the following things:
 
 
 1 - 127 bytes
 Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
 boring and similar to how the current loop works.
 
 128 - 4095 bytes
 Get the source 8 byte aligned and use 8 byte loads and stores,
 1 cacheline at a time. We aren't doing the stores in cacheline
 aligned chunks so we will potentially serialise once per cacheline.
 Even so it is much better than the loop we have today.
 
 4096 bytes and up
 If both source and destination have the same alignment get them both
 16 byte aligned, then get the destination cacheline aligned. Do
 cacheline sized loads and stores using VMX.
 
 If source and destination do not have the same alignment, we get the
 destination cacheline aligned, and use permute to do aligned loads.
 
 In both cases the VMX loop should be optimal - we always do aligned
 loads and stores and are always doing stores in cacheline aligned,
 cacheline sized chunks.
 
 
 The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:
 
 http://ozlabs.org/~anton/junkcode/copy_to_user.c
 
 (Note that the breakpoint analysis was done with the copy_tofrom_user
 version of the loop and using varying sizes and alignments to read(). 
 It's much easier to create a benchmark using read() that can control
 the size and alignment of a kernel copy loop and synchronise it with
 userspace doing optional VMX instructions).
 
 Since we are using VMX and there is a cost to saving and restoring
 the user VMX state there are two broad cases we need to benchmark:
 
 - Best case - userspace never uses VMX
 
 - Worst case - userspace always uses VMX
 
 In reality a userspace process will sit somewhere between these two
 extremes. Since we need to test both aligned and unaligned copies we
 end up with 4 combinations. The point at which the VMX loop begins to
 win is:
 
 0% VMX
 aligned   2048 bytes
 unaligned 2048 bytes
 
 100% VMX
 aligned   16384 bytes
 unaligned 8192 bytes
 
 Considering this is a microbenchmark, the data is hot in cache and
 the VMX loop has better store queue merging properties we set the
 breakpoint to 4096 bytes, a little below the unaligned breakpoints.
 
 Some future optimisations we can look at:
 
 - Looking at the perf data, a significant part of the cost when a task
   is always using VMX is the extra exception we take to restore the
   VMX state. As such we should do something similar to the x86
   optimisation that restores FPU state for heavy users. ie:
 
 /*
  * If the task has used fpu the last 5 timeslices, just do a full
  * restore of the math state immediately to avoid the trap; the
  * chances of needing FPU soon are obviously high now
  */
 preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
 
   and 
 
 /*
  * fpu_counter contains the number of consecutive context switches
  * that the FPU is used. If this is over a threshold, the lazy fpu
  * saving becomes unlazy to save the trap. This is an unsigned char
  * so that after 256 times the counter wraps and the behavior turns
  * lazy again; this to deal with bursty apps that only use FPU for
  * a short time
  */
 
 - We could create a paca bit to mirror the VMX enabled MSR bit and check
   that first, avoiding multiple calls to enable_kernel_altivec.
 
 - We could have two VMX breakpoints, one for when we know the user VMX
   state is loaded into the registers and one when it isn't. This could
   be a second bit in the paca so we can calculate the break points quickly.
 
 Signed-off-by: Anton Blanchard an...@samba.org
 ---
 
 Index: linux-powerpc/arch/powerpc/lib/Makefile
 

[PATCH] Fix for Pegasos keyboard and mouse

2011-05-13 Thread Gabriel Paubert
[See http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-October/086424.html
and followups. Part of the commit message is directly copied from that.]

Commit 540c6c392f01887dcc96bef0a41e63e6c1334f01 tries to find i8042 IRQs in
the device-tree but doesn't fall back to the old hardcoded 1 and 12 in all
failure cases.

Specifically, the case where the device-tree contains nothing matching
"pnpPNP,303" or "pnpPNP,f03" doesn't seem to be handled well. It sort of falls
through to the old code, but leaves the IRQs set to 0.

Signed-off-by: Gabriel Paubert paub...@iram.es

---

This fix has only been tested on Pegasos, but to my knowledge it only 
affects a Pegasos specific path (all other firmwares should be able
to find the keyboard through the pnp identifiers).

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 21f30cb..6c7abbf 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -602,6 +602,10 @@ int check_legacy_ioport(unsigned long base_port)
 * name instead */
if (!np)
np = of_find_node_by_name(NULL, "8042");
+   if (np) {
+   of_i8042_kbd_irq = 1;
+   of_i8042_aux_irq = 12;
+   }
break;
case FDC_BASE: /* FDC1 */
np = of_find_node_by_type(NULL, "fdc");

