Re: [RFC v7 02/25] powerpc: track allocation status of all pkeys

2017-08-10 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> Ram Pai  writes:
>>  static inline void pkey_initialize(void)
>>  {
>> +int os_reserved, i;
>> +
>>  /* disable the pkey system till everything
>>   * is in place. A patch further down the
>>   * line will enable it.
>>   */
>>  pkey_inited = false;
>> +
>> +/* Lets assume 32 keys */
>> +pkeys_total = 32;
>> +
>> +#ifdef CONFIG_PPC_4K_PAGES
>> +/*
>> + * the OS can manage only 8 pkeys
>> + * due to its inability to represent
>> + * them in the linux 4K-PTE.
>> + */
>> +os_reserved = pkeys_total-8;
>> +#else
>> +os_reserved = 0;
>> +#endif
>> +/*
>> + * Bits are in LE format.
>> + * NOTE: 1, 0 are reserved.
>> + * key 0 is the default key, which allows read/write/execute.
>> + * key 1 is recommended not to be used.
>> + * PowerISA(3.0) page 1015, programming note.
>> + */
>> +initial_allocation_mask = ~0x0;
>> +for (i = 2; i < (pkeys_total - os_reserved); i++)
>> +initial_allocation_mask &= ~(0x1<>  }
>>  #endif /*_ASM_PPC64_PKEYS_H */
>
> In v6, key 31 was also reserved, but it's not in this version. Is this
> intentional?

That whole thing could be replaced with two constants.

Except it can't, because we can't just hard code the number of keys. It
needs to come either from the device tree or be based on the CPU we're
running on.

> Isn't it better for this function to be in pkeys.c? Ideally, functions
> should be in .c files not in headers unless they're very small or
> performance sensitive IMHO.

Yes. No reason for that to be in a header AFAICS.

cheers


Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread David Gibson
On Thu, Aug 10, 2017 at 09:36:04PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-08-10 at 09:19 +0200, Cédric Le Goater wrote:
> > > > > > +  /* Perform the acknowledge hypervisor to register cycle */
> > > > > > +  ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG));
> > > > > 
> > > > > Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of
> > > > > the higher level IO helpers?
> > > > 
> > > > This is one of the many ways to do MMIOs on the TIMA. This memory 
> > > > region defines a set of offsets and sizes for which loads and 
> > > > stores have different effects. 
> > > > 
> > > > See the arch/powerpc/include/asm/xive-regs.h file and 
> > > > arch/powerpc/kvm/book3s_xive_template.c for some more usage.
> > > 
> > > Sure, much like any IO region.  My point is, why do you want this kind
> > > of complex combo, rather than say an in_be16() or readw_be().
> > > 
> > 
> > The code is inherited from the native backend. 
> > 
> > I think this is because we know what we are doing and we skip 
> > the synchronization routines of the higher level IO helpers. 
> > That might not be necessary for sPAPR. Ben ? 
> 
> It's a performance optimisation, we know we don't need the
> sync,twi,isync crap of the higher level accessor here.

Ok.  A comment mentioning this would be good - unless you're very
familiar with the code and hardware it's not going to be obvious if
the nonstandard IO accessors are for legitimate performance reasons,
or just cargo-culting.


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH net-next] fsl/fman: implement several errata workarounds

2017-08-10 Thread Florinel Iordache
Implemented workarounds for the following dTSEC Erratum:
A002, A004, A0012, A0014, A004839 on several operations
that involve MAC CFG register changes: adjust link,
rx pause frames, modify MAC address.

Signed-off-by: Florinel Iordache 
---
 drivers/net/ethernet/freescale/fman/fman_dtsec.c | 118 ++-
 1 file changed, 93 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.c 
b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
index 98bba10..ea43b49 100644
--- a/drivers/net/ethernet/freescale/fman/fman_dtsec.c
+++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
@@ -123,7 +123,7 @@
 #define DTSEC_ECNTRL_R100M 0x0008
 #define DTSEC_ECNTRL_QSGMIIM   0x0001
 
-#define DTSEC_TCTRL_GTS0x0020
+#define TCTRL_GTS  0x0020
 
 #define RCTRL_PAL_MASK 0x001f
 #define RCTRL_PAL_SHIFT16
@@ -863,6 +863,52 @@ int dtsec_cfg_pad_and_crc(struct fman_mac *dtsec, bool 
new_val)
return 0;
 }
 
+static void graceful_start(struct fman_mac *dtsec, enum comm_mode mode)
+{
+   struct dtsec_regs __iomem *regs = dtsec->regs;
+
+   if (mode & COMM_MODE_TX)
+   iowrite32be(ioread32be(>tctrl) &
+   ~TCTRL_GTS, >tctrl);
+   if (mode & COMM_MODE_RX)
+   iowrite32be(ioread32be(>rctrl) &
+   ~RCTRL_GRS, >rctrl);
+}
+
+static void graceful_stop(struct fman_mac *dtsec, enum comm_mode mode)
+{
+   struct dtsec_regs __iomem *regs = dtsec->regs;
+   u32 tmp;
+
+   /* Graceful stop - Assert the graceful Rx stop bit */
+   if (mode & COMM_MODE_RX) {
+   tmp = ioread32be(>rctrl) | RCTRL_GRS;
+   iowrite32be(tmp, >rctrl);
+
+   if (dtsec->fm_rev_info.major == 2) {
+   /* Workaround for dTSEC Errata A002 */
+   usleep_range(100, 200);
+   } else {
+   /* Workaround for dTSEC Errata A004839 */
+   usleep_range(10, 50);
+   }
+   }
+
+   /* Graceful stop - Assert the graceful Tx stop bit */
+   if (mode & COMM_MODE_TX) {
+   if (dtsec->fm_rev_info.major == 2) {
+   /* dTSEC Errata A004: Do not use TCTRL[GTS]=1 */
+   pr_debug("GTS not supported due to DTSEC_A004 
Errata.\n");
+   } else {
+   tmp = ioread32be(>tctrl) | TCTRL_GTS;
+   iowrite32be(tmp, >tctrl);
+
+   /* Workaround for dTSEC Errata A0012, A0014 */
+   usleep_range(10, 50);
+   }
+   }
+}
+
 int dtsec_enable(struct fman_mac *dtsec, enum comm_mode mode)
 {
struct dtsec_regs __iomem *regs = dtsec->regs;
@@ -880,13 +926,8 @@ int dtsec_enable(struct fman_mac *dtsec, enum comm_mode 
mode)
 
iowrite32be(tmp, >maccfg1);
 
-   /* Graceful start - clear the graceful receive stop bit */
-   if (mode & COMM_MODE_TX)
-   iowrite32be(ioread32be(>tctrl) & ~DTSEC_TCTRL_GTS,
-   >tctrl);
-   if (mode & COMM_MODE_RX)
-   iowrite32be(ioread32be(>rctrl) & ~RCTRL_GRS,
-   >rctrl);
+   /* Graceful start - clear the graceful Rx/Tx stop bit */
+   graceful_start(dtsec, mode);
 
return 0;
 }
@@ -899,23 +940,8 @@ int dtsec_disable(struct fman_mac *dtsec, enum comm_mode 
mode)
if (!is_init_done(dtsec->dtsec_drv_param))
return -EINVAL;
 
-   /* Gracefull stop - Assert the graceful transmit stop bit */
-   if (mode & COMM_MODE_RX) {
-   tmp = ioread32be(>rctrl) | RCTRL_GRS;
-   iowrite32be(tmp, >rctrl);
-
-   if (dtsec->fm_rev_info.major == 2)
-   usleep_range(100, 200);
-   else
-   udelay(10);
-   }
-
-   if (mode & COMM_MODE_TX) {
-   if (dtsec->fm_rev_info.major == 2)
-   pr_debug("GTS not supported due to DTSEC_A004 
errata.\n");
-   else
-   pr_debug("GTS not supported due to DTSEC_A0014 
errata.\n");
-   }
+   /* Graceful stop - Assert the graceful Rx/Tx stop bit */
+   graceful_stop(dtsec, mode);
 
tmp = ioread32be(>maccfg1);
if (mode & COMM_MODE_RX)
@@ -933,11 +959,19 @@ int dtsec_set_tx_pause_frames(struct fman_mac *dtsec,
  u16 pause_time, u16 __maybe_unused thresh_time)
 {
struct dtsec_regs __iomem *regs = dtsec->regs;
+   enum comm_mode mode = COMM_MODE_NONE;
u32 ptv = 0;
 
if (!is_init_done(dtsec->dtsec_drv_param))
return -EINVAL;
 
+   if ((ioread32be(>rctrl) & RCTRL_GRS) == 0)
+   mode |= COMM_MODE_RX;
+   if ((ioread32be(>tctrl) & TCTRL_GTS) == 0)
+ 

Re: [RFC v7 25/25] powerpc: Enable pkey subsystem

2017-08-10 Thread Thiago Jung Bauermann

Ram Pai  writes:
> --- a/arch/powerpc/include/asm/cputable.h
> +++ b/arch/powerpc/include/asm/cputable.h
> @@ -214,6 +214,7 @@ enum {
>  #define CPU_FTR_DAWR LONG_ASM_CONST(0x0400)
>  #define CPU_FTR_DABRX
> LONG_ASM_CONST(0x0800)
>  #define CPU_FTR_PMAO_BUG LONG_ASM_CONST(0x1000)
> +#define CPU_FTR_PKEY LONG_ASM_CONST(0x2000)
>  #define CPU_FTR_POWER9_DD1   LONG_ASM_CONST(0x4000)
>
>  #ifndef __ASSEMBLY__
> @@ -452,7 +453,7 @@ enum {
>   CPU_FTR_DSCR | CPU_FTR_SAO  | CPU_FTR_ASYM_SMT | \
>   CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
>   CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | \
> - CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX)
> + CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX | CPU_FTR_PKEY)

P7 supports protection keys for data access (AMR) but not for
instruction access (IAMR), right? There's nothing in the code making
this distinction, so either CPU_FTR_PKEY shouldn't be enabled in P7 or
separate feature bits for AMR and IAMR should be used and checked before
trying to access the IAMR.

>  #define CPU_FTRS_POWER8 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
>   CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
>   CPU_FTR_MMCRA | CPU_FTR_SMT | \
> @@ -462,7 +463,7 @@ enum {
>   CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
>   CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
>   CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
> - CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP)
> + CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_PKEY)
>  #define CPU_FTRS_POWER8E (CPU_FTRS_POWER8 | CPU_FTR_PMAO_BUG)
>  #define CPU_FTRS_POWER8_DD1 (CPU_FTRS_POWER8 & ~CPU_FTR_DBELL)
>  #define CPU_FTRS_POWER9 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
> @@ -474,7 +475,8 @@ enum {
>   CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
>   CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
>   CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
> - CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
> + CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | \
> + CPU_FTR_PKEY)
>  #define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
>(~CPU_FTR_SAO))
>  #define CPU_FTRS_CELL(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index a1cfcca..acd59d8 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -188,6 +188,7 @@ static inline bool arch_vma_access_permitted(struct 
> vm_area_struct *vma,
>
>  #define pkey_initialize()
>  #define pkey_mm_init(mm)
> +#define pkey_mmu_values(total_data, total_execute)
>
>  static inline int vma_pkey(struct vm_area_struct *vma)
>  {
> diff --git a/arch/powerpc/include/asm/pkeys.h 
> b/arch/powerpc/include/asm/pkeys.h
> index ba7bff6..e61ed6c 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -1,6 +1,8 @@
>  #ifndef _ASM_PPC64_PKEYS_H
>  #define _ASM_PPC64_PKEYS_H
>
> +#include 
> +
>  extern bool pkey_inited;
>  extern int pkeys_total; /* total pkeys as per device tree */
>  extern u32 initial_allocation_mask;/* bits set for reserved keys */
> @@ -227,6 +229,24 @@ static inline void pkey_mm_init(struct mm_struct *mm)
>   mm->context.execute_only_pkey = -1;
>  }
>
> +static inline void pkey_mmu_values(int total_data, int total_execute)
> +{
> + /*
> +  * since any pkey can be used for data or execute, we
> +  * will  just  treat all keys as equal and track them
> +  * as one entity.
> +  */
> + pkeys_total = total_data + total_execute;
> +}

Right now this works because the firmware reports 0 execute keys in the
device tree, but if (when?) it is fixed to report 32 execute keys as
well as 32 data keys (which are the same keys), any place using
pkeys_total expecting it to mean the number of keys that are available
will be broken. This includes pkey_initialize and mm_pkey_is_allocated.

Perhaps pkeys_total should use total_data as the number of keys
supported in the system, and total_execute just as a flag to say whether
there's a IAMR? Or, since P8 and later have IAMR and P7 is unlikely to
have the firmware fixed, maybe the kernel should just ignore
total_execute altogether?

> +static inline bool pkey_mmu_enabled(void)
> +{
> + if (firmware_has_feature(FW_FEATURE_LPAR))
> + return pkeys_total;
> + else
> + return cpu_has_feature(CPU_FTR_PKEY);
> +}
> +
>  static inline void pkey_initialize(void)
>  {
>   int os_reserved, i;
> @@ -236,9 +256,17 @@ static inline void pkey_initialize(void)
>* line will 

Re: [RFC v7 24/25] powerpc: Deliver SEGV signal on pkey violation

2017-08-10 Thread Thiago Jung Bauermann

Ram Pai  writes:

> The value of the AMR register at the time of exception
> is made available in gp_regs[PT_AMR] of the siginfo.
>
> The value of the pkey, whose protection got violated,
> is made available in si_pkey field of the siginfo structure.

Should the IAMR also be made available?

Also, should the AMR and IAMR be accesible to userspace (e.g., to GDB)
via ptrace and the core file?

> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
> mcontext __user *frame,
>  (unsigned long) >tramp[2]);
>   }
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (__put_user(get_paca()->paca_amr, >mc_gregs[PT_AMR]))
> + return 1;
> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>   return 0;
>  }

frame->mc_gregs[PT_AMR] has 32 bits, but paca_amr has 64 bits. Does this
work as intended?

> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index c83c115..86a4262 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user 
> *sc,
>   if (set != NULL)
>   err |=  __put_user(set->sig[0], >oldmask);
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + err |= __put_user(get_paca()->paca_amr, >gp_regs[PT_AMR]);
> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
>   return err;
>  }

Isn't a corresponding change needed in restore_sigcontext? And in the
corresponding TM versions setup_tm_sigcontexts and restore_tm_sigcontexts?

Ditto for the equivalent functions in signal_32.c.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [RFC v7 09/25] powerpc: store and restore the pkey state across context switches

2017-08-10 Thread Thiago Jung Bauermann

Ram Pai  writes:
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -1096,6 +1097,13 @@ static inline void save_sprs(struct thread_struct *t)
>   t->tar = mfspr(SPRN_TAR);
>   }
>  #endif
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (arch_pkeys_enabled()) {
> + t->amr = mfspr(SPRN_AMR);
> + t->iamr = mfspr(SPRN_IAMR);
> + t->uamor = mfspr(SPRN_UAMOR);
> + }
> +#endif
>  }

Is it worth having a flag in thread_struct saying whether it has every
called pkey_alloc and only do the mfsprs if it did?

> @@ -1131,6 +1139,16 @@ static inline void restore_sprs(struct thread_struct 
> *old_thread,
>   mtspr(SPRN_TAR, new_thread->tar);
>   }
>  #endif
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (arch_pkeys_enabled()) {
> + if (old_thread->amr != new_thread->amr)
> + mtspr(SPRN_AMR, new_thread->amr);
> + if (old_thread->iamr != new_thread->iamr)
> + mtspr(SPRN_IAMR, new_thread->iamr);
> + if (old_thread->uamor != new_thread->uamor)
> + mtspr(SPRN_UAMOR, new_thread->uamor);
> + }
> +#endif
>  }
>
>  struct task_struct *__switch_to(struct task_struct *prev,
> @@ -1689,6 +1707,13 @@ void start_thread(struct pt_regs *regs, unsigned long 
> start, unsigned long sp)
>   current->thread.tm_tfiar = 0;
>   current->thread.load_tm = 0;
>  #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (arch_pkeys_enabled()) {
> + current->thread.amr   = 0x0ul;
> + current->thread.iamr  = 0x0ul;
> + current->thread.uamor = 0x0ul;
> + }
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>  }
>  EXPORT_SYMBOL(start_thread);

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-10 Thread Reza Arbab

On Thu, Aug 10, 2017 at 11:50:19AM -0500, Reza Arbab wrote:

On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f830562..24ecf53 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned 
long node)
size = 0x8000ul - base;
}
memblock_add(base, size);
+   memblock_mark_hotplug(base, size);
} while (--rngs);
}
memblock_dump_all();


Doing this has the effect of putting all the affected memory into 
ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
kernel allocations can occur there. Is that okay?


I should clarify. The behavior change I mention applies when 
movable_node_is_enabled().


--
Reza Arbab



Re: [PATCH] powerpc/64s: Add support for ASB_Notify on POWER9

2017-08-10 Thread christophe lombard

Le 07/08/2017 à 05:08, Michael Ellerman a écrit :

Hi Christophe,

I'm not across any of the details of this so hopefully most of these
comments aren't too stupid :)


Christophe Lombard  writes:


The POWER9 core supports a new feature: ASB_Notify which requires the
support of the Special Purpose Register: TIDR.

TIDR is defined in ISA 3.0B, which would be worth mentioning.

Can you spell out what an ASB_Notifiy is.


In addition to waking a thread up from the wait state via an
interrupt, the thread can also be removed from the wait state
via a ASB_notify command.
We see the acronyme ASB present in several documents, but
I never saw the definition of ASB.


The ASB_Notify command, generated by the AFU, will attempt to

And please tell us what an AFU is, this patch and the code it touches
are not obviously CAPI related, so the reader may not know what an AFU
is, or the broader context.


You are right, this patch is not obviously CAPI related.
The Coherent Accelerator Process Interface (CAPI) is a general
term for the infrastructure of attaching a coherent
accelerator (CAPI card) to a POWER system.
The AFU (Accelerator Function Unit), located on the CAPI card,
executes computation-heavy functions helping the
application running on the host processor. There is a direct
communication between the application and the accelerator.
The purpose of an AFU is to provide applications with a higher
computational unit to improve the performance of the
application and off-load the host processor.



wake-up the host thread identified by the particular LPID:PID:TID.

Host implies it doesn't work for threads in guests, but then LPID
implies it does.

You say "PID" a few times here, but I think you always mean PIDR? It
would be worth spelling that out, otherwise people will assume you're
talking about the Linux "pid" (which is actually the thread group id
(TGID)), which is not the same as PIDR.


You are completely right.  The use of the term 'PIDR' is much
more accurate.


The special register TIDR has to be updated to with the same value
present in the process element.

What's a process element? :)


An other CAPI term :-)
When an application (running on the host) requests use of an AFU,
a process element is added to the process-element linked list
that describes the application’s process state. The process element
also contains a work element descriptor (WED) provided by the
application. The WED contains the full description of the job to be
performed by the AFU.


If the length of the register TIDR is 64bits, the CAPI Translation

Is it 64-bits? The architecture says it is.


yes. The length is 64-bits.


Service Layer core (XSL9) for Power9 systems limits the size (16bits) of
the Thread ID when it generates the ASB_Notify message adding
PID:LPID:TID information from the context.

When you say "Thread ID" here I think you're referring to the "TID" in
the "PID:LPID:TID" tuple, and you're saying it is limited to 16-bits, by
the P9 CAPI hardware.


correct.


The content of the internal kernel Thread ID (32bits) can not therefore

"as returned by sys_gettid()" would help to disambiguate here.


be used to fulfill the register TIDR.

.. if you're intention is to use TIDR with CAPI.


I'm assuming here that by "kernel Thread ID" you mean the kernel "pid"
which is returned to userspace as the "TID" by gettid().


correct.



That value is global in the system, so it's overkill for this usage, as
all you need is a value unique to all threads that share a PIDR
(== mm_struct).

So it seems like we could also solve this by simply having a counter in
the mm_struct (actually mm_context_t), and handing that out when
userspace tries to read TIDR.

Or we could do the same and have some other API for reading it, ie. not
using mfspr() but an actual CAPI API. Or does userspace even need the
raw value?


Your remarks are very interesting. I need more time to look
into that.

Is there a usecase where multiple threads would assign themselves the
same TIDR so that all (some? one?) of them is woken up by the
ASB_notify? Or is that an error?


Normally no. Only one thread can have a non zero TIDR value.


This patch allows to avoid this limitation by adding a new interface

This doesn't avoid the 16-bit CAPI limitation AFAICS. What it does is
let (force) userspace manage the values of TIDR, and therefore it can
control them such that they fit in 16-bits.


right. The control will be done by libcxl. Libcxl is the focal point
to manage the values of TIDR.


for the user. The instructions mfspr/mtspr SPRN_TIDR are emulated,
save/restore SPRs (context switch) are updated and a new feature
(CPU_FTR_TIDR) is added to POWER9 system.

I'm not clear if we need a CPU feature.

TIDR is defined in ISA 3.0B, so anything that implements ISA 3.0B must
implemented TIDR.

It's not entirely clear if the kernel's CPU_FTR_ARCH_300 means 3.0 or
3.0B, but given 3.0B is essentially "what got implemented in P9" it
should probably be 

Re: [RFC v7 02/25] powerpc: track allocation status of all pkeys

2017-08-10 Thread Thiago Jung Bauermann

Ram Pai  writes:
>  static inline void pkey_initialize(void)
>  {
> + int os_reserved, i;
> +
>   /* disable the pkey system till everything
>* is in place. A patch further down the
>* line will enable it.
>*/
>   pkey_inited = false;
> +
> + /* Lets assume 32 keys */
> + pkeys_total = 32;
> +
> +#ifdef CONFIG_PPC_4K_PAGES
> + /*
> +  * the OS can manage only 8 pkeys
> +  * due to its inability to represent
> +  * them in the linux 4K-PTE.
> +  */
> + os_reserved = pkeys_total-8;
> +#else
> + os_reserved = 0;
> +#endif
> + /*
> +  * Bits are in LE format.
> +  * NOTE: 1, 0 are reserved.
> +  * key 0 is the default key, which allows read/write/execute.
> +  * key 1 is recommended not to be used.
> +  * PowerISA(3.0) page 1015, programming note.
> +  */
> + initial_allocation_mask = ~0x0;
> + for (i = 2; i < (pkeys_total - os_reserved); i++)
> + initial_allocation_mask &= ~(0x1<  }
>  #endif /*_ASM_PPC64_PKEYS_H */

In v6, key 31 was also reserved, but it's not in this version. Is this
intentional?

Isn't it better for this function to be in pkeys.c? Ideally, functions
should be in .c files not in headers unless they're very small or
performance sensitive IMHO.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-10 Thread Laurent Dufour
On 10/08/2017 15:43, Kirill A. Shutemov wrote:
> On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
>> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
>>> On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
 On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection agains the VMA modification done in :
>>  - madvise()
>>  - mremap()
>>  - mpol_rebind_policy()
>>  - vma_replace_policy()
>>  - change_prot_numa()
>>  - mlock(), munlock()
>>  - mprotect()
>>  - mmap_region()
>>  - collapse_huge_page()
>
> I don't thinks it's anywhere near complete list of places where we touch
> vm_flags. What is your plan for the rest?

 The goal is only to protect places where change to the VMA is impacting the
 page fault handling. If you think I missed one, please advise.
>>>
>>> That's very fragile approach. We rely here too much on specific compiler 
>>> behaviour.
>>>
>>> Any write access to vm_flags can, in theory, be translated to several
>>> write accesses. For instance with setting vm_flags to 0 in the middle,
>>> which would result in sigfault on page fault to the vma.
>>
>> Indeed, just setting vm_flags to 0 will not result in sigfault, the real
>> job is done when the pte are updated and the bits allowing access are
>> cleared. Access to the pte is controlled by the pte lock.
>> Page fault handler is triggered based on the pte bits, not the content of
>> vm_flags and the speculative page fault is checking for the vma again once
>> the pte lock is held. So there is no concurrency when dealing with the pte
>> bits.
> 
> Suppose we are getting page fault to readable VMA, pte is clear at the
> time of page fault. In this case we need to consult vm_flags to check if
> the vma is read-accessible.
> 
> If by the time of check vm_flags happend to be '0' we would get SIGSEGV as
> the vma appears to be non-readable.
> 
> Where is my logic faulty?

The speculative page fault handler will not deliver the signal, if the page
fault can't be done in the speculative path for instance because the
vm_flags are not matching the required one, the speculative page fault is
aborted and the *classic* page fault handler is run which will do the job
again grabbing the mmap_sem.

>> Regarding the compiler behaviour, there are memory barriers and locking
>> which should prevent that.
> 
> Which locks barriers are you talking about?

When the VMA is modified and that the changes will impact the speculative
page fault handler the sequence count is touch using write_seqcount_begin()
and write_seqcount_end(). These 2 services contains calls to smp_wmb().
On the speculative path side, the calls to *_read_seqcount() contains also
memory barriers calls.

> We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere.

I don't think READ_ONCE/WRITE_ONCE would help here, as they would not
prevent reading transcient state as the vm_flags example you mentioned.

That said, there are not so much VMA's fields used in the SPF's path and
caching them into the vmf structure under the control of the VMA's sequence
count would solve this.
I'll try to move in that direction unless anyone has a better idea.


Cheers,
Laurent.



Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-10 Thread Reza Arbab

On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f830562..24ecf53 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned 
long node)
size = 0x8000ul - base;
}
memblock_add(base, size);
+   memblock_mark_hotplug(base, size);
} while (--rngs);
}
memblock_dump_all();


Doing this has the effect of putting all the affected memory into 
ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
kernel allocations can occur there. Is that okay?



diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 671a45d..180d25a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -12,6 +12,7 @@
#include 
#include 
#include 
+#include 

#include 
#include 
@@ -255,15 +256,25 @@ static void __init radix_init_pgtable(void)
{
unsigned long rts_field;
struct memblock_region *reg;
+   phys_addr_t addr;
+   u64 lmb_size = memory_block_size_bytes();

/* We don't support slb for radix */
mmu_slb_size = 0;
/*
 * Create the linear mapping, using standard page size for now
 */
-   for_each_memblock(memory, reg)
-   WARN_ON(create_physical_mapping(reg->base,
-   reg->base + reg->size));
+   for_each_memblock(memory, reg) {
+   if (memblock_is_hotpluggable(reg)) {
+   for (addr = reg->base; addr < (reg->base + reg->size);
+   addr += lmb_size)
+   WARN_ON(create_physical_mapping(addr,
+   addr + lmb_size));
+   } else {
+   WARN_ON(create_physical_mapping(reg->base,
+   reg->base + reg->size));
+   }
+   }

/* Find out how many PID bits are supported */
if (cpu_has_feature(CPU_FTR_HVMODE)) {
--
2.7.4



--
Reza Arbab



Re: [PATCH] powerpc/pseries: Check memory device state before onlining/offlining

2017-08-10 Thread Laurent Vivier
On 02/08/2017 20:03, Nathan Fontenot wrote:
> When DLPAR adding or removing memory we need to check the device
> offline status before trying to online/offline the memory. This is
> needed because calls device_online() and device_offline() will return
> non-zero for memory that is already online and offline respectively.
> 
> This update resolves two scenarios. First, for kernel built with
> auto-online memory enabled, memory will be onlined as part of calls
> to add_memory(). After adding the memory the pseries dlpar code tries
> to online it and fails since the memory is already online. The dlpar
> code then tries to remove the memory which produces the oops message
> below because the memory is not offline.
> 
> The second scenario occurs when removing memory that is already offline,
> i.e. marking memory offline (via sysfs) and the trying to remove that
> memory. This doesn't work because offlining the already offline memory
> does not succeed and the dlpar code then fails the dlpar remove operation.
> 
> The fix for both scenarios is to check the device.offline status before
> making the calls to device_online() or device_offline().
> 
> kernel BUG at mm/memory_hotplug.c:2189!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2048
> NUMA
> pSeries
> CPU: 0 PID: 5 Comm: kworker/u129:0 Not tainted 4.12.0-rc3 #272
> Workqueue: pseries hotplug workque .pseries_hp_work_fn
> task: c003f9c89200 task.stack: c003f9d1
> NIP: c02ca428 LR: c02ca3cc CTR: c0ba16a0
> REGS: c003f9d13630 TRAP: 0700   Not tainted  (4.12.0-rc3)
> MSR: 8282b032 
>   CR: 22002024  XER: 000a
> CFAR: c02ca3d0 SOFTE: 1
> GPR00: c02ca3cc c003f9d138b0 c1bb0200 0001
> GPR04: c003fb143c80 c003fef21630 0003 0002
> GPR08: 0003 0003 0003 31b1
> GPR12: 28002042 cfd8 c0118ae0 c003fb170180
> GPR16:  0004 0010 c00379c8
> GPR20: c0037b68 c003f728ff84 0002 0010
> GPR24: 0002 c003f728ff80 0002 0001
> GPR28: c003fb143c38 0002 1000 2000
> NIP [c02ca428] .remove_memory+0xb8/0xc0
> LR [c02ca3cc] .remove_memory+0x5c/0xc0
> Call Trace:
> [c003f9d138b0] [c02ca3cc] .remove_memory+0x5c/0xc0 (unreliable)
> [c003f9d13940] [c00938a4] .dlpar_add_lmb+0x384/0x400
> [c003f9d13a30] [c009456c] .dlpar_memory+0x5dc/0xca0
> [c003f9d13af0] [c008ce84] .handle_dlpar_errorlog+0x74/0xe0
> [c003f9d13b70] [c008cf1c] .pseries_hp_work_fn+0x2c/0x90
> [c003f9d13bf0] [c0110a5c] .process_one_work+0x17c/0x460
> [c003f9d13c90] [c0110dc8] .worker_thread+0x88/0x500
> [c003f9d13d70] [c0118c3c] .kthread+0x15c/0x1a0
> [c003f9d13e30] [c000ba18] .ret_from_kernel_thread+0x58/0xc0
> Instruction dump:
> 7fe3fb78 4bd7c845 6000 7fa3eb78 4bfdd3c9 38210090 e8010010 eba1ffe8
> ebc1fff0 ebe1fff8 7c0803a6 4bfdc2ac <0fe0>  7c0802a6 fb01ffc0
> 
> Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged 
> memory'")
> Signed-off-by: Nathan Fontenot 

tested the first scenario with 4.13.0-rc4 and qemu 2.10.0-rc2.

Tested-by: Laurent Vivier 
Reviewed-by: Laurent Vivier 


Re: [PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()

2017-08-10 Thread Khalid Aziz

On 08/10/2017 07:20 AM, Michael Ellerman wrote:

Khalid Aziz  writes:


A protection flag may not be valid across entire address space and
hence arch_validate_prot() might need the address a protection bit is
being set on to ensure it is a valid protection flag. For example, sparc
processors support memory corruption detection (as part of ADI feature)
flag on memory addresses mapped on to physical RAM but not on PFN mapped
pages or addresses mapped on to devices. This patch adds address to the
parameters being passed to arch_validate_prot() so protection bits can
be validated in the relevant context.

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
v7:
- new patch

  arch/powerpc/include/asm/mman.h | 2 +-
  arch/powerpc/kernel/syscalls.c  | 2 +-
  include/linux/mman.h| 2 +-
  mm/mprotect.c   | 2 +-
  4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f699341..bc74074304a2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot)
return false;
return true;
  }
-#define arch_validate_prot(prot) arch_validate_prot(prot)
+#define arch_validate_prot(prot, addr) arch_validate_prot(prot)


This can be simpler, as just:

#define arch_validate_prot arch_validate_prot



Hi Michael,

Thanks for reviewing!

My patch expands parameter list for arch_validate_prot() from one to two 
parameters. Existing powerpc version of arch_validate_prot() is written 
with one parameter. If I use the above #define, compilation fails with:


mm/mprotect.c: In function ‘do_mprotect_pkey’:
mm/mprotect.c:399: error: too many arguments to function 
‘arch_validate_prot’


Another way to solve it would be to add the new addr parameter to 
powerpc version of arch_validate_prot() but I chose the less disruptive 
solution of tackling it through #define and expanded the existing 
#define to include the new parameter. Make sense?


Thanks,
Khalid


Re: [PATCH 11/13] powerpc/64s: idle POWER9 can execute stop without ptesync

2017-08-10 Thread Nicholas Piggin
On Thu, 10 Aug 2017 23:15:50 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/kernel/idle_book3s.S   |  7 ---
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24   
> 
> If you can split this into a KVM and non-KVM patch that would be
> helpful.

Yes I can do that.



Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9

2017-08-10 Thread Nicholas Piggin
On Thu, 10 Aug 2017 23:14:46 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > POWER9 CPUs have independent MMU contexts per thread so KVM
> > does not have to bring sibling threads into real-mode when
> > switching MMU mode to guest. This can simplify POWER9 sleep/wake
> > paths and avoids hwsyncs.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/kvm_book3s_asm.h |  4 
> >  arch/powerpc/kernel/idle_book3s.S |  8 ++-
> >  arch/powerpc/kvm/book3s_hv.c  | 37 
> > ++-
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  8 +++  
> 
> This will need to go via, or at least be shared with Paul's tree.
> 
> So if it's possible, splitting it out of this series would be easier.

I agree it's really a KVM patch, but patch 12 depends on this, 
it is a Linux patch. Not sure how you want to handle that?

Thanks,
Nick


Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-10 Thread Kirill A. Shutemov
On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> >> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> >>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
>  The VMA sequence count has been introduced to allow fast detection of
>  VMA modification when running a page fault handler without holding
>  the mmap_sem.
> 
>  This patch provides protection agains the VMA modification done in :
>   - madvise()
>   - mremap()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()
> >>>
> >>> I don't thinks it's anywhere near complete list of places where we touch
> >>> vm_flags. What is your plan for the rest?
> >>
> >> The goal is only to protect places where change to the VMA is impacting the
> >> page fault handling. If you think I missed one, please advise.
> > 
> > That's very fragile approach. We rely here too much on specific compiler 
> > behaviour.
> > 
> > Any write access to vm_flags can, in theory, be translated to several
> > write accesses. For instance with setting vm_flags to 0 in the middle,
> > which would result in sigfault on page fault to the vma.
> 
> Indeed, just setting vm_flags to 0 will not result in sigfault, the real
> job is done when the pte are updated and the bits allowing access are
> cleared. Access to the pte is controlled by the pte lock.
> Page fault handler is triggered based on the pte bits, not the content of
> vm_flags and the speculative page fault is checking for the vma again once
> the pte lock is held. So there is no concurrency when dealing with the pte
> bits.

Suppose we are getting page fault to readable VMA, pte is clear at the
time of page fault. In this case we need to consult vm_flags to check if
the vma is read-accessible.

If by the time of check vm_flags happend to be '0' we would get SIGSEGV as
the vma appears to be non-readable.

Where is my logic faulty?

> Regarding the compiler behaviour, there are memory barriers and locking
> which should prevent that.

Which locks barriers are you talking about?

We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere.

-- 
 Kirill A. Shutemov


Re: [PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()

2017-08-10 Thread Michael Ellerman
Khalid Aziz  writes:

> A protection flag may not be valid across entire address space and
> hence arch_validate_prot() might need the address a protection bit is
> being set on to ensure it is a valid protection flag. For example, sparc
> processors support memory corruption detection (as part of ADI feature)
> flag on memory addresses mapped on to physical RAM but not on PFN mapped
> pages or addresses mapped on to devices. This patch adds address to the
> parameters being passed to arch_validate_prot() so protection bits can
> be validated in the relevant context.
>
> Signed-off-by: Khalid Aziz 
> Cc: Khalid Aziz 
> ---
> v7:
>   - new patch
>
>  arch/powerpc/include/asm/mman.h | 2 +-
>  arch/powerpc/kernel/syscalls.c  | 2 +-
>  include/linux/mman.h| 2 +-
>  mm/mprotect.c   | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> index 30922f699341..bc74074304a2 100644
> --- a/arch/powerpc/include/asm/mman.h
> +++ b/arch/powerpc/include/asm/mman.h
> @@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot)
>   return false;
>   return true;
>  }
> -#define arch_validate_prot(prot) arch_validate_prot(prot)
> +#define arch_validate_prot(prot, addr) arch_validate_prot(prot)

This can be simpler, as just:

#define arch_validate_prot arch_validate_prot

cheers


Re: [PATCH 11/13] powerpc/64s: idle POWER9 can execute stop without ptesync

2017-08-10 Thread Michael Ellerman
Nicholas Piggin  writes:

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/idle_book3s.S   |  7 ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 

If you can split this into a KVM and non-KVM patch that would be
helpful.

cheers


Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9

2017-08-10 Thread Michael Ellerman
Nicholas Piggin  writes:

> POWER9 CPUs have independent MMU contexts per thread so KVM
> does not have to bring sibling threads into real-mode when
> switching MMU mode to guest. This can simplify POWER9 sleep/wake
> paths and avoids hwsyncs.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/kvm_book3s_asm.h |  4 
>  arch/powerpc/kernel/idle_book3s.S |  8 ++-
>  arch/powerpc/kvm/book3s_hv.c  | 37 
> ++-
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  8 +++

This will need to go via, or at least be shared with Paul's tree.

So if it's possible, splitting it out of this series would be easier.

cheers


Re: [PATCH 08/13] powerpc/64s: irq replay remove spurious irq reason

2017-08-10 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Sun, 06 Aug 2017 09:00:32 +1000
> Benjamin Herrenschmidt  wrote:
>
>> On Sun, 2017-08-06 at 03:02 +1000, Nicholas Piggin wrote:
>> > HVI interrupts have always used 0x500, so remove the dead branch.  
>> 
>> Maybe we should fix that and "catch" in incorrect entry via 0x500
>> which would mean the XIVE is trying to deliver guest irqs to the OS...
>
> I should be more clear, when I say 0x500, it is only in reference to
> the constant used by the soft-irq replay. After patch 6 the replay is
> sent to the 0xea0 common handler.
>
>> That can happen if some LPCR bits aren't set properly and/or KVM
>> doesn't pull the guest in time. I had bugs like that in my early
>> dev so I've been running with a b . at 0x500 for a while :-)
>
> So that's a separate issue of hardware actually doing a 0x500. I
> had this
>
> http://patchwork.ozlabs.org/patch/750033/

Hmm, yeah I was going to merge that.

But I got side tracked by HVICE, ie. the "interrupts don't go to 0x500"
is only true if we set HVICE.

Which we currently always do, in __setup_cpu_power9(), but then we have
a DT CPU feature for it, so we really shouldn't be always enabling
HVICE, we should leave it up to the DT CPU feature code.

And then it becomes controlled by the device tree, which makes your
patch potentially wrong depending on the device tree CPU features.

Ugh.

cheers


Re: [RFC Part1 PATCH v3 05/17] x86, realmode: Don't decrypt trampoline area under SEV

2017-08-10 Thread Tom Lendacky

On 7/26/2017 11:03 AM, Borislav Petkov wrote:

  Subject: x86/realmode: ...


Done.



On Mon, Jul 24, 2017 at 02:07:45PM -0500, Brijesh Singh wrote:

From: Tom Lendacky 

When SEV is active the trampoline area will need to be in encrypted
memory so only mark the area decrypted if SME is active.

Signed-off-by: Tom Lendacky 
Signed-off-by: Brijesh Singh 
---
  arch/x86/realmode/init.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 1f71980..c7eeca7 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -63,9 +63,11 @@ static void __init setup_real_mode(void)
/*
 * If SME is active, the trampoline area will need to be in
 * decrypted memory in order to bring up other processors
-* successfully.
+* successfully. For SEV the trampoline area needs to be in
+* encrypted memory, so only do this for SME.


Or simply say:

"It is not needed for SEV."


Will do.

Thanks,
Tom





Re: [1/6] powerpc: NMI IPI improve lock primitive

2017-08-10 Thread Michael Ellerman
On Wed, 2017-08-09 at 12:41:21 UTC, Nicholas Piggin wrote:
> When the NMI IPI lock is contended, spin at low SMT priority, using
> loads only, and with interrupts enabled (where possible). This
> improves behaviour under high contention (e.g., a system crash when
> a number of CPUs are trying to enter the debugger).
> 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/0459ddfdb31e7d812b555a2530ecba

cheers


Re: powerpc/configs: Re-enable HARD/SOFT lockup detectors

2017-08-10 Thread Michael Ellerman
On Wed, 2017-08-09 at 10:57:55 UTC, Michael Ellerman wrote:
> In commit 05a4a9527931 ("kernel/watchdog: split up config options"),
> CONFIG_LOCKUP_DETECTOR was split into two separate config options,
> CONFIG_HARDLOCKUP_DETECTOR and CONFIG_SOFTLOCKUP_DETECTOR.
> 
> Our defconfigs still have CONFIG_LOCKUP_DETECTOR=y, but that is no longer
> user selectable, and we don't mention the new options, so we end up with
> none of them enabled.
> 
> So update the defconfigs to turn on the new SOFT and HARD options, the
> end result being the same as what we had previously.
> 
> Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/7310d5c8c55e8987a854507f71022f

cheers


Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread Benjamin Herrenschmidt
On Thu, 2017-08-10 at 09:19 +0200, Cédric Le Goater wrote:
> > > > > +  /* Perform the acknowledge hypervisor to register cycle */
> > > > > +  ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG));
> > > > 
> > > > Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of
> > > > the higher level IO helpers?
> > > 
> > > This is one of the many ways to do MMIOs on the TIMA. This memory 
> > > region defines a set of offsets and sizes for which loads and 
> > > stores have different effects. 
> > > 
> > > See the arch/powerpc/include/asm/xive-regs.h file and 
> > > arch/powerpc/kvm/book3s_xive_template.c for some more usage.
> > 
> > Sure, much like any IO region.  My point is, why do you want this kind
> > of complex combo, rather than say an in_be16() or readw_be().
> > 
> 
> The code is inherited from the native backend. 
> 
> I think this is because we know what we are doing and we skip 
> the synchronization routines of the higher level IO helpers. 
> That might not be necessary for sPAPR. Ben ? 

It's a performance optimisation, we know we don't need the
sync,twi,isync crap of the higher level accessor here.

Cheers,
Ben.



Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread Benjamin Herrenschmidt
On Thu, 2017-08-10 at 08:45 +0200, Cédric Le Goater wrote:
> > The problem with doorbells on POWER9 guests is that they may have
> > to trap and be emulated by the hypervisor, since the guest threads
> > on P9 don't have to match the HW threads of the core.
> 
> Well, the pseries cause_ipi() handler does :
> 
> if (doorbell_try_core_ipi(cpu))
> return;
> 
> to limit the doorbells to the same core. So we should be fine ?

No. It's theorically possible to create a guest that think it has 4
threads on P9 but those threads run on different cores of the host.

The doorbells are useful if KVM uses a "P8 style" whole-core dispatch
model or with PowerVM. We should probably invent some kind of DT
property to tell the guest I suppoes.

>  If not
> I suppose we should check CPU_FTR_ARCH_300 and use IPIs only for XIVE.
> 
> > Thus it's quite possible that using XIVE for IPIs is actually faster
> > than doorbells in that case.
> 
> How can we measure that ? ebizzy may be.

Or a simple socket ping pong with processes pinned to different
threads.

However the current KVM for P9 doesn't do threads yet afaik.

Cheers,
Ben.


[PATCH v2 2/3] livepatch: send a fake signal to all blocking tasks

2017-08-10 Thread Miroslav Benes
Live patching consistency model is of LEAVE_PATCHED_SET and
SWITCH_THREAD. This means that all tasks in the system have to be marked
one by one as safe to call a new patched function. Safe means when a
task is not (sleeping) in a set of patched functions. That is, no
patched function is on the task's stack. Another clearly safe place is
the boundary between kernel and userspace. The patching waits for all
tasks to get outside of the patched set or to cross the boundary. The
transition is completed afterwards.

The problem is that a task can block the transition for quite a long
time, if not forever. It could sleep in a set of patched functions, for
example.  Luckily we can force the task to leave the set by sending it a
fake signal, that is a signal with no data in signal pending structures
(no handler, no sign of proper signal delivered). Suspend/freezer use
this to freeze the tasks as well. The task gets TIF_SIGPENDING set and
is woken up (if it has been sleeping in the kernel before) or kicked by
rescheduling IPI (if it was running on other CPU). This causes the task
to go to kernel/userspace boundary where the signal would be handled and
the task would be marked as safe in terms of live patching.

There are tasks which are not affected by this technique though. The
fake signal is not sent to kthreads. They should be handled in a
different way. They can be woken up so they leave the patched set and
their TIF_PATCH_PENDING can be cleared thanks to stack checking.

For the sake of completeness, if the task is in TASK_RUNNING state but
not currently running on some CPU it doesn't get the IPI, but it would
eventually handle the signal anyway. Second, if the task runs in the
kernel (in TASK_RUNNING state) it gets the IPI, but the signal is not
handled on return from the interrupt. It would be handled on return to
the userspace in the future when the fake signal is sent again. Stack
checking deals with these cases in a better way.

If the task was sleeping in a syscall it would be woken by our fake
signal, it would check if TIF_SIGPENDING is set (by calling
signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with
ERESTART* return values are restarted in case of the fake signal (see
do_signal()). EINTR is propagated back to the userspace program. This
could disturb the program, but...

* each process dealing with signals should react accordingly to EINTR
  return values.
* syscalls returning EINTR happen to be quite common situation in the
  system even if no fake signal is sent.
* freezer sends the fake signal and does not deal with EINTR anyhow.
  Thus EINTR values are returned when the system is resumed.

The very safe marking is done in architectures' "entry" on syscall and
interrupt/exception exit paths, and in a stack checking functions of
livepatch.  TIF_PATCH_PENDING is cleared and the next
recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also
call klp_update_patch_state() before do_signal(), so that
recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING
immediately and thus prevent a double call of do_signal().

Note that the fake signal is not sent to stopped/traced tasks. Such task
prevents the patching to finish till it continues again (is not traced
anymore).

Last, sending the fake signal is not automatic. It is done only when
admin requests it by writing 1 to force sysfs attribute in livepatch
sysfs directory.

Signed-off-by: Miroslav Benes 
Cc: Oleg Nesterov 
Cc: Michael Ellerman 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: x...@kernel.org
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |  4 ++-
 Documentation/livepatch/livepatch.txt|  5 ++-
 arch/powerpc/kernel/signal.c |  6 ++--
 arch/x86/entry/common.c  |  6 ++--
 kernel/livepatch/core.c  |  9 --
 kernel/livepatch/transition.c| 40 
 kernel/livepatch/transition.h|  1 +
 kernel/signal.c  |  4 ++-
 8 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch 
b/Documentation/ABI/testing/sysfs-kernel-livepatch
index b7a487ca8852..45f4e3551d27 100644
--- a/Documentation/ABI/testing/sysfs-kernel-livepatch
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -16,9 +16,11 @@ Contact: live-patch...@vger.kernel.org
The attribute allows administrator to affect the course of an
existing transition.
 
-   Reading from the file returns all available operations.
+   Reading from the file returns all available operations, which
+   may be "signal" (signalling remaining tasks).
 
Writing 

Re: [PATCH] scsi-mq: Always unprepare before requeuing a request

2017-08-10 Thread Michael Ellerman
"Martin K. Petersen"  writes:
>> One of the two scsi-mq functions that requeue a request unprepares a
>> request before requeueing (scsi_io_completion()) but the other
>> function not (__scsi_queue_insert()). Make sure that a request is
>> unprepared before requeuing it.
>
> Applied to 4.13/scsi-fixes. Thanks much!

This seems to be preventing my Power8 box, which uses IPR, from booting.

Bisect said so:

# first bad commit: [270065e92c317845d69095ec8e3d18616b5b39d5] scsi: scsi-mq: 
Always unprepare before requeuing a request

And if I revert that on top of next-20170809 my system boots again.

The symptom is that it just gets "stuck" during boot and never gets to
mounting root, full log below, the end is:

  .
  ready
  ready
  sd 0:2:4:0: [sde] 554287104 512-byte logical blocks: (284 GB/264 GiB)
  sd 0:2:4:0: [sde] 4096-byte physical blocks
  sd 0:2:5:0: [sdf] 272646144 512-byte logical blocks: (140 GB/130 GiB)
  sd 0:2:5:0: [sdf] 4096-byte physical blocks
  sd 0:2:4:0: [sde] Write Protect is off
  sd 0:2:4:0: [sde] Mode Sense: 0b 00 00 08
  sd 0:2:5:0: [sdf] Write Protect is off
  sd 0:2:5:0: [sdf] Mode Sense: 0b 00 00 08


And it just sits there for at least hours.

I compared a good and bad boot log and there appears to be essentially
no difference. Certainly nothing that looks suspicous.

Any ideas?

cheers



[   29.492780] kexec_core: Starting new kernel
Allocated 2621440 bytes for 2048 pacas at cfd8
Page sizes from device-tree:
base_shift=12: shift=12, sllp=0x, avpnm=0x, tlbiel=1, penc=0
base_shift=12: shift=16, sllp=0x, avpnm=0x, tlbiel=1, penc=7
base_shift=12: shift=24, sllp=0x, avpnm=0x, tlbiel=1, penc=56
base_shift=16: shift=16, sllp=0x0110, avpnm=0x, tlbiel=1, penc=1
base_shift=16: shift=24, sllp=0x0110, avpnm=0x, tlbiel=1, penc=8
base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, tlbiel=0, penc=0
base_shift=34: shift=34, sllp=0x0120, avpnm=0x07ff, tlbiel=0, penc=3
Huge page(16GB) memory: addr = 0x8 size = 0x4 pages = 1
Huge page(16GB) memory: addr = 0xC size = 0x4 pages = 1
Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Using 1TB segments
Initializing hash mmu with SLB
Linux version 4.12.0-gcc-6.3.1-10811-g270065e92c31 (mich...@ka3.ozlabs.ibm.com) 
(gcc version 6.3.1 20170214 (Custom e9096cb27f4bd642)) #384 SMP Thu Aug 10 
19:43:35 AEST 2017
Using pSeries machine description
bootconsole [udbg0] enabled
Partition configured for 96 cpus.
CPU maps initialized for 8 threads per core
 (thread shift is 3)
Freed 2490368 bytes for unused pacas
 -> smp_release_cpus()
spinning_secondaries = 0
 <- smp_release_cpus()
-
ppc64_pft_size= 0x1c
phys_mem_size = 0x10
dcache_bsize  = 0x80
icache_bsize  = 0x80
cpu_features  = 0x07fc7aec18500249
  possible= 0x5fff18500649
  always  = 0x18100040
cpu_user_features = 0xdc0065c2 0xef00
mmu_features  = 0x7c006001
firmware_features = 0x0001c45ffc5f
htab_hash_mask= 0x1f
-
numa:   NODE_DATA [mem 0x36300-0x3]
numa:   NODE_DATA [mem 0x7ffcc2300-0x7ffccbfff]
numa:   NODE_DATA [mem 0x7ffcb8600-0x7ffcc22ff]
numa: NODE_DATA(2) on node 1
numa:   NODE_DATA [mem 0x7ffcae900-0x7ffcb85ff]
numa: NODE_DATA(3) on node 1
node 2 must be removed before remove section 2045
PCI host bridge /pci@8002015  ranges:
 MEM 0x3fc0c000..0x3fc0cfff -> 0xc000 
 MEM 0x3018..0x301b -> 0x0003d018 
PCI host bridge /pci@800201b  ranges:
 MEM 0x3fc2f000..0x3fc2f7ff -> 0xf000 
PCI host bridge /pci@800201e  ranges:
 MEM 0x3fc2c000..0x3fc2dfff -> 0xc000 
 MEM 0x3058..0x305b -> 0x0003d058 
PPC64 nvram contains 15360 bytes
Top of RAM: 0x10, Total RAM: 0x10
Memory hole size: 0MB
Zone ranges:
  DMA  [mem 0x-0x000f]
  DMA32empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x0003]
  node   1: [mem 0x0004-0x0007]
  node   2: [mem 0x0008-0x000b]
  node   3: [mem 0x000c-0x000f]
Initmem setup node 0 [mem 0x-0x0003]
On node 0 totalpages: 262144
  DMA zone: 256 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 262144 pages, LIFO batch:1
Initmem setup node 1 [mem 0x0004-0x0007]
On node 1 totalpages: 262144
  DMA zone: 256 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 262144 pages, LIFO batch:1
Initmem setup node 2 [mem 0x0008-0x000b]
On node 2 totalpages: 262144
  DMA zone: 256 pages used for memmap
 

Re: [PATCH 07/10] powerpc/xive: add XIVE exploitation mode to CAS

2017-08-10 Thread Cédric Le Goater
On 08/08/2017 10:56 AM, Cédric Le Goater wrote:
> On POWER9, the Client Architecture Support (CAS) negotiation process
> determines whether the guest operates in XIVE Legacy compatibility or
> in XIVE exploitation mode.
> 
> Now that we have initial guest support for the XIVE interrupt
> controller, let's inform the hypervisor what we can do.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/kernel/prom_init.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index 613f79f03877..25c14f543bd7 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -177,6 +177,7 @@ struct platform_support {
>   bool hash_mmu;
>   bool radix_mmu;
>   bool radix_gtse;
> + bool xive;
>  };
>  
>  /* Platforms codes are now obsolete in the kernel. Now only used within this
> @@ -1054,6 +1055,12 @@ static void __init prom_parse_platform_support(u8 
> index, u8 val,
>   support->radix_gtse = true;
>   }
>   break;
> + case OV5_INDX(OV5_XIVE_SUPPORT): /* XIVE Exploitation mode */
> + if (val & OV5_FEAT(OV5_XIVE_SUPPORT)) {

This should be :

+   if (val & OV5_FEAT(OV5_XIVE_EXPLOIT)) {

I will fix it in the next version of the patchset.

C.

> + prom_debug("XIVE - exploitation mode\n");
> + support->xive = true;
> + }
> + break;
>   }
>  }
>  
> @@ -1062,7 +1069,8 @@ static void __init prom_check_platform_support(void)
>   struct platform_support supported = {
>   .hash_mmu = false,
>   .radix_mmu = false,
> - .radix_gtse = false
> + .radix_gtse = false,
> + .xive = false
>   };
>   int prop_len = prom_getproplen(prom.chosen,
>  "ibm,arch-vec-5-platform-support");
> @@ -1095,6 +1103,11 @@ static void __init prom_check_platform_support(void)
>   /* We're probably on a legacy hypervisor */
>   prom_debug("Assuming legacy hash support\n");
>   }
> +
> + if (supported.xive) {
> + prom_debug("Asking for XIVE\n");
> + ibm_architecture_vec.vec5.intarch = OV5_FEAT(OV5_XIVE_EXPLOIT);
> + }
>  }
>  
>  static void __init prom_send_capabilities(void)
> 



[FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-10 Thread Bharata B Rao
For a PowerKVM guest, it is possible to specify a DIMM device in
addition to the system RAM at boot time. When such a cold plugged DIMM
device is removed from a radix guest, we hit the following warning in the
guest kernel resulting in the eventual failure of memory unplug:

remove_pud_table: unaligned range
WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 
remove_pagetable+0x468/0xca0
Call Trace:
remove_pagetable+0x464/0xca0 (unreliable)
radix__remove_section_mapping+0x24/0x40
remove_section_mapping+0x28/0x60
arch_remove_memory+0xcc/0x120
remove_memory+0x1ac/0x270
dlpar_remove_lmb+0x1ac/0x210
dlpar_memory+0xbc4/0xeb0
pseries_hp_work_fn+0x1a4/0x230
process_one_work+0x1cc/0x660
worker_thread+0xac/0x6d0
kthread+0x16c/0x1b0
ret_from_kernel_thread+0x5c/0x74

The DIMM memory that is cold plugged gets merged to the same memblock
region as RAM and hence gets mapped at 1G alignment. However since the
removal is done for one LMB (lmb size 256MB) at a time, the address
of the LMB (which is 256MB aligned) would get flagged as unaligned
in remove_pud_table() resulting in the above failure.

This problem is not seen for hot plugged memory because for the
hot plugged memory, the mappings are created separately for each
LMB and hence they all get aligned at 256MB.

To fix this problem for the cold plugged memory, let us mark the
cold plugged memblock region explicitly as HOTPLUGGED so that the
region doesn't get merged with RAM. All the memory that is discovered
via ibm,dynamic-memory-configuration is marked so(1). Next identify
such regions in radix_init_pgtable() and create separate mappings
within that region for each LMB so that they get don't get aligned
like RAM region at 1G (2).

(1) For PowerKVM guests, all boot time memory is represented via
memory@ nodes and hot plugged/pluggable memory is represented via
ibm,dynamic-memory-reconfiguration property. We are marking all
hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
With this only cold plugged memory gets marked for PowerKVM but
need to check how this will affect PowerVM guests.

(2) To create separate mappings for every LMB in the hot plugged
region, we need lmb-size. I am currently using memory_block_size_bytes()
API to get the lmb-size. Since this is early init time code, the
machine type isn't probed yet and hence memory_block_size_bytes()
would return the default LMB size as 16MB. Hence we end up creating
separate mappings at much lower granularity than what we can ideally
do for pseries machine.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/kernel/prom.c  |  1 +
 arch/powerpc/mm/pgtable-radix.c | 17 ++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f830562..24ecf53 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned 
long node)
size = 0x8000ul - base;
}
memblock_add(base, size);
+   memblock_mark_hotplug(base, size);
} while (--rngs);
}
memblock_dump_all();
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 671a45d..180d25a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -255,15 +256,25 @@ static void __init radix_init_pgtable(void)
 {
unsigned long rts_field;
struct memblock_region *reg;
+   phys_addr_t addr;
+   u64 lmb_size = memory_block_size_bytes();
 
/* We don't support slb for radix */
mmu_slb_size = 0;
/*
 * Create the linear mapping, using standard page size for now
 */
-   for_each_memblock(memory, reg)
-   WARN_ON(create_physical_mapping(reg->base,
-   reg->base + reg->size));
+   for_each_memblock(memory, reg) {
+   if (memblock_is_hotpluggable(reg)) {
+   for (addr = reg->base; addr < (reg->base + reg->size);
+   addr += lmb_size)
+   WARN_ON(create_physical_mapping(addr,
+   addr + lmb_size));
+   } else {
+   WARN_ON(create_physical_mapping(reg->base,
+   reg->base + reg->size));
+   }
+   }
 
/* Find out how many PID bits are supported */
if (cpu_has_feature(CPU_FTR_HVMODE)) {
-- 
2.7.4



Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count

2017-08-10 Thread Laurent Dufour
On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
>> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
>>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
 The VMA sequence count has been introduced to allow fast detection of
 VMA modification when running a page fault handler without holding
 the mmap_sem.

 This patch provides protection agains the VMA modification done in :
- madvise()
- mremap()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
>>>
>>> I don't thinks it's anywhere near complete list of places where we touch
>>> vm_flags. What is your plan for the rest?
>>
>> The goal is only to protect places where change to the VMA is impacting the
>> page fault handling. If you think I missed one, please advise.
> 
> That's very fragile approach. We rely here too much on specific compiler 
> behaviour.
> 
> Any write access to vm_flags can, in theory, be translated to several
> write accesses. For instance with setting vm_flags to 0 in the middle,
> which would result in sigfault on page fault to the vma.

Indeed, just setting vm_flags to 0 will not result in sigfault, the real
job is done when the pte are updated and the bits allowing access are
cleared. Access to the pte is controlled by the pte lock.
Page fault handler is triggered based on the pte bits, not the content of
vm_flags and the speculative page fault is checking for the vma again once
the pte lock is held. So there is no concurrency when dealing with the pte
bits.

Regarding the compiler behaviour, there are memory barriers and locking
which should prevent that.

Thanks,
Laurent.



Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread Cédric Le Goater
On 08/10/2017 07:54 AM, David Gibson wrote:
> On Thu, Aug 10, 2017 at 02:46:00PM +1000, Benjamin Herrenschmidt wrote:
>> On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote:
>>>
>>> Also, will POWER9 always have doorbells?  In which case you could
>>> reduce it to 3 options.
>>
>> The problem with doorbells on POWER9 guests is that they may have
>> to trap and be emulated by the hypervisor, since the guest threads
>> on P9 don't have to match the HW threads of the core.
>>
>> Thus it's quite possible that using XIVE for IPIs is actually faster
>> than doorbells in that case.
> 
> Ok, well unless there's a compelling reason to use doorbells you can
> instead make it 3 cases with POWER9 never using doorbells.

yes I suppose we could improve things for XIVE. For XICS, there are 
another 3 backends behind icp_ops->cause_ipi. I need to take a look.

Thanks,

C. 



Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread Cédric Le Goater
 +static void xive_spapr_update_pending(struct xive_cpu *xc)
 +{
 +  u8 nsr, cppr;
 +  u16 ack;
 +
 +  /* Perform the acknowledge hypervisor to register cycle */
 +  ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG));
>>>
>>> Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of
>>> the higher level IO helpers?
>>
>> This is one of the many ways to do MMIOs on the TIMA. This memory 
>> region defines a set of offsets and sizes for which loads and 
>> stores have different effects. 
>>
>> See the arch/powerpc/include/asm/xive-regs.h file and 
>> arch/powerpc/kvm/book3s_xive_template.c for some more usage.
> 
> Sure, much like any IO region.  My point is, why do you want this kind
> of complex combo, rather than say an in_be16() or readw_be().
> 

The code is inherited from the native backend. 

I think this is because we know what we are doing and we skip 
the synchronization routines of the higher level IO helpers. 
That might not be necessary for sPAPR. Ben ? 

Thanks,

C.   


Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread Cédric Le Goater
On 08/10/2017 06:46 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote:
>>
>> Also, will POWER9 always have doorbells?  In which case you could
>> reduce it to 3 options.
> 
> The problem with doorbells on POWER9 guests is that they may have
> to trap and be emulated by the hypervisor, since the guest threads
> on P9 don't have to match the HW threads of the core.

Well, the pseries cause_ipi() handler does :

if (doorbell_try_core_ipi(cpu))
return;

to limit the doorbells to the same core. So we should be fine ? If not
I suppose we should check CPU_FTR_ARCH_300 and use IPIs only for XIVE.

> Thus it's quite possible that using XIVE for IPIs is actually faster
> than doorbells in that case.

How can we measure that ? ebizzy may be.

C.


Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9

2017-08-10 Thread Gautham R Shenoy
On Tue, Aug 08, 2017 at 10:42:57PM +1000, Nicholas Piggin wrote:
> On Tue, 8 Aug 2017 16:06:43 +0530
> Gautham R Shenoy  wrote:
> 
> > Hi Nicholas,
> > 
> > On Sun, Aug 06, 2017 at 03:02:38AM +1000, Nicholas Piggin wrote:
> > > POWER9 CPUs have independent MMU contexts per thread so KVM
> > > does not have to bring sibling threads into real-mode when
> > > switching MMU mode to guest. This can simplify POWER9 sleep/wake
> > > paths and avoids hwsyncs.
> > > @@ -2858,11 +2883,13 @@ static noinline void kvmppc_run_core(struct 
> > > kvmppc_vcore *vc)
> > > 
> > >   /* Let secondaries go back to the offline loop */
> > >   for (i = 0; i < controlled_threads; ++i) {
> > > - kvmppc_release_hwthread(pcpu + i);
> > >   if (sip && sip->napped[i])
> > >   kvmppc_ipi_thread(pcpu + i);
> > >   cpumask_clear_cpu(pcpu + i, >kvm->arch.cpu_in_guest);
> > >   }  
> > 
> > We are sending an IPI to the thread that has exited the guest and is
> > currently napping. The IPI wakes it up so that it can executes
> > offline loop. But we haven't released the hwthread yet, which means
> > that hwthread_req for this thread is still set.
> > 
> > The thread wakes up from nap, executes the pnv_powersave_wakeup code
> > where it can enter kvm_start_guest. Is this a legitimate race or am I
> > missing something?
> 
> Oh I think it's just a silly mistake in my patch, good catch.

Ah,np!

> Would moving this loop below the one below solve it? I wasn't
> completely happy with uglifying these loops by making the
> primary release different than secondary... maybe I will just
> move the difference into kvmppc_release_hwthread and which is
> less intrusive to callers.

I think moving it to kvmppc_release_hwthread is a good idea.

> 
> Thanks,
> Nick
> 

--
Thanks and Regards
gautham.



Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-10 Thread David Gibson
On Thu, Aug 10, 2017 at 02:46:00PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote:
> > 
> > Also, will POWER9 always have doorbells?  In which case you could
> > reduce it to 3 options.
> 
> The problem with doorbells on POWER9 guests is that they may have
> to trap and be emulated by the hypervisor, since the guest threads
> on P9 don't have to match the HW threads of the core.
> 
> Thus it's quite possible that using XIVE for IPIs is actually faster
> than doorbells in that case.

Ok, well unless there's a compelling reason to use doorbells you can
instead make it 3 cases with POWER9 never using doorbells.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature