Re: [RFC v7 02/25] powerpc: track allocation status of all pkeys
Thiago Jung Bauermannwrites: > Ram Pai writes: >> static inline void pkey_initialize(void) >> { >> +int os_reserved, i; >> + >> /* disable the pkey system till everything >> * is in place. A patch further down the >> * line will enable it. >> */ >> pkey_inited = false; >> + >> +/* Lets assume 32 keys */ >> +pkeys_total = 32; >> + >> +#ifdef CONFIG_PPC_4K_PAGES >> +/* >> + * the OS can manage only 8 pkeys >> + * due to its inability to represent >> + * them in the linux 4K-PTE. >> + */ >> +os_reserved = pkeys_total-8; >> +#else >> +os_reserved = 0; >> +#endif >> +/* >> + * Bits are in LE format. >> + * NOTE: 1, 0 are reserved. >> + * key 0 is the default key, which allows read/write/execute. >> + * key 1 is recommended not to be used. >> + * PowerISA(3.0) page 1015, programming note. >> + */ >> +initial_allocation_mask = ~0x0; >> +for (i = 2; i < (pkeys_total - os_reserved); i++) >> +initial_allocation_mask &= ~(0x1<> } >> #endif /*_ASM_PPC64_PKEYS_H */ > > In v6, key 31 was also reserved, but it's not in this version. Is this > intentional? That whole thing could be replaced with two constants. Except it can't, because we can't just hard code the number of keys. It needs to come either from the device tree or be based on the CPU we're running on. > Isn't it better for this function to be in pkeys.c? Ideally, functions > should be in .c files not in headers unless they're very small or > performance sensitive IMHO. Yes. No reason for that to be in a header AFAICS. cheers
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On Thu, Aug 10, 2017 at 09:36:04PM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2017-08-10 at 09:19 +0200, Cédric Le Goater wrote: > > > > > > + /* Perform the acknowledge hypervisor to register cycle */ > > > > > > + ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG)); > > > > > > > > > > Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of > > > > > the higher level IO helpers? > > > > > > > > This is one of the many ways to do MMIOs on the TIMA. This memory > > > > region defines a set of offsets and sizes for which loads and > > > > stores have different effects. > > > > > > > > See the arch/powerpc/include/asm/xive-regs.h file and > > > > arch/powerpc/kvm/book3s_xive_template.c for some more usage. > > > > > > Sure, much like any IO region. My point is, why do you want this kind > > > of complex combo, rather than say an in_be16() or readw_be(). > > > > > > > The code is inherited from the native backend. > > > > I think this is because we know what we are doing and we skip > > the synchronization routines of the higher level IO helpers. > > That might not be necessary for sPAPR. Ben ? > > It's a performance optimisation, we know we don't need the > sync,twi,isync crap of the higher level accessor here. Ok. A comment mentioning this would be good - unless you're very familiar with the code and hardware it's not going to be obvious if the nonstandard IO accessors are for legitimate performance reasons, or just cargo-culting. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
[PATCH net-next] fsl/fman: implement several errata workarounds
Implemented workarounds for the following dTSEC Erratum: A002, A004, A0012, A0014, A004839 on several operations that involve MAC CFG register changes: adjust link, rx pause frames, modify MAC address. Signed-off-by: Florinel Iordache--- drivers/net/ethernet/freescale/fman/fman_dtsec.c | 118 ++- 1 file changed, 93 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.c b/drivers/net/ethernet/freescale/fman/fman_dtsec.c index 98bba10..ea43b49 100644 --- a/drivers/net/ethernet/freescale/fman/fman_dtsec.c +++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.c @@ -123,7 +123,7 @@ #define DTSEC_ECNTRL_R100M 0x0008 #define DTSEC_ECNTRL_QSGMIIM 0x0001 -#define DTSEC_TCTRL_GTS0x0020 +#define TCTRL_GTS 0x0020 #define RCTRL_PAL_MASK 0x001f #define RCTRL_PAL_SHIFT16 @@ -863,6 +863,52 @@ int dtsec_cfg_pad_and_crc(struct fman_mac *dtsec, bool new_val) return 0; } +static void graceful_start(struct fman_mac *dtsec, enum comm_mode mode) +{ + struct dtsec_regs __iomem *regs = dtsec->regs; + + if (mode & COMM_MODE_TX) + iowrite32be(ioread32be(>tctrl) & + ~TCTRL_GTS, >tctrl); + if (mode & COMM_MODE_RX) + iowrite32be(ioread32be(>rctrl) & + ~RCTRL_GRS, >rctrl); +} + +static void graceful_stop(struct fman_mac *dtsec, enum comm_mode mode) +{ + struct dtsec_regs __iomem *regs = dtsec->regs; + u32 tmp; + + /* Graceful stop - Assert the graceful Rx stop bit */ + if (mode & COMM_MODE_RX) { + tmp = ioread32be(>rctrl) | RCTRL_GRS; + iowrite32be(tmp, >rctrl); + + if (dtsec->fm_rev_info.major == 2) { + /* Workaround for dTSEC Errata A002 */ + usleep_range(100, 200); + } else { + /* Workaround for dTSEC Errata A004839 */ + usleep_range(10, 50); + } + } + + /* Graceful stop - Assert the graceful Tx stop bit */ + if (mode & COMM_MODE_TX) { + if (dtsec->fm_rev_info.major == 2) { + /* dTSEC Errata A004: Do not use TCTRL[GTS]=1 */ + pr_debug("GTS not supported due to DTSEC_A004 Errata.\n"); + } else { + tmp = ioread32be(>tctrl) | TCTRL_GTS; + iowrite32be(tmp, >tctrl); + + /* Workaround for dTSEC Errata A0012, A0014 */ + usleep_range(10, 50); + } + } +} + int dtsec_enable(struct fman_mac *dtsec, enum comm_mode mode) { struct dtsec_regs __iomem *regs = dtsec->regs; @@ -880,13 +926,8 @@ int dtsec_enable(struct fman_mac *dtsec, enum comm_mode mode) iowrite32be(tmp, >maccfg1); - /* Graceful start - clear the graceful receive stop bit */ - if (mode & COMM_MODE_TX) - iowrite32be(ioread32be(>tctrl) & ~DTSEC_TCTRL_GTS, - >tctrl); - if (mode & COMM_MODE_RX) - iowrite32be(ioread32be(>rctrl) & ~RCTRL_GRS, - >rctrl); + /* Graceful start - clear the graceful Rx/Tx stop bit */ + graceful_start(dtsec, mode); return 0; } @@ -899,23 +940,8 @@ int dtsec_disable(struct fman_mac *dtsec, enum comm_mode mode) if (!is_init_done(dtsec->dtsec_drv_param)) return -EINVAL; - /* Gracefull stop - Assert the graceful transmit stop bit */ - if (mode & COMM_MODE_RX) { - tmp = ioread32be(>rctrl) | RCTRL_GRS; - iowrite32be(tmp, >rctrl); - - if (dtsec->fm_rev_info.major == 2) - usleep_range(100, 200); - else - udelay(10); - } - - if (mode & COMM_MODE_TX) { - if (dtsec->fm_rev_info.major == 2) - pr_debug("GTS not supported due to DTSEC_A004 errata.\n"); - else - pr_debug("GTS not supported due to DTSEC_A0014 errata.\n"); - } + /* Graceful stop - Assert the graceful Rx/Tx stop bit */ + graceful_stop(dtsec, mode); tmp = ioread32be(>maccfg1); if (mode & COMM_MODE_RX) @@ -933,11 +959,19 @@ int dtsec_set_tx_pause_frames(struct fman_mac *dtsec, u16 pause_time, u16 __maybe_unused thresh_time) { struct dtsec_regs __iomem *regs = dtsec->regs; + enum comm_mode mode = COMM_MODE_NONE; u32 ptv = 0; if (!is_init_done(dtsec->dtsec_drv_param)) return -EINVAL; + if ((ioread32be(>rctrl) & RCTRL_GRS) == 0) + mode |= COMM_MODE_RX; + if ((ioread32be(>tctrl) & TCTRL_GTS) == 0) +
Re: [RFC v7 25/25] powerpc: Enable pkey subsystem
Ram Paiwrites: > --- a/arch/powerpc/include/asm/cputable.h > +++ b/arch/powerpc/include/asm/cputable.h > @@ -214,6 +214,7 @@ enum { > #define CPU_FTR_DAWR LONG_ASM_CONST(0x0400) > #define CPU_FTR_DABRX > LONG_ASM_CONST(0x0800) > #define CPU_FTR_PMAO_BUG LONG_ASM_CONST(0x1000) > +#define CPU_FTR_PKEY LONG_ASM_CONST(0x2000) > #define CPU_FTR_POWER9_DD1 LONG_ASM_CONST(0x4000) > > #ifndef __ASSEMBLY__ > @@ -452,7 +453,7 @@ enum { > CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \ > CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ > CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | \ > - CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX) > + CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX | CPU_FTR_PKEY) P7 supports protection keys for data access (AMR) but not for instruction access (IAMR), right? There's nothing in the code making this distinction, so either CPU_FTR_PKEY shouldn't be enabled in P7 or separate feature bits for AMR and IAMR should be used and checked before trying to access the IAMR. > #define CPU_FTRS_POWER8 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ > CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\ > CPU_FTR_MMCRA | CPU_FTR_SMT | \ > @@ -462,7 +463,7 @@ enum { > CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ > CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \ > CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \ > - CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP) > + CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_PKEY) > #define CPU_FTRS_POWER8E (CPU_FTRS_POWER8 | CPU_FTR_PMAO_BUG) > #define CPU_FTRS_POWER8_DD1 (CPU_FTRS_POWER8 & ~CPU_FTR_DBELL) > #define CPU_FTRS_POWER9 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ > @@ -474,7 +475,8 @@ enum { > CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ > CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \ > CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \ > - CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300) > + CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | \ > + CPU_FTR_PKEY) > #define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \ >(~CPU_FTR_SAO)) > #define CPU_FTRS_CELL(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ > diff --git a/arch/powerpc/include/asm/mmu_context.h > b/arch/powerpc/include/asm/mmu_context.h > index a1cfcca..acd59d8 100644 > --- a/arch/powerpc/include/asm/mmu_context.h > +++ b/arch/powerpc/include/asm/mmu_context.h > @@ -188,6 +188,7 @@ static inline bool arch_vma_access_permitted(struct > vm_area_struct *vma, > > #define pkey_initialize() > #define pkey_mm_init(mm) > +#define pkey_mmu_values(total_data, total_execute) > > static inline int vma_pkey(struct vm_area_struct *vma) > { > diff --git a/arch/powerpc/include/asm/pkeys.h > b/arch/powerpc/include/asm/pkeys.h > index ba7bff6..e61ed6c 100644 > --- a/arch/powerpc/include/asm/pkeys.h > +++ b/arch/powerpc/include/asm/pkeys.h > @@ -1,6 +1,8 @@ > #ifndef _ASM_PPC64_PKEYS_H > #define _ASM_PPC64_PKEYS_H > > +#include > + > extern bool pkey_inited; > extern int pkeys_total; /* total pkeys as per device tree */ > extern u32 initial_allocation_mask;/* bits set for reserved keys */ > @@ -227,6 +229,24 @@ static inline void pkey_mm_init(struct mm_struct *mm) > mm->context.execute_only_pkey = -1; > } > > +static inline void pkey_mmu_values(int total_data, int total_execute) > +{ > + /* > + * since any pkey can be used for data or execute, we > + * will just treat all keys as equal and track them > + * as one entity. > + */ > + pkeys_total = total_data + total_execute; > +} Right now this works because the firmware reports 0 execute keys in the device tree, but if (when?) it is fixed to report 32 execute keys as well as 32 data keys (which are the same keys), any place using pkeys_total expecting it to mean the number of keys that are available will be broken. This includes pkey_initialize and mm_pkey_is_allocated. Perhaps pkeys_total should use total_data as the number of keys supported in the system, and total_execute just as a flag to say whether there's a IAMR? Or, since P8 and later have IAMR and P7 is unlikely to have the firmware fixed, maybe the kernel should just ignore total_execute altogether? > +static inline bool pkey_mmu_enabled(void) > +{ > + if (firmware_has_feature(FW_FEATURE_LPAR)) > + return pkeys_total; > + else > + return cpu_has_feature(CPU_FTR_PKEY); > +} > + > static inline void pkey_initialize(void) > { > int os_reserved, i; > @@ -236,9 +256,17 @@ static inline void pkey_initialize(void) >* line will
Re: [RFC v7 24/25] powerpc: Deliver SEGV signal on pkey violation
Ram Paiwrites: > The value of the AMR register at the time of exception > is made available in gp_regs[PT_AMR] of the siginfo. > > The value of the pkey, whose protection got violated, > is made available in si_pkey field of the siginfo structure. Should the IAMR also be made available? Also, should the AMR and IAMR be accesible to userspace (e.g., to GDB) via ptrace and the core file? > --- a/arch/powerpc/kernel/signal_32.c > +++ b/arch/powerpc/kernel/signal_32.c > @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct > mcontext __user *frame, > (unsigned long) >tramp[2]); > } > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (__put_user(get_paca()->paca_amr, >mc_gregs[PT_AMR])) > + return 1; > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > + > return 0; > } frame->mc_gregs[PT_AMR] has 32 bits, but paca_amr has 64 bits. Does this work as intended? > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index c83c115..86a4262 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user > *sc, > if (set != NULL) > err |= __put_user(set->sig[0], >oldmask); > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + err |= __put_user(get_paca()->paca_amr, >gp_regs[PT_AMR]); > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > + > return err; > } Isn't a corresponding change needed in restore_sigcontext? And in the corresponding TM versions setup_tm_sigcontexts and restore_tm_sigcontexts? Ditto for the equivalent functions in signal_32.c. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [RFC v7 09/25] powerpc: store and restore the pkey state across context switches
Ram Paiwrites: > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -42,6 +42,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -1096,6 +1097,13 @@ static inline void save_sprs(struct thread_struct *t) > t->tar = mfspr(SPRN_TAR); > } > #endif > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (arch_pkeys_enabled()) { > + t->amr = mfspr(SPRN_AMR); > + t->iamr = mfspr(SPRN_IAMR); > + t->uamor = mfspr(SPRN_UAMOR); > + } > +#endif > } Is it worth having a flag in thread_struct saying whether it has every called pkey_alloc and only do the mfsprs if it did? > @@ -1131,6 +1139,16 @@ static inline void restore_sprs(struct thread_struct > *old_thread, > mtspr(SPRN_TAR, new_thread->tar); > } > #endif > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (arch_pkeys_enabled()) { > + if (old_thread->amr != new_thread->amr) > + mtspr(SPRN_AMR, new_thread->amr); > + if (old_thread->iamr != new_thread->iamr) > + mtspr(SPRN_IAMR, new_thread->iamr); > + if (old_thread->uamor != new_thread->uamor) > + mtspr(SPRN_UAMOR, new_thread->uamor); > + } > +#endif > } > > struct task_struct *__switch_to(struct task_struct *prev, > @@ -1689,6 +1707,13 @@ void start_thread(struct pt_regs *regs, unsigned long > start, unsigned long sp) > current->thread.tm_tfiar = 0; > current->thread.load_tm = 0; > #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > + if (arch_pkeys_enabled()) { > + current->thread.amr = 0x0ul; > + current->thread.iamr = 0x0ul; > + current->thread.uamor = 0x0ul; > + } > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > } > EXPORT_SYMBOL(start_thread); -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest
On Thu, Aug 10, 2017 at 11:50:19AM -0500, Reza Arbab wrote: On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote: diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index f830562..24ecf53 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node) size = 0x8000ul - base; } memblock_add(base, size); + memblock_mark_hotplug(base, size); } while (--rngs); } memblock_dump_all(); Doing this has the effect of putting all the affected memory into ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no kernel allocations can occur there. Is that okay? I should clarify. The behavior change I mention applies when movable_node_is_enabled(). -- Reza Arbab
Re: [PATCH] powerpc/64s: Add support for ASB_Notify on POWER9
Le 07/08/2017 à 05:08, Michael Ellerman a écrit : Hi Christophe, I'm not across any of the details of this so hopefully most of these comments aren't too stupid :) Christophe Lombardwrites: The POWER9 core supports a new feature: ASB_Notify which requires the support of the Special Purpose Register: TIDR. TIDR is defined in ISA 3.0B, which would be worth mentioning. Can you spell out what an ASB_Notifiy is. In addition to waking a thread up from the wait state via an interrupt, the thread can also be removed from the wait state via a ASB_notify command. We see the acronyme ASB present in several documents, but I never saw the definition of ASB. The ASB_Notify command, generated by the AFU, will attempt to And please tell us what an AFU is, this patch and the code it touches are not obviously CAPI related, so the reader may not know what an AFU is, or the broader context. You are right, this patch is not obviously CAPI related. The Coherent Accelerator Process Interface (CAPI) is a general term for the infrastructure of attaching a coherent accelerator (CAPI card) to a POWER system. The AFU (Accelerator Function Unit), located on the CAPI card, executes computation-heavy functions helping the application running on the host processor. There is a direct communication between the application and the accelerator. The purpose of an AFU is to provide applications with a higher computational unit to improve the performance of the application and off-load the host processor. wake-up the host thread identified by the particular LPID:PID:TID. Host implies it doesn't work for threads in guests, but then LPID implies it does. You say "PID" a few times here, but I think you always mean PIDR? It would be worth spelling that out, otherwise people will assume you're talking about the Linux "pid" (which is actually the thread group id (TGID)), which is not the same as PIDR. You are completely right. The use of the term 'PIDR' is much more accurate. The special register TIDR has to be updated to with the same value present in the process element. What's a process element? :) An other CAPI term :-) When an application (running on the host) requests use of an AFU, a process element is added to the process-element linked list that describes the application’s process state. The process element also contains a work element descriptor (WED) provided by the application. The WED contains the full description of the job to be performed by the AFU. If the length of the register TIDR is 64bits, the CAPI Translation Is it 64-bits? The architecture says it is. yes. The length is 64-bits. Service Layer core (XSL9) for Power9 systems limits the size (16bits) of the Thread ID when it generates the ASB_Notify message adding PID:LPID:TID information from the context. When you say "Thread ID" here I think you're referring to the "TID" in the "PID:LPID:TID" tuple, and you're saying it is limited to 16-bits, by the P9 CAPI hardware. correct. The content of the internal kernel Thread ID (32bits) can not therefore "as returned by sys_gettid()" would help to disambiguate here. be used to fulfill the register TIDR. .. if you're intention is to use TIDR with CAPI. I'm assuming here that by "kernel Thread ID" you mean the kernel "pid" which is returned to userspace as the "TID" by gettid(). correct. That value is global in the system, so it's overkill for this usage, as all you need is a value unique to all threads that share a PIDR (== mm_struct). So it seems like we could also solve this by simply having a counter in the mm_struct (actually mm_context_t), and handing that out when userspace tries to read TIDR. Or we could do the same and have some other API for reading it, ie. not using mfspr() but an actual CAPI API. Or does userspace even need the raw value? Your remarks are very interesting. I need more time to look into that. Is there a usecase where multiple threads would assign themselves the same TIDR so that all (some? one?) of them is woken up by the ASB_notify? Or is that an error? Normally no. Only one thread can have a non zero TIDR value. This patch allows to avoid this limitation by adding a new interface This doesn't avoid the 16-bit CAPI limitation AFAICS. What it does is let (force) userspace manage the values of TIDR, and therefore it can control them such that they fit in 16-bits. right. The control will be done by libcxl. Libcxl is the focal point to manage the values of TIDR. for the user. The instructions mfspr/mtspr SPRN_TIDR are emulated, save/restore SPRs (context switch) are updated and a new feature (CPU_FTR_TIDR) is added to POWER9 system. I'm not clear if we need a CPU feature. TIDR is defined in ISA 3.0B, so anything that implements ISA 3.0B must implemented TIDR. It's not entirely clear if the kernel's CPU_FTR_ARCH_300 means 3.0 or 3.0B, but given 3.0B is essentially "what got implemented in P9" it should probably be
Re: [RFC v7 02/25] powerpc: track allocation status of all pkeys
Ram Paiwrites: > static inline void pkey_initialize(void) > { > + int os_reserved, i; > + > /* disable the pkey system till everything >* is in place. A patch further down the >* line will enable it. >*/ > pkey_inited = false; > + > + /* Lets assume 32 keys */ > + pkeys_total = 32; > + > +#ifdef CONFIG_PPC_4K_PAGES > + /* > + * the OS can manage only 8 pkeys > + * due to its inability to represent > + * them in the linux 4K-PTE. > + */ > + os_reserved = pkeys_total-8; > +#else > + os_reserved = 0; > +#endif > + /* > + * Bits are in LE format. > + * NOTE: 1, 0 are reserved. > + * key 0 is the default key, which allows read/write/execute. > + * key 1 is recommended not to be used. > + * PowerISA(3.0) page 1015, programming note. > + */ > + initial_allocation_mask = ~0x0; > + for (i = 2; i < (pkeys_total - os_reserved); i++) > + initial_allocation_mask &= ~(0x1< } > #endif /*_ASM_PPC64_PKEYS_H */ In v6, key 31 was also reserved, but it's not in this version. Is this intentional? Isn't it better for this function to be in pkeys.c? Ideally, functions should be in .c files not in headers unless they're very small or performance sensitive IMHO. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
On 10/08/2017 15:43, Kirill A. Shutemov wrote: > On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote: >> On 10/08/2017 02:58, Kirill A. Shutemov wrote: >>> On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote: On 09/08/2017 12:12, Kirill A. Shutemov wrote: > On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote: >> The VMA sequence count has been introduced to allow fast detection of >> VMA modification when running a page fault handler without holding >> the mmap_sem. >> >> This patch provides protection agains the VMA modification done in : >> - madvise() >> - mremap() >> - mpol_rebind_policy() >> - vma_replace_policy() >> - change_prot_numa() >> - mlock(), munlock() >> - mprotect() >> - mmap_region() >> - collapse_huge_page() > > I don't thinks it's anywhere near complete list of places where we touch > vm_flags. What is your plan for the rest? The goal is only to protect places where change to the VMA is impacting the page fault handling. If you think I missed one, please advise. >>> >>> That's very fragile approach. We rely here too much on specific compiler >>> behaviour. >>> >>> Any write access to vm_flags can, in theory, be translated to several >>> write accesses. For instance with setting vm_flags to 0 in the middle, >>> which would result in sigfault on page fault to the vma. >> >> Indeed, just setting vm_flags to 0 will not result in sigfault, the real >> job is done when the pte are updated and the bits allowing access are >> cleared. Access to the pte is controlled by the pte lock. >> Page fault handler is triggered based on the pte bits, not the content of >> vm_flags and the speculative page fault is checking for the vma again once >> the pte lock is held. So there is no concurrency when dealing with the pte >> bits. > > Suppose we are getting page fault to readable VMA, pte is clear at the > time of page fault. In this case we need to consult vm_flags to check if > the vma is read-accessible. > > If by the time of check vm_flags happend to be '0' we would get SIGSEGV as > the vma appears to be non-readable. > > Where is my logic faulty? The speculative page fault handler will not deliver the signal, if the page fault can't be done in the speculative path for instance because the vm_flags are not matching the required one, the speculative page fault is aborted and the *classic* page fault handler is run which will do the job again grabbing the mmap_sem. >> Regarding the compiler behaviour, there are memory barriers and locking >> which should prevent that. > > Which locks barriers are you talking about? When the VMA is modified and that the changes will impact the speculative page fault handler the sequence count is touch using write_seqcount_begin() and write_seqcount_end(). These 2 services contains calls to smp_wmb(). On the speculative path side, the calls to *_read_seqcount() contains also memory barriers calls. > We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere. I don't think READ_ONCE/WRITE_ONCE would help here, as they would not prevent reading transcient state as the vm_flags example you mentioned. That said, there are not so much VMA's fields used in the SPF's path and caching them into the vmf structure under the control of the VMA's sequence count would solve this. I'll try to move in that direction unless anyone has a better idea. Cheers, Laurent.
Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest
On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote: diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index f830562..24ecf53 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node) size = 0x8000ul - base; } memblock_add(base, size); + memblock_mark_hotplug(base, size); } while (--rngs); } memblock_dump_all(); Doing this has the effect of putting all the affected memory into ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no kernel allocations can occur there. Is that okay? diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c index 671a45d..180d25a 100644 --- a/arch/powerpc/mm/pgtable-radix.c +++ b/arch/powerpc/mm/pgtable-radix.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -255,15 +256,25 @@ static void __init radix_init_pgtable(void) { unsigned long rts_field; struct memblock_region *reg; + phys_addr_t addr; + u64 lmb_size = memory_block_size_bytes(); /* We don't support slb for radix */ mmu_slb_size = 0; /* * Create the linear mapping, using standard page size for now */ - for_each_memblock(memory, reg) - WARN_ON(create_physical_mapping(reg->base, - reg->base + reg->size)); + for_each_memblock(memory, reg) { + if (memblock_is_hotpluggable(reg)) { + for (addr = reg->base; addr < (reg->base + reg->size); + addr += lmb_size) + WARN_ON(create_physical_mapping(addr, + addr + lmb_size)); + } else { + WARN_ON(create_physical_mapping(reg->base, + reg->base + reg->size)); + } + } /* Find out how many PID bits are supported */ if (cpu_has_feature(CPU_FTR_HVMODE)) { -- 2.7.4 -- Reza Arbab
Re: [PATCH] powerpc/pseries: Check memory device state before onlining/offlining
On 02/08/2017 20:03, Nathan Fontenot wrote: > When DLPAR adding or removing memory we need to check the device > offline status before trying to online/offline the memory. This is > needed because calls device_online() and device_offline() will return > non-zero for memory that is already online and offline respectively. > > This update resolves two scenarios. First, for kernel built with > auto-online memory enabled, memory will be onlined as part of calls > to add_memory(). After adding the memory the pseries dlpar code tries > to online it and fails since the memory is already online. The dlpar > code then tries to remove the memory which produces the oops message > below because the memory is not offline. > > The second scenario occurs when removing memory that is already offline, > i.e. marking memory offline (via sysfs) and the trying to remove that > memory. This doesn't work because offlining the already offline memory > does not succeed and the dlpar code then fails the dlpar remove operation. > > The fix for both scenarios is to check the device.offline status before > making the calls to device_online() or device_offline(). > > kernel BUG at mm/memory_hotplug.c:2189! > Oops: Exception in kernel mode, sig: 5 [#1] > SMP NR_CPUS=2048 > NUMA > pSeries > CPU: 0 PID: 5 Comm: kworker/u129:0 Not tainted 4.12.0-rc3 #272 > Workqueue: pseries hotplug workque .pseries_hp_work_fn > task: c003f9c89200 task.stack: c003f9d1 > NIP: c02ca428 LR: c02ca3cc CTR: c0ba16a0 > REGS: c003f9d13630 TRAP: 0700 Not tainted (4.12.0-rc3) > MSR: 8282b032> CR: 22002024 XER: 000a > CFAR: c02ca3d0 SOFTE: 1 > GPR00: c02ca3cc c003f9d138b0 c1bb0200 0001 > GPR04: c003fb143c80 c003fef21630 0003 0002 > GPR08: 0003 0003 0003 31b1 > GPR12: 28002042 cfd8 c0118ae0 c003fb170180 > GPR16: 0004 0010 c00379c8 > GPR20: c0037b68 c003f728ff84 0002 0010 > GPR24: 0002 c003f728ff80 0002 0001 > GPR28: c003fb143c38 0002 1000 2000 > NIP [c02ca428] .remove_memory+0xb8/0xc0 > LR [c02ca3cc] .remove_memory+0x5c/0xc0 > Call Trace: > [c003f9d138b0] [c02ca3cc] .remove_memory+0x5c/0xc0 (unreliable) > [c003f9d13940] [c00938a4] .dlpar_add_lmb+0x384/0x400 > [c003f9d13a30] [c009456c] .dlpar_memory+0x5dc/0xca0 > [c003f9d13af0] [c008ce84] .handle_dlpar_errorlog+0x74/0xe0 > [c003f9d13b70] [c008cf1c] .pseries_hp_work_fn+0x2c/0x90 > [c003f9d13bf0] [c0110a5c] .process_one_work+0x17c/0x460 > [c003f9d13c90] [c0110dc8] .worker_thread+0x88/0x500 > [c003f9d13d70] [c0118c3c] .kthread+0x15c/0x1a0 > [c003f9d13e30] [c000ba18] .ret_from_kernel_thread+0x58/0xc0 > Instruction dump: > 7fe3fb78 4bd7c845 6000 7fa3eb78 4bfdd3c9 38210090 e8010010 eba1ffe8 > ebc1fff0 ebe1fff8 7c0803a6 4bfdc2ac <0fe0> 7c0802a6 fb01ffc0 > > Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged > memory'") > Signed-off-by: Nathan Fontenot tested the first scenario with 4.13.0-rc4 and qemu 2.10.0-rc2. Tested-by: Laurent Vivier Reviewed-by: Laurent Vivier
Re: [PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()
On 08/10/2017 07:20 AM, Michael Ellerman wrote: Khalid Azizwrites: A protection flag may not be valid across entire address space and hence arch_validate_prot() might need the address a protection bit is being set on to ensure it is a valid protection flag. For example, sparc processors support memory corruption detection (as part of ADI feature) flag on memory addresses mapped on to physical RAM but not on PFN mapped pages or addresses mapped on to devices. This patch adds address to the parameters being passed to arch_validate_prot() so protection bits can be validated in the relevant context. Signed-off-by: Khalid Aziz Cc: Khalid Aziz --- v7: - new patch arch/powerpc/include/asm/mman.h | 2 +- arch/powerpc/kernel/syscalls.c | 2 +- include/linux/mman.h| 2 +- mm/mprotect.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h index 30922f699341..bc74074304a2 100644 --- a/arch/powerpc/include/asm/mman.h +++ b/arch/powerpc/include/asm/mman.h @@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot) return false; return true; } -#define arch_validate_prot(prot) arch_validate_prot(prot) +#define arch_validate_prot(prot, addr) arch_validate_prot(prot) This can be simpler, as just: #define arch_validate_prot arch_validate_prot Hi Michael, Thanks for reviewing! My patch expands parameter list for arch_validate_prot() from one to two parameters. Existing powerpc version of arch_validate_prot() is written with one parameter. If I use the above #define, compilation fails with: mm/mprotect.c: In function ‘do_mprotect_pkey’: mm/mprotect.c:399: error: too many arguments to function ‘arch_validate_prot’ Another way to solve it would be to add the new addr parameter to powerpc version of arch_validate_prot() but I chose the less disruptive solution of tackling it through #define and expanded the existing #define to include the new parameter. Make sense? Thanks, Khalid
Re: [PATCH 11/13] powerpc/64s: idle POWER9 can execute stop without ptesync
On Thu, 10 Aug 2017 23:15:50 +1000 Michael Ellermanwrote: > Nicholas Piggin writes: > > > Signed-off-by: Nicholas Piggin > > --- > > arch/powerpc/kernel/idle_book3s.S | 7 --- > > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 > > If you can split this into a KVM and non-KVM patch that would be > helpful. Yes I can do that.
Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9
On Thu, 10 Aug 2017 23:14:46 +1000 Michael Ellermanwrote: > Nicholas Piggin writes: > > > POWER9 CPUs have independent MMU contexts per thread so KVM > > does not have to bring sibling threads into real-mode when > > switching MMU mode to guest. This can simplify POWER9 sleep/wake > > paths and avoids hwsyncs. > > > > Signed-off-by: Nicholas Piggin > > --- > > arch/powerpc/include/asm/kvm_book3s_asm.h | 4 > > arch/powerpc/kernel/idle_book3s.S | 8 ++- > > arch/powerpc/kvm/book3s_hv.c | 37 > > ++- > > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++ > > This will need to go via, or at least be shared with Paul's tree. > > So if it's possible, splitting it out of this series would be easier. I agree it's really a KVM patch, but patch 12 depends on this, it is a Linux patch. Not sure how you want to handle that? Thanks, Nick
Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote: > On 10/08/2017 02:58, Kirill A. Shutemov wrote: > > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote: > >> On 09/08/2017 12:12, Kirill A. Shutemov wrote: > >>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote: > The VMA sequence count has been introduced to allow fast detection of > VMA modification when running a page fault handler without holding > the mmap_sem. > > This patch provides protection agains the VMA modification done in : > - madvise() > - mremap() > - mpol_rebind_policy() > - vma_replace_policy() > - change_prot_numa() > - mlock(), munlock() > - mprotect() > - mmap_region() > - collapse_huge_page() > >>> > >>> I don't thinks it's anywhere near complete list of places where we touch > >>> vm_flags. What is your plan for the rest? > >> > >> The goal is only to protect places where change to the VMA is impacting the > >> page fault handling. If you think I missed one, please advise. > > > > That's very fragile approach. We rely here too much on specific compiler > > behaviour. > > > > Any write access to vm_flags can, in theory, be translated to several > > write accesses. For instance with setting vm_flags to 0 in the middle, > > which would result in sigfault on page fault to the vma. > > Indeed, just setting vm_flags to 0 will not result in sigfault, the real > job is done when the pte are updated and the bits allowing access are > cleared. Access to the pte is controlled by the pte lock. > Page fault handler is triggered based on the pte bits, not the content of > vm_flags and the speculative page fault is checking for the vma again once > the pte lock is held. So there is no concurrency when dealing with the pte > bits. Suppose we are getting page fault to readable VMA, pte is clear at the time of page fault. In this case we need to consult vm_flags to check if the vma is read-accessible. If by the time of check vm_flags happend to be '0' we would get SIGSEGV as the vma appears to be non-readable. Where is my logic faulty? > Regarding the compiler behaviour, there are memory barriers and locking > which should prevent that. Which locks barriers are you talking about? We need at least READ_ONCE/WRITE_ONCE to access vm_flags everywhere. -- Kirill A. Shutemov
Re: [PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()
Khalid Azizwrites: > A protection flag may not be valid across entire address space and > hence arch_validate_prot() might need the address a protection bit is > being set on to ensure it is a valid protection flag. For example, sparc > processors support memory corruption detection (as part of ADI feature) > flag on memory addresses mapped on to physical RAM but not on PFN mapped > pages or addresses mapped on to devices. This patch adds address to the > parameters being passed to arch_validate_prot() so protection bits can > be validated in the relevant context. > > Signed-off-by: Khalid Aziz > Cc: Khalid Aziz > --- > v7: > - new patch > > arch/powerpc/include/asm/mman.h | 2 +- > arch/powerpc/kernel/syscalls.c | 2 +- > include/linux/mman.h| 2 +- > mm/mprotect.c | 2 +- > 4 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h > index 30922f699341..bc74074304a2 100644 > --- a/arch/powerpc/include/asm/mman.h > +++ b/arch/powerpc/include/asm/mman.h > @@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot) > return false; > return true; > } > -#define arch_validate_prot(prot) arch_validate_prot(prot) > +#define arch_validate_prot(prot, addr) arch_validate_prot(prot) This can be simpler, as just: #define arch_validate_prot arch_validate_prot cheers
Re: [PATCH 11/13] powerpc/64s: idle POWER9 can execute stop without ptesync
Nicholas Pigginwrites: > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/kernel/idle_book3s.S | 7 --- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 If you can split this into a KVM and non-KVM patch that would be helpful. cheers
Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9
Nicholas Pigginwrites: > POWER9 CPUs have independent MMU contexts per thread so KVM > does not have to bring sibling threads into real-mode when > switching MMU mode to guest. This can simplify POWER9 sleep/wake > paths and avoids hwsyncs. > > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/include/asm/kvm_book3s_asm.h | 4 > arch/powerpc/kernel/idle_book3s.S | 8 ++- > arch/powerpc/kvm/book3s_hv.c | 37 > ++- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++ This will need to go via, or at least be shared with Paul's tree. So if it's possible, splitting it out of this series would be easier. cheers
Re: [PATCH 08/13] powerpc/64s: irq replay remove spurious irq reason
Nicholas Pigginwrites: > On Sun, 06 Aug 2017 09:00:32 +1000 > Benjamin Herrenschmidt wrote: > >> On Sun, 2017-08-06 at 03:02 +1000, Nicholas Piggin wrote: >> > HVI interrupts have always used 0x500, so remove the dead branch. >> >> Maybe we should fix that and "catch" in incorrect entry via 0x500 >> which would mean the XIVE is trying to deliver guest irqs to the OS... > > I should be more clear, when I say 0x500, it is only in reference to > the constant used by the soft-irq replay. After patch 6 the replay is > sent to the 0xea0 common handler. > >> That can happen if some LPCR bits aren't set properly and/or KVM >> doesn't pull the guest in time. I had bugs like that in my early >> dev so I've been running with a b . at 0x500 for a while :-) > > So that's a separate issue of hardware actually doing a 0x500. I > had this > > http://patchwork.ozlabs.org/patch/750033/ Hmm, yeah I was going to merge that. But I got side tracked by HVICE, ie. the "interrupts don't go to 0x500" is only true if we set HVICE. Which we currently always do, in __setup_cpu_power9(), but then we have a DT CPU feature for it, so we really shouldn't be always enabling HVICE, we should leave it up to the DT CPU feature code. And then it becomes controlled by the device tree, which makes your patch potentially wrong depending on the device tree CPU features. Ugh. cheers
Re: [RFC Part1 PATCH v3 05/17] x86, realmode: Don't decrypt trampoline area under SEV
On 7/26/2017 11:03 AM, Borislav Petkov wrote: Subject: x86/realmode: ... Done. On Mon, Jul 24, 2017 at 02:07:45PM -0500, Brijesh Singh wrote: From: Tom LendackyWhen SEV is active the trampoline area will need to be in encrypted memory so only mark the area decrypted if SME is active. Signed-off-by: Tom Lendacky Signed-off-by: Brijesh Singh --- arch/x86/realmode/init.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 1f71980..c7eeca7 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -63,9 +63,11 @@ static void __init setup_real_mode(void) /* * If SME is active, the trampoline area will need to be in * decrypted memory in order to bring up other processors -* successfully. +* successfully. For SEV the trampoline area needs to be in +* encrypted memory, so only do this for SME. Or simply say: "It is not needed for SEV." Will do. Thanks, Tom
Re: [1/6] powerpc: NMI IPI improve lock primitive
On Wed, 2017-08-09 at 12:41:21 UTC, Nicholas Piggin wrote: > When the NMI IPI lock is contended, spin at low SMT priority, using > loads only, and with interrupts enabled (where possible). This > improves behaviour under high contention (e.g., a system crash when > a number of CPUs are trying to enter the debugger). > > Signed-off-by: Nicholas PigginSeries applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/0459ddfdb31e7d812b555a2530ecba cheers
Re: powerpc/configs: Re-enable HARD/SOFT lockup detectors
On Wed, 2017-08-09 at 10:57:55 UTC, Michael Ellerman wrote: > In commit 05a4a9527931 ("kernel/watchdog: split up config options"), > CONFIG_LOCKUP_DETECTOR was split into two separate config options, > CONFIG_HARDLOCKUP_DETECTOR and CONFIG_SOFTLOCKUP_DETECTOR. > > Our defconfigs still have CONFIG_LOCKUP_DETECTOR=y, but that is no longer > user selectable, and we don't mention the new options, so we end up with > none of them enabled. > > So update the defconfigs to turn on the new SOFT and HARD options, the > end result being the same as what we had previously. > > Fixes: 05a4a9527931 ("kernel/watchdog: split up config options") > Signed-off-by: Michael EllermanApplied to powerpc fixes. https://git.kernel.org/powerpc/c/7310d5c8c55e8987a854507f71022f cheers
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On Thu, 2017-08-10 at 09:19 +0200, Cédric Le Goater wrote: > > > > > + /* Perform the acknowledge hypervisor to register cycle */ > > > > > + ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG)); > > > > > > > > Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of > > > > the higher level IO helpers? > > > > > > This is one of the many ways to do MMIOs on the TIMA. This memory > > > region defines a set of offsets and sizes for which loads and > > > stores have different effects. > > > > > > See the arch/powerpc/include/asm/xive-regs.h file and > > > arch/powerpc/kvm/book3s_xive_template.c for some more usage. > > > > Sure, much like any IO region. My point is, why do you want this kind > > of complex combo, rather than say an in_be16() or readw_be(). > > > > The code is inherited from the native backend. > > I think this is because we know what we are doing and we skip > the synchronization routines of the higher level IO helpers. > That might not be necessary for sPAPR. Ben ? It's a performance optimisation, we know we don't need the sync,twi,isync crap of the higher level accessor here. Cheers, Ben.
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On Thu, 2017-08-10 at 08:45 +0200, Cédric Le Goater wrote: > > The problem with doorbells on POWER9 guests is that they may have > > to trap and be emulated by the hypervisor, since the guest threads > > on P9 don't have to match the HW threads of the core. > > Well, the pseries cause_ipi() handler does : > > if (doorbell_try_core_ipi(cpu)) > return; > > to limit the doorbells to the same core. So we should be fine ? No. It's theorically possible to create a guest that think it has 4 threads on P9 but those threads run on different cores of the host. The doorbells are useful if KVM uses a "P8 style" whole-core dispatch model or with PowerVM. We should probably invent some kind of DT property to tell the guest I suppoes. > If not > I suppose we should check CPU_FTR_ARCH_300 and use IPIs only for XIVE. > > > Thus it's quite possible that using XIVE for IPIs is actually faster > > than doorbells in that case. > > How can we measure that ? ebizzy may be. Or a simple socket ping pong with processes pinned to different threads. However the current KVM for P9 doesn't do threads yet afaik. Cheers, Ben.
[PATCH v2 2/3] livepatch: send a fake signal to all blocking tasks
Live patching consistency model is of LEAVE_PATCHED_SET and SWITCH_THREAD. This means that all tasks in the system have to be marked one by one as safe to call a new patched function. Safe means when a task is not (sleeping) in a set of patched functions. That is, no patched function is on the task's stack. Another clearly safe place is the boundary between kernel and userspace. The patching waits for all tasks to get outside of the patched set or to cross the boundary. The transition is completed afterwards. The problem is that a task can block the transition for quite a long time, if not forever. It could sleep in a set of patched functions, for example. Luckily we can force the task to leave the set by sending it a fake signal, that is a signal with no data in signal pending structures (no handler, no sign of proper signal delivered). Suspend/freezer use this to freeze the tasks as well. The task gets TIF_SIGPENDING set and is woken up (if it has been sleeping in the kernel before) or kicked by rescheduling IPI (if it was running on other CPU). This causes the task to go to kernel/userspace boundary where the signal would be handled and the task would be marked as safe in terms of live patching. There are tasks which are not affected by this technique though. The fake signal is not sent to kthreads. They should be handled in a different way. They can be woken up so they leave the patched set and their TIF_PATCH_PENDING can be cleared thanks to stack checking. For the sake of completeness, if the task is in TASK_RUNNING state but not currently running on some CPU it doesn't get the IPI, but it would eventually handle the signal anyway. Second, if the task runs in the kernel (in TASK_RUNNING state) it gets the IPI, but the signal is not handled on return from the interrupt. It would be handled on return to the userspace in the future when the fake signal is sent again. Stack checking deals with these cases in a better way. If the task was sleeping in a syscall it would be woken by our fake signal, it would check if TIF_SIGPENDING is set (by calling signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with ERESTART* return values are restarted in case of the fake signal (see do_signal()). EINTR is propagated back to the userspace program. This could disturb the program, but... * each process dealing with signals should react accordingly to EINTR return values. * syscalls returning EINTR happen to be quite common situation in the system even if no fake signal is sent. * freezer sends the fake signal and does not deal with EINTR anyhow. Thus EINTR values are returned when the system is resumed. The very safe marking is done in architectures' "entry" on syscall and interrupt/exception exit paths, and in a stack checking functions of livepatch. TIF_PATCH_PENDING is cleared and the next recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also call klp_update_patch_state() before do_signal(), so that recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING immediately and thus prevent a double call of do_signal(). Note that the fake signal is not sent to stopped/traced tasks. Such task prevents the patching to finish till it continues again (is not traced anymore). Last, sending the fake signal is not automatic. It is done only when admin requests it by writing 1 to force sysfs attribute in livepatch sysfs directory. Signed-off-by: Miroslav BenesCc: Oleg Nesterov Cc: Michael Ellerman Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: linuxppc-dev@lists.ozlabs.org Cc: x...@kernel.org --- Documentation/ABI/testing/sysfs-kernel-livepatch | 4 ++- Documentation/livepatch/livepatch.txt| 5 ++- arch/powerpc/kernel/signal.c | 6 ++-- arch/x86/entry/common.c | 6 ++-- kernel/livepatch/core.c | 9 -- kernel/livepatch/transition.c| 40 kernel/livepatch/transition.h| 1 + kernel/signal.c | 4 ++- 8 files changed, 64 insertions(+), 11 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch index b7a487ca8852..45f4e3551d27 100644 --- a/Documentation/ABI/testing/sysfs-kernel-livepatch +++ b/Documentation/ABI/testing/sysfs-kernel-livepatch @@ -16,9 +16,11 @@ Contact: live-patch...@vger.kernel.org The attribute allows administrator to affect the course of an existing transition. - Reading from the file returns all available operations. + Reading from the file returns all available operations, which + may be "signal" (signalling remaining tasks). Writing
Re: [PATCH] scsi-mq: Always unprepare before requeuing a request
"Martin K. Petersen"writes: >> One of the two scsi-mq functions that requeue a request unprepares a >> request before requeueing (scsi_io_completion()) but the other >> function not (__scsi_queue_insert()). Make sure that a request is >> unprepared before requeuing it. > > Applied to 4.13/scsi-fixes. Thanks much! This seems to be preventing my Power8 box, which uses IPR, from booting. Bisect said so: # first bad commit: [270065e92c317845d69095ec8e3d18616b5b39d5] scsi: scsi-mq: Always unprepare before requeuing a request And if I revert that on top of next-20170809 my system boots again. The symptom is that it just gets "stuck" during boot and never gets to mounting root, full log below, the end is: . ready ready sd 0:2:4:0: [sde] 554287104 512-byte logical blocks: (284 GB/264 GiB) sd 0:2:4:0: [sde] 4096-byte physical blocks sd 0:2:5:0: [sdf] 272646144 512-byte logical blocks: (140 GB/130 GiB) sd 0:2:5:0: [sdf] 4096-byte physical blocks sd 0:2:4:0: [sde] Write Protect is off sd 0:2:4:0: [sde] Mode Sense: 0b 00 00 08 sd 0:2:5:0: [sdf] Write Protect is off sd 0:2:5:0: [sdf] Mode Sense: 0b 00 00 08 And it just sits there for at least hours. I compared a good and bad boot log and there appears to be essentially no difference. Certainly nothing that looks suspicous. Any ideas? cheers [ 29.492780] kexec_core: Starting new kernel Allocated 2621440 bytes for 2048 pacas at cfd8 Page sizes from device-tree: base_shift=12: shift=12, sllp=0x, avpnm=0x, tlbiel=1, penc=0 base_shift=12: shift=16, sllp=0x, avpnm=0x, tlbiel=1, penc=7 base_shift=12: shift=24, sllp=0x, avpnm=0x, tlbiel=1, penc=56 base_shift=16: shift=16, sllp=0x0110, avpnm=0x, tlbiel=1, penc=1 base_shift=16: shift=24, sllp=0x0110, avpnm=0x, tlbiel=1, penc=8 base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, tlbiel=0, penc=0 base_shift=34: shift=34, sllp=0x0120, avpnm=0x07ff, tlbiel=0, penc=3 Huge page(16GB) memory: addr = 0x8 size = 0x4 pages = 1 Huge page(16GB) memory: addr = 0xC size = 0x4 pages = 1 Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24 Using 1TB segments Initializing hash mmu with SLB Linux version 4.12.0-gcc-6.3.1-10811-g270065e92c31 (mich...@ka3.ozlabs.ibm.com) (gcc version 6.3.1 20170214 (Custom e9096cb27f4bd642)) #384 SMP Thu Aug 10 19:43:35 AEST 2017 Using pSeries machine description bootconsole [udbg0] enabled Partition configured for 96 cpus. CPU maps initialized for 8 threads per core (thread shift is 3) Freed 2490368 bytes for unused pacas -> smp_release_cpus() spinning_secondaries = 0 <- smp_release_cpus() - ppc64_pft_size= 0x1c phys_mem_size = 0x10 dcache_bsize = 0x80 icache_bsize = 0x80 cpu_features = 0x07fc7aec18500249 possible= 0x5fff18500649 always = 0x18100040 cpu_user_features = 0xdc0065c2 0xef00 mmu_features = 0x7c006001 firmware_features = 0x0001c45ffc5f htab_hash_mask= 0x1f - numa: NODE_DATA [mem 0x36300-0x3] numa: NODE_DATA [mem 0x7ffcc2300-0x7ffccbfff] numa: NODE_DATA [mem 0x7ffcb8600-0x7ffcc22ff] numa: NODE_DATA(2) on node 1 numa: NODE_DATA [mem 0x7ffcae900-0x7ffcb85ff] numa: NODE_DATA(3) on node 1 node 2 must be removed before remove section 2045 PCI host bridge /pci@8002015 ranges: MEM 0x3fc0c000..0x3fc0cfff -> 0xc000 MEM 0x3018..0x301b -> 0x0003d018 PCI host bridge /pci@800201b ranges: MEM 0x3fc2f000..0x3fc2f7ff -> 0xf000 PCI host bridge /pci@800201e ranges: MEM 0x3fc2c000..0x3fc2dfff -> 0xc000 MEM 0x3058..0x305b -> 0x0003d058 PPC64 nvram contains 15360 bytes Top of RAM: 0x10, Total RAM: 0x10 Memory hole size: 0MB Zone ranges: DMA [mem 0x-0x000f] DMA32empty Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x-0x0003] node 1: [mem 0x0004-0x0007] node 2: [mem 0x0008-0x000b] node 3: [mem 0x000c-0x000f] Initmem setup node 0 [mem 0x-0x0003] On node 0 totalpages: 262144 DMA zone: 256 pages used for memmap DMA zone: 0 pages reserved DMA zone: 262144 pages, LIFO batch:1 Initmem setup node 1 [mem 0x0004-0x0007] On node 1 totalpages: 262144 DMA zone: 256 pages used for memmap DMA zone: 0 pages reserved DMA zone: 262144 pages, LIFO batch:1 Initmem setup node 2 [mem 0x0008-0x000b] On node 2 totalpages: 262144 DMA zone: 256 pages used for memmap
Re: [PATCH 07/10] powerpc/xive: add XIVE exploitation mode to CAS
On 08/08/2017 10:56 AM, Cédric Le Goater wrote: > On POWER9, the Client Architecture Support (CAS) negotiation process > determines whether the guest operates in XIVE Legacy compatibility or > in XIVE exploitation mode. > > Now that we have initial guest support for the XIVE interrupt > controller, let's inform the hypervisor what we can do. > > Signed-off-by: Cédric Le Goater> --- > arch/powerpc/kernel/prom_init.c | 15 ++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c > index 613f79f03877..25c14f543bd7 100644 > --- a/arch/powerpc/kernel/prom_init.c > +++ b/arch/powerpc/kernel/prom_init.c > @@ -177,6 +177,7 @@ struct platform_support { > bool hash_mmu; > bool radix_mmu; > bool radix_gtse; > + bool xive; > }; > > /* Platforms codes are now obsolete in the kernel. Now only used within this > @@ -1054,6 +1055,12 @@ static void __init prom_parse_platform_support(u8 > index, u8 val, > support->radix_gtse = true; > } > break; > + case OV5_INDX(OV5_XIVE_SUPPORT): /* XIVE Exploitation mode */ > + if (val & OV5_FEAT(OV5_XIVE_SUPPORT)) { This should be : + if (val & OV5_FEAT(OV5_XIVE_EXPLOIT)) { I will fix it in the next version of the patchset. C. > + prom_debug("XIVE - exploitation mode\n"); > + support->xive = true; > + } > + break; > } > } > > @@ -1062,7 +1069,8 @@ static void __init prom_check_platform_support(void) > struct platform_support supported = { > .hash_mmu = false, > .radix_mmu = false, > - .radix_gtse = false > + .radix_gtse = false, > + .xive = false > }; > int prop_len = prom_getproplen(prom.chosen, > "ibm,arch-vec-5-platform-support"); > @@ -1095,6 +1103,11 @@ static void __init prom_check_platform_support(void) > /* We're probably on a legacy hypervisor */ > prom_debug("Assuming legacy hash support\n"); > } > + > + if (supported.xive) { > + prom_debug("Asking for XIVE\n"); > + ibm_architecture_vec.vec5.intarch = OV5_FEAT(OV5_XIVE_EXPLOIT); > + } > } > > static void __init prom_send_capabilities(void) >
[FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest
For a PowerKVM guest, it is possible to specify a DIMM device in addition to the system RAM at boot time. When such a cold plugged DIMM device is removed from a radix guest, we hit the following warning in the guest kernel resulting in the eventual failure of memory unplug: remove_pud_table: unaligned range WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0 Call Trace: remove_pagetable+0x464/0xca0 (unreliable) radix__remove_section_mapping+0x24/0x40 remove_section_mapping+0x28/0x60 arch_remove_memory+0xcc/0x120 remove_memory+0x1ac/0x270 dlpar_remove_lmb+0x1ac/0x210 dlpar_memory+0xbc4/0xeb0 pseries_hp_work_fn+0x1a4/0x230 process_one_work+0x1cc/0x660 worker_thread+0xac/0x6d0 kthread+0x16c/0x1b0 ret_from_kernel_thread+0x5c/0x74 The DIMM memory that is cold plugged gets merged to the same memblock region as RAM and hence gets mapped at 1G alignment. However since the removal is done for one LMB (lmb size 256MB) at a time, the address of the LMB (which is 256MB aligned) would get flagged as unaligned in remove_pud_table() resulting in the above failure. This problem is not seen for hot plugged memory because for the hot plugged memory, the mappings are created separately for each LMB and hence they all get aligned at 256MB. To fix this problem for the cold plugged memory, let us mark the cold plugged memblock region explicitly as HOTPLUGGED so that the region doesn't get merged with RAM. All the memory that is discovered via ibm,dynamic-memory-configuration is marked so(1). Next identify such regions in radix_init_pgtable() and create separate mappings within that region for each LMB so that they get don't get aligned like RAM region at 1G (2). (1) For PowerKVM guests, all boot time memory is represented via memory@ nodes and hot plugged/pluggable memory is represented via ibm,dynamic-memory-reconfiguration property. We are marking all hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED. With this only cold plugged memory gets marked for PowerKVM but need to check how this will affect PowerVM guests. (2) To create separate mappings for every LMB in the hot plugged region, we need lmb-size. I am currently using memory_block_size_bytes() API to get the lmb-size. Since this is early init time code, the machine type isn't probed yet and hence memory_block_size_bytes() would return the default LMB size as 16MB. Hence we end up creating separate mappings at much lower granularity than what we can ideally do for pseries machine. Signed-off-by: Bharata B Rao--- arch/powerpc/kernel/prom.c | 1 + arch/powerpc/mm/pgtable-radix.c | 17 ++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index f830562..24ecf53 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node) size = 0x8000ul - base; } memblock_add(base, size); + memblock_mark_hotplug(base, size); } while (--rngs); } memblock_dump_all(); diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c index 671a45d..180d25a 100644 --- a/arch/powerpc/mm/pgtable-radix.c +++ b/arch/powerpc/mm/pgtable-radix.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -255,15 +256,25 @@ static void __init radix_init_pgtable(void) { unsigned long rts_field; struct memblock_region *reg; + phys_addr_t addr; + u64 lmb_size = memory_block_size_bytes(); /* We don't support slb for radix */ mmu_slb_size = 0; /* * Create the linear mapping, using standard page size for now */ - for_each_memblock(memory, reg) - WARN_ON(create_physical_mapping(reg->base, - reg->base + reg->size)); + for_each_memblock(memory, reg) { + if (memblock_is_hotpluggable(reg)) { + for (addr = reg->base; addr < (reg->base + reg->size); + addr += lmb_size) + WARN_ON(create_physical_mapping(addr, + addr + lmb_size)); + } else { + WARN_ON(create_physical_mapping(reg->base, + reg->base + reg->size)); + } + } /* Find out how many PID bits are supported */ if (cpu_has_feature(CPU_FTR_HVMODE)) { -- 2.7.4
Re: [PATCH 05/16] mm: Protect VMA modifications using VMA sequence count
On 10/08/2017 02:58, Kirill A. Shutemov wrote: > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote: >> On 09/08/2017 12:12, Kirill A. Shutemov wrote: >>> On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote: The VMA sequence count has been introduced to allow fast detection of VMA modification when running a page fault handler without holding the mmap_sem. This patch provides protection agains the VMA modification done in : - madvise() - mremap() - mpol_rebind_policy() - vma_replace_policy() - change_prot_numa() - mlock(), munlock() - mprotect() - mmap_region() - collapse_huge_page() >>> >>> I don't thinks it's anywhere near complete list of places where we touch >>> vm_flags. What is your plan for the rest? >> >> The goal is only to protect places where change to the VMA is impacting the >> page fault handling. If you think I missed one, please advise. > > That's very fragile approach. We rely here too much on specific compiler > behaviour. > > Any write access to vm_flags can, in theory, be translated to several > write accesses. For instance with setting vm_flags to 0 in the middle, > which would result in sigfault on page fault to the vma. Indeed, just setting vm_flags to 0 will not result in sigfault, the real job is done when the pte are updated and the bits allowing access are cleared. Access to the pte is controlled by the pte lock. Page fault handler is triggered based on the pte bits, not the content of vm_flags and the speculative page fault is checking for the vma again once the pte lock is held. So there is no concurrency when dealing with the pte bits. Regarding the compiler behaviour, there are memory barriers and locking which should prevent that. Thanks, Laurent.
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On 08/10/2017 07:54 AM, David Gibson wrote: > On Thu, Aug 10, 2017 at 02:46:00PM +1000, Benjamin Herrenschmidt wrote: >> On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote: >>> >>> Also, will POWER9 always have doorbells? In which case you could >>> reduce it to 3 options. >> >> The problem with doorbells on POWER9 guests is that they may have >> to trap and be emulated by the hypervisor, since the guest threads >> on P9 don't have to match the HW threads of the core. >> >> Thus it's quite possible that using XIVE for IPIs is actually faster >> than doorbells in that case. > > Ok, well unless there's a compelling reason to use doorbells you can > instead make it 3 cases with POWER9 never using doorbells. yes I suppose we could improve things for XIVE. For XICS, there are another 3 backends behind icp_ops->cause_ipi. I need to take a look. Thanks, C.
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
+static void xive_spapr_update_pending(struct xive_cpu *xc) +{ + u8 nsr, cppr; + u16 ack; + + /* Perform the acknowledge hypervisor to register cycle */ + ack = be16_to_cpu(__raw_readw(xive_tima + TM_SPC_ACK_OS_REG)); >>> >>> Why do you need the raw_readw() + be16_to_cpu + mb, rather than one of >>> the higher level IO helpers? >> >> This is one of the many ways to do MMIOs on the TIMA. This memory >> region defines a set of offsets and sizes for which loads and >> stores have different effects. >> >> See the arch/powerpc/include/asm/xive-regs.h file and >> arch/powerpc/kvm/book3s_xive_template.c for some more usage. > > Sure, much like any IO region. My point is, why do you want this kind > of complex combo, rather than say an in_be16() or readw_be(). > The code is inherited from the native backend. I think this is because we know what we are doing and we skip the synchronization routines of the higher level IO helpers. That might not be necessary for sPAPR. Ben ? Thanks, C.
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On 08/10/2017 06:46 AM, Benjamin Herrenschmidt wrote: > On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote: >> >> Also, will POWER9 always have doorbells? In which case you could >> reduce it to 3 options. > > The problem with doorbells on POWER9 guests is that they may have > to trap and be emulated by the hypervisor, since the guest threads > on P9 don't have to match the HW threads of the core. Well, the pseries cause_ipi() handler does : if (doorbell_try_core_ipi(cpu)) return; to limit the doorbells to the same core. So we should be fine ? If not I suppose we should check CPU_FTR_ARCH_300 and use IPIs only for XIVE. > Thus it's quite possible that using XIVE for IPIs is actually faster > than doorbells in that case. How can we measure that ? ebizzy may be. C.
Re: [PATCH 10/13] powerpc/64s: idle simplify KVM idle on POWER9
On Tue, Aug 08, 2017 at 10:42:57PM +1000, Nicholas Piggin wrote: > On Tue, 8 Aug 2017 16:06:43 +0530 > Gautham R Shenoywrote: > > > Hi Nicholas, > > > > On Sun, Aug 06, 2017 at 03:02:38AM +1000, Nicholas Piggin wrote: > > > POWER9 CPUs have independent MMU contexts per thread so KVM > > > does not have to bring sibling threads into real-mode when > > > switching MMU mode to guest. This can simplify POWER9 sleep/wake > > > paths and avoids hwsyncs. > > > @@ -2858,11 +2883,13 @@ static noinline void kvmppc_run_core(struct > > > kvmppc_vcore *vc) > > > > > > /* Let secondaries go back to the offline loop */ > > > for (i = 0; i < controlled_threads; ++i) { > > > - kvmppc_release_hwthread(pcpu + i); > > > if (sip && sip->napped[i]) > > > kvmppc_ipi_thread(pcpu + i); > > > cpumask_clear_cpu(pcpu + i, >kvm->arch.cpu_in_guest); > > > } > > > > We are sending an IPI to the thread that has exited the guest and is > > currently napping. The IPI wakes it up so that it can executes > > offline loop. But we haven't released the hwthread yet, which means > > that hwthread_req for this thread is still set. > > > > The thread wakes up from nap, executes the pnv_powersave_wakeup code > > where it can enter kvm_start_guest. Is this a legitimate race or am I > > missing something? > > Oh I think it's just a silly mistake in my patch, good catch. Ah,np! > Would moving this loop below the one below solve it? I wasn't > completely happy with uglifying these loops by making the > primary release different than secondary... maybe I will just > move the difference into kvmppc_release_hwthread and which is > less intrusive to callers. I think moving it to kvmppc_release_hwthread is a good idea. > > Thanks, > Nick > -- Thanks and Regards gautham.
Re: [PATCH 02/10] powerpc/xive: guest exploitation of the XIVE interrupt controller
On Thu, Aug 10, 2017 at 02:46:00PM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2017-08-10 at 14:28 +1000, David Gibson wrote: > > > > Also, will POWER9 always have doorbells? In which case you could > > reduce it to 3 options. > > The problem with doorbells on POWER9 guests is that they may have > to trap and be emulated by the hypervisor, since the guest threads > on P9 don't have to match the HW threads of the core. > > Thus it's quite possible that using XIVE for IPIs is actually faster > than doorbells in that case. Ok, well unless there's a compelling reason to use doorbells you can instead make it 3 cases with POWER9 never using doorbells. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature