Re: pseries/le: Fix endianness issue in RTAS call from xmon
On 26/11/2014 04:31, Michael Ellerman wrote:
> OK. You say LPAR, by which you mean under phyp I think. I haven't seen this
> under KVM, and it looks like KVM doesn't implement set-indicator, so that
> would explain that.

Yes, LPAR implies phyp, and KVM doesn't implement set-indicator, so this
doesn't happen in that case.

> I'll take this as a bug fix and CC it to stable.

That's a good point.

Thanks,
Laurent.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RESEND, V3] powerpc, xmon: Enable HW instruction breakpoint on POWER8
On Tue, 2014-11-25 at 10:08:48 UTC, Anshuman Khandual wrote:
> This patch enables support for hardware instruction breakpoints on POWER8
> with the help of a new register, CIABR (Completed Instruction Address
> Breakpoint Register). With this patch, a single hardware instruction
> breakpoint can be added and cleared during any active xmon debug session.
> This hardware based instruction breakpoint mechanism works correctly along
> with the existing TRAP based instruction breakpoints available on xmon.

Hi Anshuman,

> diff --git a/arch/powerpc/include/asm/xmon.h b/arch/powerpc/include/asm/xmon.h
> index 5eb8e59..5d17aec 100644
> --- a/arch/powerpc/include/asm/xmon.h
> +++ b/arch/powerpc/include/asm/xmon.h
> @@ -29,5 +29,11 @@ static inline void xmon_register_spus(struct list_head *list) { };
>  extern int cpus_are_in_xmon(void);
>  #endif

This file is the exported interface *of xmon*, it's not the place to put
things that xmon needs internally. For now just put it in xmon.c.

> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_SPLPAR)
> +#include <asm/plpar_wrappers.h>
> +#else
> +static inline long plapr_set_ciabr(unsigned long ciabr) { return 0; };
> +#endif

Also the ifdef is overly verbose: CONFIG_PPC_SPLPAR essentially depends on
CONFIG_PPC_BOOK3S_64, so you can just use #ifdef CONFIG_PPC_SPLPAR.

> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index b988b5a..c2f601a 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -271,6 +273,55 @@ static inline void cinval(void *p)
>  }
>
>  /*
> + * write_ciabr
> + *
> + * This function writes a value to the
> + * CIABR register either directly through
> + * mtspr instruction if the kernel is in HV
> + * privilege mode or call a hypervisor function
> + * to achieve the same in case the kernel is in
> + * supervisor privilege mode.
> + */

I'm not really sure a function this small needs a documentation block. But if
you're going to add one, PLEASE make sure it's an actual kernel-doc style
comment.
You can check with:

  $ ./scripts/kernel-doc -text arch/powerpc/xmon/xmon.c

Which you'll notice prints:

  Warning(arch/powerpc/xmon/xmon.c): no structured comments found

You need something like:

/**
 * write_ciabr() - write the CIABR SPR
 * @ciabr: The value to write.
 *
 * This function writes a value to the CIABR register either directly through
 * mtspr instruction if the kernel is in HV privilege mode or calls a
 * hypervisor function to achieve the same in case the kernel is in supervisor
 * privilege mode.
 */

The rest of the patch is OK. But I was hoping you'd notice that we no longer
support any cpus that implement CPU_FTR_IABR. And so you can just repurpose
all the iabr logic for ciabr. Something like this, untested:

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index b988b5addf86..6894650bff7f 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -51,6 +51,12 @@
 #include <asm/paca.h>
 #endif
 
+#ifdef CONFIG_PPC_SPLPAR
+#include <asm/plpar_wrappers.h>
+#else
+static inline long plapr_set_ciabr(unsigned long ciabr) { return 0; };
+#endif
+
 #include "nonstdio.h"
 #include "dis-asm.h"
 
@@ -270,6 +276,31 @@ static inline void cinval(void *p)
 	asm volatile ("dcbi 0,%0; icbi 0,%0" : : "r" (p));
 }
 
+static void write_ciabr(unsigned long ciabr)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_207S))
+		return;
+
+	if (cpu_has_feature(CPU_FTR_HVMODE)) {
+		mtspr(SPRN_CIABR, ciabr);
+		return;
+	}
+
+	plapr_set_ciabr(ciabr);
+}
+
+static void set_ciabr(unsigned long addr)
+{
+	addr &= ~CIABR_PRIV;
+
+	if (cpu_has_feature(CPU_FTR_HVMODE))
+		addr |= CIABR_PRIV_HYPER;
+	else
+		addr |= CIABR_PRIV_SUPER;
+
+	write_ciabr(addr);
+}
+
 /*
  * Disable surveillance (the service processor watchdog function)
  * while we are in xmon.
@@ -764,9 +795,9 @@ static void insert_cpu_bpts(void)
 		brk.len = 8;
 		__set_breakpoint(&brk);
 	}
-	if (iabr && cpu_has_feature(CPU_FTR_IABR))
-		mtspr(SPRN_IABR, iabr->address
-			 | (iabr->enabled & (BP_IABR|BP_IABR_TE)));
+
+	if (iabr)
+		set_ciabr(iabr->address);
 }
 
 static void remove_bpts(void)
@@ -792,8 +823,7 @@ static void remove_cpu_bpts(void)
 {
 	hw_breakpoint_disable();
-	if (cpu_has_feature(CPU_FTR_IABR))
-		mtspr(SPRN_IABR, 0);
+	write_ciabr(0);
 }
 
 /* Command interpreting routine */
@@ -1127,7 +1157,7 @@ static char *breakpoint_help_string =
 "b <addr> [cnt]	set breakpoint at given instr addr\n"
 "bc	clear all breakpoints\n"
 "bc <n/addr>	clear breakpoint number n or at addr\n"
-"bi <addr> [cnt]	set hardware instr breakpoint (POWER3/RS64 only)\n"
+"bi <addr> [cnt]	set hardware instr breakpoint (POWER8 only)\n"
 "bd <addr> [cnt]	set hardware data breakpoint\n"
 ;
 
@@ -1166,7 +1196,7 @@
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Hi Michael,

thanks for your reply!

Some general thoughts: this change was introduced mid 2013, but we don't have a single user relying on this code change yet, right? Disabling might_sleep() for functions that clearly state that they may sleep, just to get some special case running, feels wrong to me.

> On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> > removed/skipped all might_sleep checks for might_fault() calls when in
> > atomic.
>
> Yes. You can add e.g. might_sleep in your code if, for some reason, it is
> illegal to call it in an atomic context. But faults are legal in atomic if
> you handle the possible errors, and faults do not necessarily cause the
> caller to sleep, so might_fault should not call might_sleep.

My point is that (almost) everywhere we use pagefault_disable, we are using the inatomic variants. Also the documentation of copy_to_user() clearly states at various points that this function may sleep:

- git grep "This function may sleep" yields "Context: User context only. This function may sleep." on e.g. s390, x86, mips

Patching out the might_sleep() from these functions seems more to be a hack to solve another problem - not using the inatomic variants, or finding them not to be sufficient for a task? So calling might_sleep() in these functions seems very right to me.

> > Reason was to allow calling copy_(to|from)_user while in
> > pagefault_disabled(), because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would
> > find false positives.
>
> That wasn't the only reason BTW. Andi Kleen also showed that the compiler
> did a terrible job optimizing get/put user when might_sleep was called.
> See e.g. this thread: https://lkml.org/lkml/2013/8/14/652
> There was even an lwn.net article about it, that I don't seem to be able
> to locate.

Thanks, I'll try to look it up! But: might_sleep() will only be called when lock debugging / sleep-in-atomic checking is enabled, so this doesn't seem to be a problem for me in a production environment.
Or am I missing something?

> So might_sleep is not appropriate for lightweight operations like
> __get_user, which people literally expect to be a single instruction.

I agree that __.* variants should not call might_fault() (like I mentioned after the table below).

> I also have a project going which handles very short packets by copying
> them into guest memory directly without waking up a thread. I do it by
> calling recvmsg on a socket, then switching to a thread if I get back
> EFAULT. Not yet clean enough to upstream but it does seem to cut the
> latency down quite a bit, so I'd like to have the option.
>
> Generally, a caller that does something like this under a spinlock:
>
>	preempt_disable
>	pagefault_disable
>	error = copy_to_user

So why can't we use the inatomic variant? This seems to be an atomic environment to me. Calling a function that states that it may sleep doesn't feel right to me.

>	pagefault_enable
>	preempt_enable_no_resched
>
> is not doing anything wrong and should not get a warning, as long as error
> is handled correctly later.
>
> You can also find the discussion around the patches educational:
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928

> > However we have the inatomic variants of these functions for this purpose.
>
> Does inatomic install fixups now?

If not, I think this would rather be the way to go.

> Last I checked, it didn't so it did not satisfy this purpose. See this
> comment from x86:
>
>  * Copy data from kernel space to user space. Caller must check
>  * the specified block with access_ok() before calling this function.
>  * The caller should also make sure he pins the user space address
>  * so that we don't result in page fault and sleep.
>
> Also - switching to inatomic would scatter if (atomic) all over code. It's
> much better to simply call the same function (e.g. recvmsg) and have it
> fail gracefully: after all, we have code to handle get/put user errors
> anyway.
> > The result of this change was that all guest access (that correctly uses
> > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP
> > is enabled. We lost a mighty debugging feature for user access.
>
> What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have shown this error immediately.

I would have said that in 99,9% of all copy_to_user() calls we want to check might_sleep(). That pagefault_disable() is a special case that should be handled differently - in my opinion.

> If your code can fault, then it's safe to call from atomic context. If it
> can't, it must pin the page. You can not do access_ok checks and then
> assume access won't fault.

As I wasn't able to come up with any other reason why this should be
Re: powerpc/powernv: Fix the hmi event version check.
On 11/26/2014 09:14 AM, Michael Ellerman wrote:
> On Thu, 2014-11-20 at 04:14:36 UTC, Mahesh Salgaonkar wrote:
> > From: Mahesh Salgaonkar mah...@linux.vnet.ibm.com
> >
> > The current HMI event structure is an ABI and carries a version field to
> > accommodate future changes without affecting/rearranging current structure
> > members that are valid for previous versions. The current version check
> > "if (hmi_evt->version != OpalHMIEvt_V1)" seems to consider that version
> > will always be V1, which may not be true in future. If we start supporting
> > HMI event V2, this check would fail without printing anything on older
> > kernels. This patch fixes this issue.
>
> It's not clear what you mean when you say "this check would fail without
> printing anything". The check will fail, and it will print something, ie.
> the error message. What you mean is the check will fail, and the HMI info
> will not be printed.

My bad, yes. I meant "HMI info will not be printed". Do you want me to re-spin the patch with the correction?

> I'll CC this to stable unless you disagree.

Yes, this patch needs to go to stable.

Thanks,
-Mahesh.
Disabled LocalPlus Controller (LPC) clock on MPC512x
Hello.

My Freescale TWR-MPC5125 board instantly reboots if I touch any physical address on the LocalPlus Bus (LPB) for the first time when Linux has already booted. This effect is reproduced by using /dev/mem or loading a kernel module which works with any peripherals on LPB.

It took me some time to find out that such a crash is caused by clk_disable_unused() in drivers/clk/clk.c, which disables the LocalPlus Controller (LPC) clock if I don't touch LPB addresses in the previous initcalls. So starting Linux with the clk_ignore_unused bootparam or inserting a dummy LPB read into some initcall is a temporary fix.

Is it correct to gate the LPC clock? If yes, how to avoid the mentioned crashes properly?

There's a piece of code in arch/powerpc/platforms/512x/clock-commonclk.c which is doubtful for me:

	/*
	 * pre-enable those internal clock items which never get
	 * claimed by any peripheral driver, to not have the clock
	 * subsystem disable them late at startup
	 */
	clk_prepare_enable(clks[MPC512x_CLK_DUMMY]);
	clk_prepare_enable(clks[MPC512x_CLK_E300]);	/* PowerPC CPU */
	clk_prepare_enable(clks[MPC512x_CLK_DDR]);	/* DRAM */
	clk_prepare_enable(clks[MPC512x_CLK_MEM]);	/* SRAM */
	clk_prepare_enable(clks[MPC512x_CLK_IPS]);	/* SoC periph */
	clk_prepare_enable(clks[MPC512x_CLK_LPC]);	/* boot media */

Does it mean that these clocks should be registered with the CLK_IGNORE_UNUSED flag?

Thanks a lot.

Best regards,
Alexander
Re: [PATCH v2 07/14] of/reconfig: Always use the same structure for notifiers
On Tue, 25 Nov 2014 21:11:58 -0600, Nathan Fontenot nf...@linux.vnet.ibm.com wrote:
> On 11/25/2014 05:07 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-11-24 at 22:33 +, Grant Likely wrote:
> > > The OF_RECONFIG notifier callback uses a different structure depending
> > > on whether it is a node change or a property change. This is silly,
> > > and not very safe. Rework the code to use the same data structure
> > > regardless of the type of notifier.
> >
> > I feel pretty good about this one except...
> >
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index b9d1dfdbe5bb..9fe6002c1d5a 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -1711,12 +1711,11 @@ static void stage_topology_update(int core_id)
> > >  static int dt_update_callback(struct notifier_block *nb,
> > >  				unsigned long action, void *data)
> > >  {
> > > -	struct of_prop_reconfig *update;
> > > +	struct of_reconfig_data *update = data;
> > >  	int rc = NOTIFY_DONE;
> > >
> > >  	switch (action) {
> > >  	case OF_RECONFIG_UPDATE_PROPERTY:
> > > -		update = (struct of_prop_reconfig *)data;
> >
> > Should we assert/bug on !update->dn / update->prop ? (Same for the rest
> > of the patch) Or do you reckon it's pointless ?
>
> I'm not sure it's worth it; if those are NULL pointers the drivers/of code
> would have tried to use them before invoking the notifier chain. We won't
> make it this far if they're NULL.

Agreed. I'm going to merge it as-is.

g.
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > What's the path you are trying to debug?
>
> Well, we had a problem where we held a spin_lock and called
> copy_(from|to)_user(). We experienced very random deadlocks that took some
> guy almost a week to debug. The simple might_sleep() check would have
> showed this error immediately.

This must have been a very old kernel. A modern kernel will return an error from copy_to_user. Which is really the point of the patch you are trying to revert.
[RFC PATCH v1 1/1] powerpc/85xx: Add support for Emerson/Artesyn MVME2500.
Add support for the Artesyn MVME2500 Single Board Computer. The MVME2500 is a 6U form factor VME64 computer with:

- A single Freescale QorIQ P2010 CPU
- 1 GB of DDR3 onboard memory
- Three Gigabit Ethernets
- Five 16550 compatible UARTS
- One USB 2.0 port, one SDHC socket and one SATA connector
- One PCI/PCI eXpress Mezzanine Card (PMC/XMC) Slot
- MultiProcessor Interrupt Controller (MPIC)
- A DS1375T Real Time Clock (RTC) and 512 KB of Non-Volatile Memory
- Two 64 KB EEPROMs
- U-Boot in 16 SPI Flash

This patch is based on linux-3.18-rc6 and has been boot tested.

Signed-off-by: Alessio Igor Bogani alessio.bog...@elettra.eu
---
 arch/powerpc/boot/dts/mvme2500.dts           | 324 +++
 arch/powerpc/boot/dts/mvme2500.dtsi          |  28 +++
 arch/powerpc/configs/85xx/mvme2500_defconfig | 222 ++
 arch/powerpc/platforms/85xx/Kconfig          |   8 +
 arch/powerpc/platforms/85xx/Makefile         |   1 +
 arch/powerpc/platforms/85xx/mvme2500.c       |  95 
 6 files changed, 678 insertions(+)
 create mode 100644 arch/powerpc/boot/dts/mvme2500.dts
 create mode 100644 arch/powerpc/boot/dts/mvme2500.dtsi
 create mode 100644 arch/powerpc/configs/85xx/mvme2500_defconfig
 create mode 100644 arch/powerpc/platforms/85xx/mvme2500.c

diff --git a/arch/powerpc/boot/dts/mvme2500.dts b/arch/powerpc/boot/dts/mvme2500.dts
new file mode 100644
index 000..fca05fc
--- /dev/null
+++ b/arch/powerpc/boot/dts/mvme2500.dts
@@ -0,0 +1,324 @@
+/*
+ * Device tree source for the Emerson/Artesyn MVME2500
+ *
+ * Copyright 2014 Elettra-Sincrotrone Trieste S.C.p.A.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * Based on: P2020 DS Device Tree Source
+ * Copyright 2009 Freescale Semiconductor Inc.
+ */
+
+/include/ "fsl/p2020si-pre.dtsi"
+
+/ {
+	model = "MVME2500";
+	compatible = "artesyn,MVME2500";
+
+	aliases {
+		serial2 = &serial2;
+		serial3 = &serial3;
+		serial4 = &serial4;
+		serial5 = &serial5;
+	};
+
+	memory {
+		device_type = "memory";
+	};
+
+	board_soc: soc: soc@ffe0 {
+		ranges = <0x0 0 0xffe0 0x10>;
+
+		i2c@3000 {
+			hwmon@4c {
+				compatible = "adi,adt7461";
+				reg = <0x4c>;
+			};
+
+			rtc@68 {
+				compatible = "dallas,ds1337";
+				reg = <0x68>;
+				interrupts = <8 1 0 0>;
+			};
+
+			eeprom-vpd@54 {
+				compatible = "atmel,24c64";
+				reg = <0x54>;
+			};
+
+			eeprom@52 {
+				compatible = "atmel,24c512";
+				reg = <0x52>;
+			};
+
+			eeprom@53 {
+				compatible = "atmel,24c512";
+				reg = <0x53>;
+			};
+
+			spd@50 {
+				compatible = "atmel,24c02";
+				reg = <0x50>;
+			};
+
+		};
+
+		spi0: spi@7000 {
+			fsl,espi-num-chipselects = <2>;
+
+			flash@0 {
+				#address-cells = <1>;
+				#size-cells = <1>;
+				compatible = "atmel,at25df641";
+				reg = <0>;
+				spi-max-frequency = <1000>;
+				partition@u-boot {
+					label = "u-boot";
+					reg = <0x 0x000A>;
+					read-only;
+				};
+				partition@dtb {
+					label = "dtb";
+					reg = <0x000A 0x0002>;
+				};
+				partition@misc {
+					label = "misc";
+					reg = <0x000C 0x0004>;
+				};
+				partition@u-boot-env {
+					label = "u-boot-env";
+					reg = <0x0010 0x0002>;
+				};
+				partition@kernel {
+					label = "kernel";
+
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Hmm, you sent the same mail to me off-list, and I replied there. Now there's a copy on the list - I'm just going to assume it's exactly identical, pasting my response here as well. If there are more questions I missed, let me know.

On Wed, Nov 26, 2014 at 09:23:31AM +0100, David Hildenbrand wrote:
> Sorry I haven't put you on cc, must have lost you while packing my list :)
> Thanks for your answer!
>
> This change was introduced in 2013, and I haven't seen an instance making
> use of your described scenario, right?

My work is still out of tree. RHEL6 shipped some patches that use this. I don't know whether any instances in-tree use this capability. But it just doesn't make sense for might_fault to call might_sleep, because a fault does not imply sleep.

> > On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> > > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> > > removed/skipped all might_sleep checks for might_fault() calls when in
> > > atomic.
> >
> > Yes. You can add e.g. might_sleep in your code if, for some reason, it is
> > illegal to call it in an atomic context. But faults are legal in atomic
> > if you handle the possible errors, and faults do not necessarily cause
> > the caller to sleep, so might_fault should not call might_sleep.
>
> I think we should use the inatomic variants for this purpose (as done in
> the code for now) - especially as pagefault_disable has been relying on
> the preempt count for a long time. But see my comment below about existing
> documentation.

IIUC they are not interchangeable. *_inatomic seems to require that the page is pinned; *_user does not: it installs a fixup so you get an error code if you try to access an invalid address. Besides, this would just lead to a ton of "if (atomic) return inatomic else return user" in code, for no good purpose.

> > > Reason was to allow calling copy_(to|from)_user while in
> > > pagefault_disabled(), because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP
> > > would find false positives.
> >
> > That wasn't the only reason BTW.
> > Andi Kleen also showed that the compiler did a terrible job optimizing
> > get/put user when might_sleep was called.
>
> might_fault() should never call might_sleep() when lock debugging is off,
> so there should be no performance problem, or am I missing something?

CONFIG_DEBUG_ATOMIC_SLEEP too, no?

> > See e.g. this thread: https://lkml.org/lkml/2013/8/14/652
> > There was even an lwn.net article about it, that I don't seem to be able
> > to locate.
>
> Thanks, will see if I can find it.
>
> > So might_sleep is not appropriate for lightweight operations like
> > __get_user, which people literally expect to be a single instruction.
>
> Yes, as discussed below, __.* variants should never call it.

So that would be even more inconsistent. They fault exactly the same as the non __ variants.

> > I also have a project going which handles very short packets by copying
> > them into guest memory directly without waking up a thread. I do it by
> > calling recvmsg on a socket, then switching to a thread if I get back
> > EFAULT. Not yet clean enough to upstream but it does seem to cut the
> > latency down quite a bit, so I'd like to have the option.
> >
> > Generally, a caller that does something like this under a spinlock:
> >
> >	preempt_disable
> >	pagefault_disable
> >	error = copy_to_user
> >	pagefault_enable
> >	preempt_enable_no_resched
> >
> > is not doing anything wrong and should not get a warning, as long as
> > error is handled correctly later.
>
> I think this would be a perfect fit for an inatomic variant, no?

No, because inatomic does not handle faults.

> I mean even the copy_to_user documentation of e.g. s390, x86, mips clearly
> says: "Context: User context only. This function may sleep."

So the comment is incomplete - I didn't get around to fixing that. It may sleep, but not in atomic context.

> So calling it from your described scenario is wrong. And as the interface
> says, it might_sleep() and therefore also calls the check in might_fault().

Exactly the reverse. There's no way for it to sleep except on fault, and that only if preemption is on.
> > You can also find the discussion around the patches educational:
> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
>
> However we have the inatomic variants of these functions for this purpose.
>
> > Does inatomic install fixups now?
>
> I think this varies between architectures, but I am no expert. But as
> 99,9% of all pagefault_disable code uses inatomic, I would have guessed
> that this is rather the way to go than simply using the non atomic variant
> that clearly states on the interface that it might sleep?

Let's go ahead and make the comment more exact then.

> > Last I checked, it didn't so it did not satisfy this purpose. See this
> > comment from x86:
> >
> >  * Copy data from kernel space to user space. Caller must check
> >  * the specified block with access_ok() before calling this function.
> >  * The caller should also make sure
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > What's the path you are trying to debug?
> >
> > Well, we had a problem where we held a spin_lock and called
> > copy_(from|to)_user(). We experienced very random deadlocks that took
> > some guy almost a week to debug. The simple might_sleep() check would
> > have showed this error immediately.
>
> This must have been a very old kernel. A modern kernel will return an
> error from copy_to_user. Which is really the point of the patch you are
> trying to revert.

That's assuming you disabled preemption. If you didn't, and take a spinlock, you have deadlocks even without userspace access.

-- 
MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On 26.11.2014 16:17, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > What's the path you are trying to debug?
> >
> > Well, we had a problem where we held a spin_lock and called
> > copy_(from|to)_user(). We experienced very random deadlocks that took
> > some guy almost a week to debug. The simple might_sleep() check would
> > have showed this error immediately.
>
> This must have been a very old kernel. A modern kernel will return an
> error from copy_to_user.

I disagree. copy_to_user will not return while holding a spinlock, because it does not know! How should it? See: spin_lock will call preempt_disable, but that's a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function (or others). No?

Christian
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > What's the path you are trying to debug?
> >
> > Well, we had a problem where we held a spin_lock and called
> > copy_(from|to)_user(). We experienced very random deadlocks that took
> > some guy almost a week to debug. The simple might_sleep() check would
> > have showed this error immediately.
> >
> > This must have been a very old kernel. A modern kernel will return an
> > error from copy_to_user. Which is really the point of the patch you are
> > trying to revert.
>
> That's assuming you disabled preemption. If you didn't, and take a
> spinlock, you have deadlocks even without userspace access.

(Thanks for your resend, my first email was sent directly to you ... grml)

This is what happened on our side (very recent kernel):

	spin_lock(&lock)
	copy_to_user(...)
	spin_unlock(&lock)

1. s390 locks/unlocks a spin lock with a compare and swap, using the
   _cpu id_ as old value
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
   the spinlock tried to unlock it)
5. lock remained locked - deadlock

Christian came up with the following explanation: without preemption, spin_lock() will not touch the preempt counter, but pagefault_disable() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep.

So copy_to_user() will on s390:

1. run as atomic while spin_lock() with preemption enabled.
2. run as not atomic while spin_lock() with preemption disabled.
3. run as atomic while pagefault_disabled() with preemption enabled or
   disabled.
4. run as not atomic when really not atomic.

And exactly nr 2. is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :)
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote:
> On 26.11.2014 16:17, Michael S. Tsirkin wrote:
> > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > > What's the path you are trying to debug?
> > >
> > > Well, we had a problem where we held a spin_lock and called
> > > copy_(from|to)_user(). We experienced very random deadlocks that took
> > > some guy almost a week to debug. The simple might_sleep() check would
> > > have showed this error immediately.
> >
> > This must have been a very old kernel. A modern kernel will return an
> > error from copy_to_user.
>
> I disagree. copy_to_user will not return while holding a spinlock, because
> it does not know! How should it? See: spin_lock will call preempt_disable,
> but that's a no-op for a non-preempt kernel. So the mere fact that we hold
> a spin_lock is not known by any user access function (or others). No?
>
> Christian

Well, might_sleep() merely checks preempt count and irqs_disabled too. If you want debugging things to trigger, you need to enable a bunch of config options. That's not new.

-- 
MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
> > > > What's the path you are trying to debug?
> > >
> > > Well, we had a problem where we held a spin_lock and called
> > > copy_(from|to)_user(). We experienced very random deadlocks that took
> > > some guy almost a week to debug. The simple might_sleep() check would
> > > have showed this error immediately.
> >
> > This must have been a very old kernel. A modern kernel will return an
> > error from copy_to_user. Which is really the point of the patch you are
> > trying to revert.
> >
> > That's assuming you disabled preemption. If you didn't, and take a
> > spinlock, you have deadlocks even without userspace access.
>
> (Thanks for your resend, my first email was sent directly to you ... grml)
>
> This is what happened on our side (very recent kernel):
>
>	spin_lock(&lock)
>	copy_to_user(...)
>	spin_unlock(&lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

> 1. s390 locks/unlocks a spin lock with a compare and swap, using the
>    _cpu id_ as old value
> 2. we slept during copy_to_user()
> 3. the thread got scheduled onto another cpu
> 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that
>    locked the spinlock tried to unlock it)
> 5. lock remained locked - deadlock
>
> Christian came up with the following explanation: without preemption,
> spin_lock() will not touch the preempt counter, but pagefault_disable()
> will always touch it. Therefore, with preemption disabled, copy_to_user()
> has no idea that it is running in atomic context - and will therefore try
> to sleep.
>
> So copy_to_user() will on s390:
>
> 1. run as atomic while spin_lock() with preemption enabled.
> 2. run as not atomic while spin_lock() with preemption disabled.
> 3. run as atomic while pagefault_disabled() with preemption enabled or
>    disabled.
> 4. run as not atomic when really not atomic.
>
> And exactly nr 2. is the thing that produced the deadlock in our scenario
> and the reason why I want a might_sleep() :)

IMHO it's not copy_to_user that causes the problem. It's the misuse of spinlocks with preemption on. So might_sleep would make you think copy_to_user is the problem, and e.g. let you paper over it by moving copy_to_user out. Enable the lock prover and you will see what the real issue is, which is that you didn't disable preempt. And if you did, copy_to_user would be okay.

-- 
MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> > This is what happened on our side (very recent kernel):
> >
> >	spin_lock(&lock)
> >	copy_to_user(...)
> >	spin_unlock(&lock)
>
> That's a deadlock even without copy_to_user - it's enough for the thread
> to be preempted and another one to try taking the lock.
>
> > 1. s390 locks/unlocks a spin lock with a compare and swap, using the
> >    _cpu id_ as old value
> > 2. we slept during copy_to_user()
> > 3. the thread got scheduled onto another cpu
> > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that
> >    locked the spinlock tried to unlock it)
> > 5. lock remained locked - deadlock
> >
> > Christian came up with the following explanation: without preemption,
> > spin_lock() will not touch the preempt counter, but pagefault_disable()
> > will always touch it. Therefore, with preemption disabled, copy_to_user()
> > has no idea that it is running in atomic context - and will therefore
> > try to sleep.
> >
> > So copy_to_user() will on s390:
> >
> > 1. run as atomic while spin_lock() with preemption enabled.
> > 2. run as not atomic while spin_lock() with preemption disabled.
> > 3. run as atomic while pagefault_disabled() with preemption enabled or
> >    disabled.
> > 4. run as not atomic when really not atomic.

I should have been more clear at that point:

preemption enabled == kernel compiled with preemption support
preemption disabled == kernel compiled without preemption support

> > And exactly nr 2. is the thing that produced the deadlock in our
> > scenario and the reason why I want a might_sleep() :)
>
> IMHO it's not copy_to_user that causes the problem. It's the misuse of
> spinlocks with preemption on.

As I said, preemption was off.

> So might_sleep would make you think copy_to_user is the problem, and e.g.
> let you paper over it by moving copy_to_user out.

Actually, implementing a different way of locking easily fixed the problem for us. And the old might_sleep() checks would have given us the problem within a few seconds (I tested it).

> Enable the lock prover and you will see what the real issue is, which is
> you didn't disable preempt. And if you did, copy_to_user would be okay.
Our kernel is compiled without preemption and we turned on all lock/atomic sleep debugging aids. No problem was detected.

But the question is whether we shouldn't rather provide a copy_to_user_nosleep() implementation that can be called from pagefault_disable() because it won't sleep, and a copy_to_user_sleep() implementation that cannot be called from pagefault_disable().

Another way to fix it would be a reworked pagefault_disable() function that somehow sets a flag, so copy_to_user() knows that it is in fact called from a valid context, not just from some atomic context. So we could trigger might_sleep() when detecting a !pagefault_disable context.

___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On 26.11.2014 at 16:37, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote: On 26.11.2014 at 16:17, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:

What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have shown this error immediately.

This must have been a very old kernel. A modern kernel will return an error from copy_to_user.

I disagree. copy_to_user will not return while holding a spinlock, because it does not know! How should it? See: spin_lock will call preempt_disable, but that's a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function (or others). No? Christian

Well, might_sleep() merely checks the preempt count and irqs_disabled too. If you want debugging things to trigger, you need to enable a bunch of config options. That's not new.

You miss the point of the whole thread: the problem is that even with debug options enabled, holding a spinlock would not trigger a bug on copy_to_user. So the problem is not the good path; the problem is that a debugging aid for detecting a broken case was lost. Even with all kernel debugging enabled. That is because CONFIG_DEBUG_ATOMIC_SLEEP selects PREEMPT_COUNT. That means: spin_lock will then be considered as in_atomic and no message comes. Without CONFIG_DEBUG_ATOMIC_SLEEP, spin_lock will not touch the preempt_count, but we also don't see a message because might_fault is now a nop.

I understand that you don't like David's changes due to other side effects that you have mentioned. So let's focus on how we can fix the debug option. Ok? Christian
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On 26.11.2014 at 16:47, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote: On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:

What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have shown this error immediately.

This must have been a very old kernel. A modern kernel will return an error from copy_to_user. Which is really the point of the patch you are trying to revert.

That's assuming you disabled preemption. If you didn't, and take a spinlock, you have deadlocks even without userspace access.

(Thanks for your resend, my first email was sent directly to you ... grml)

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway).

But please: one step back. The problem is not the good path. The problem is that we lost a debugging aid for a known-to-be-broken case. In other words: our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the 'sleeping while atomic'. That's it. See my other mail. Christian
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ as the old value
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked the spinlock tried to unlock it)
5. lock remained locked - deadlock

Christian came up with the following explanation: without preemption, spin_lock() will not touch the preempt counter, but pagefault_disable() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep.

So copy_to_user() on s390 will:
1. run as atomic while in spin_lock() with preemption enabled.
2. run as not atomic while in spin_lock() with preemption disabled.
3. run as atomic while in pagefault_disable() with preemption enabled or disabled.
4. run as not atomic when really not atomic.

I should have been clearer at that point: preemption enabled == kernel compiled with preemption support; preemption disabled == kernel compiled without preemption support.

And exactly nr. 2 is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :)

IMHO it's not copy_to_user that causes the problem. It's the misuse of spinlocks with preemption on.

As I said, preemption was off.

off - disabled at compile time? But the code is broken for people that do enable it.

So might_sleep would make you think copy_to_user is the problem, and e.g. let you paper over it by moving copy_to_user out.

Actually, implementing a different way of locking easily fixed the problem for us.
The old might_sleep() checks would have given us the problem within a few seconds (I tested it).

Or enable CONFIG_PREEMPT, with the same effect (copy_to_user will report an error). Do you check the return code from copy_to_user? If not, then you have another bug ...

Enable the lock prover and you will see what the real issue is, which is that you didn't disable preemption - and if you did, copy_to_user would be okay.

Our kernel is compiled without preemption and we turned on all lock/atomic sleep debugging aids. No problem was detected.

But your code is still buggy with preemption on, isn't it?

But the question is whether we shouldn't rather provide a copy_to_user_nosleep() implementation that can be called from pagefault_disable() because it won't sleep, and a copy_to_user_sleep() implementation that cannot be called from pagefault_disable(). Another way to fix it would be a reworked pagefault_disable() function that somehow sets a flag, so copy_to_user() knows that it is in fact called from a valid context, not just from some atomic context. So we could trigger might_sleep() when detecting a !pagefault_disable context.

I think all this is just directing people to paper over the problem. You should normally disable preemption if you take spinlocks. Yes, it might happen to work if preempt is compiled out and you don't trigger the scheduler, but Linux might add scheduler calls at any point without notice; code must be preempt-safe. Maybe add a debug option warning about spinlocks taken with preempt on. That would make sense I think.

-- MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On 26.11.2014 at 17:19, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ as the old value
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked the spinlock tried to unlock it)
5. lock remained locked - deadlock

Christian came up with the following explanation: without preemption, spin_lock() will not touch the preempt counter, but pagefault_disable() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep.

So copy_to_user() on s390 will:
1. run as atomic while in spin_lock() with preemption enabled.
2. run as not atomic while in spin_lock() with preemption disabled.
3. run as atomic while in pagefault_disable() with preemption enabled or disabled.
4. run as not atomic when really not atomic.

I should have been clearer at that point: preemption enabled == kernel compiled with preemption support; preemption disabled == kernel compiled without preemption support.

And exactly nr. 2 is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :)

IMHO it's not copy_to_user that causes the problem. It's the misuse of spinlocks with preemption on.

As I said, preemption was off.

off - disabled at compile time? But the code is broken for people that do enable it.

[...]

You should normally disable preemption if you take spinlocks.

You are telling me that any sequence of spin_lock ... spin_unlock is broken with CONFIG_PREEMPT? Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just fine.
Only sequences like spin_lock ... schedule ... spin_unlock are broken. But as I said, that is not the problem that we are discussing here. Christian
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:07:13PM +0100, Christian Borntraeger wrote: On 26.11.2014 at 16:47, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote: On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:

What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have shown this error immediately.

This must have been a very old kernel. A modern kernel will return an error from copy_to_user. Which is really the point of the patch you are trying to revert.

That's assuming you disabled preemption. If you didn't, and take a spinlock, you have deadlocks even without userspace access.

(Thanks for your resend, my first email was sent directly to you ... grml)

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway).

Are you sure? Can you point me where it does this please?

But please: one step back. The problem is not the good path. The problem is that we lost a debugging aid for a known-to-be-broken case. In other words: our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the 'sleeping while atomic'. That's it. See my other mail. Christian

You want to add more debugging tools, fine. But this one was giving users in the field false positives. The point is that *_user is safe with preempt off. It returns an error gracefully. It does not sleep. It does not trigger the scheduler in that context. David's patch makes it say it does, so it's wrong.
-- MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:30:35PM +0100, Christian Borntraeger wrote: On 26.11.2014 at 17:19, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ as the old value
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked the spinlock tried to unlock it)
5. lock remained locked - deadlock

Christian came up with the following explanation: without preemption, spin_lock() will not touch the preempt counter, but pagefault_disable() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep.

So copy_to_user() on s390 will:
1. run as atomic while in spin_lock() with preemption enabled.
2. run as not atomic while in spin_lock() with preemption disabled.
3. run as atomic while in pagefault_disable() with preemption enabled or disabled.
4. run as not atomic when really not atomic.

I should have been clearer at that point: preemption enabled == kernel compiled with preemption support; preemption disabled == kernel compiled without preemption support.

And exactly nr. 2 is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :)

IMHO it's not copy_to_user that causes the problem. It's the misuse of spinlocks with preemption on.

As I said, preemption was off.

off - disabled at compile time? But the code is broken for people that do enable it.

[...]

You should normally disable preemption if you take spinlocks.

You are telling me that any sequence of spin_lock ... spin_unlock is broken with CONFIG_PREEMPT?
Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just fine. Only sequences like spin_lock ... schedule ... spin_unlock are broken. But as I said, that is not the problem that we are discussing here. Christian

I'm saying spin_lock without _irqsave is often a bug. I am also saying this code in mm/fault.c: __do_page_fault ...

	/*
	 * If we're in an interrupt, have no user context or are running
	 * in an atomic region then we must not take the fault:
	 */
	if (unlikely(in_atomic() || !mm)) {
		bad_area_nosemaphore(regs, error_code, address);
		return;
	}

means that a fault won't cause sleep if called in atomic context. And a bunch of code relies on this. This is why might_fault does:

	 * it would be nicer only to annotate paths which are not under
	 * pagefault_disable, however that requires a larger audit and
	 * providing helpers like get_user_atomic.
	 */
	if (in_atomic())
		return;
	__might_sleep(__FILE__, __LINE__, 0);

If you see this violated, let's figure out why.

-- MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On 26.11.2014 at 17:32, Michael S. Tsirkin wrote: [...]

This is what happened on our side (very recent kernel): spin_lock(lock) copy_to_user(...) spin_unlock(lock)

That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock.

Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway).

Are you sure? Can you point me where it does this please?

spin_lock -> raw_spin_lock -> _raw_spin_lock -> __raw_spin_lock

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

Michael, please be serious. The whole kernel would be broken if spin_lock did not disable preemption.

But please: one step back. The problem is not the good path. The problem is that we lost a debugging aid for a known-to-be-broken case. In other words: our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the 'sleeping while atomic'. That's it. See my other mail. Christian

You want to add more debugging tools, fine.

We don't want to add, we want to fix something that used to work.

But this one was giving users in the field false positives.

So let's try to fix those, ok? If we can't, then tough luck. But coming up with wrong statements is not helpful.

The point is that *_user is safe with preempt off. It returns an error gracefully. It does not sleep. It does not trigger the scheduler in that context.

There are special cases where your statement is true. But it's not true in general. copy_to_user might fault, and that fault might sleep and reschedule. For example, handle_mm_fault might go down to pud_alloc, pmd_alloc etc. and all these functions could do a GFP_KERNEL allocation. Which might sleep. Which will schedule.

David's patch makes it say it does, so it's wrong.
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:

But this one was giving users in the field false positives.

So let's try to fix those, ok? If we can't, then tough luck.

Sure. I think the simplest way might be to make spinlock disable preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. As a result, userspace access will fail and the caller will get a nice error.

But coming up with wrong statements is not helpful.

True. Sorry that I did that.

The point is that *_user is safe with preempt off. It returns an error gracefully. It does not sleep. It does not trigger the scheduler in that context.

There are special cases where your statement is true. But it's not true in general. copy_to_user might fault and that fault might sleep and reschedule.

Yes. But not if called in atomic context.

For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc. and all these functions could do a GFP_KERNEL allocation. Which might sleep. Which will schedule.

David's patch makes it say it does, so it's wrong.

Absolutely. I think you can already debug your case easily, by enabling CONFIG_PREEMPT.

This seems counter-intuitive, and distro debug kernels don't seem to do this.

-- MST
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:

But this one was giving users in the field false positives.

So let's try to fix those, ok? If we can't, then tough luck.

Sure. I think the simplest way might be to make spinlock disable preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.

Specifically maybe DEBUG_ATOMIC_SLEEP should select PREEMPT_COUNT?

As a result, userspace access will fail and the caller will get a nice error.

But coming up with wrong statements is not helpful.

True. Sorry that I did that.

The point is that *_user is safe with preempt off. It returns an error gracefully. It does not sleep. It does not trigger the scheduler in that context.

There are special cases where your statement is true. But it's not true in general. copy_to_user might fault and that fault might sleep and reschedule.

Yes. But not if called in atomic context.

For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc. and all these functions could do a GFP_KERNEL allocation. Which might sleep. Which will schedule.

David's patch makes it say it does, so it's wrong.

Absolutely. I think you can already debug your case easily, by enabling CONFIG_PREEMPT.

This seems counter-intuitive, and distro debug kernels don't seem to do this.

-- MST
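The "DEBUG_ATOMIC_SLEEP should select PREEMPT_COUNT" suggestion amounts to something like the following Kconfig sketch (illustrative only - the prompt and help text here are made up, not quoted from lib/Kconfig.debug):

```
config DEBUG_ATOMIC_SLEEP
	bool "Sleep inside atomic section checking"
	select PREEMPT_COUNT
	help
	  If you say Y here, various routines which may sleep will become
	  very noisy if they are called inside atomic sections: when a
	  spinlock is held, inside an RCU read side critical section, etc.
```

With PREEMPT_COUNT selected, spin_lock() raises the preempt count even on otherwise non-preemptible kernels, so in_atomic() (and therefore the fault path's bail-out) sees the held lock.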
[PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
I used some 64 bit instructions when adding the 32 bit getcpu VDSO function. Fix it.

Fixes: 18ad51dd342a ("powerpc: Add VDSO version of getcpu")
Cc: sta...@vger.kernel.org
Signed-off-by: Anton Blanchard an...@samba.org
---
 arch/powerpc/kernel/vdso32/getcpu.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/getcpu.S b/arch/powerpc/kernel/vdso32/getcpu.S
index 23eb9a9..c62be60 100644
--- a/arch/powerpc/kernel/vdso32/getcpu.S
+++ b/arch/powerpc/kernel/vdso32/getcpu.S
@@ -30,8 +30,8 @@ V_FUNCTION_BEGIN(__kernel_getcpu)
 	.cfi_startproc
 	mfspr	r5,SPRN_SPRG_VDSO_READ
-	cmpdi	cr0,r3,0
-	cmpdi	cr1,r4,0
+	cmpwi	cr0,r3,0
+	cmpwi	cr1,r4,0
 	clrlwi	r6,r5,16
 	rlwinm	r7,r5,16,31-15,31-0
 	beq	cr0,1f
-- 
2.1.0
Re: [RFC PATCH v1 1/1] powerpc/85xx: Add support for Emerson/Artesyn MVME2500.
On Wed, 2014-11-26 at 15:17 +0100, Alessio Igor Bogani wrote:

+	board_soc: soc: soc@ffe0 {

There's no need for two labels on the same node.

+		ranges = <0x0 0 0xffe0 0x10>;
+
+		i2c@3000 {
+			hwmon@4c {
+				compatible = "adi,adt7461";
+				reg = <0x4c>;
+			};
+
+			rtc@68 {
+				compatible = "dallas,ds1337";
+				reg = <0x68>;
+				interrupts = <8 1 0 0>;
+			};
+
+			eeprom-vpd@54 {
+				compatible = "atmel,24c64";
+				reg = <0x54>;
+			};

eeprom-vpd? Node name isn't the right place to put the intended usage of the contents of the EEPROM.

+
+			eeprom@52 {
+				compatible = "atmel,24c512";
+				reg = <0x52>;
+			};
+
+			eeprom@53 {
+				compatible = "atmel,24c512";
+				reg = <0x53>;
+			};
+
+			spd@50 {
+				compatible = "atmel,24c02";
+				reg = <0x50>;
+			};

Likewise, I suspect this is also an eeprom.

+		};
+
+		spi0: spi@7000 {
+			fsl,espi-num-chipselects = <2>;
+
+			flash@0 {
+				#address-cells = <1>;
+				#size-cells = <1>;
+				compatible = "atmel,at25df641";
+				reg = <0>;
+				spi-max-frequency = <1000>;
+				partition@u-boot {
+					label = "u-boot";
+					reg = <0x 0x000A>;
+					read-only;
+				};
+				partition@dtb {
+					label = "dtb";
+					reg = <0x000A 0x0002>;
+				};

Unfortunately you seem to have copied a bad example here... After the @ should be a number that matches reg. Better yet, don't put partition information in the dts at all -- it's not hardware description. Use the mtdparts command line.

+	lbc: localbus@ffe05000 {
+		reg = <0 0xffe05000 0 0x1000>;
+
+		ranges = <0x0 0x0 0x0 0xfff0 0x0008
+			  0x1 0x0 0x0 0xffc4 0x0001
+			  0x2 0x0 0x0 0xffc5 0x0001
+			  0x3 0x0 0x0 0xffc6 0x0001
+			  0x4 0x0 0x0 0xffc7 0x0001
+			  0x6 0x0 0x0 0xffc8 0x0001
+			  0x5 0x0 0x0 0xffdf 0x1000>;

It's not possible to program the LBC with a window of only 0x1000 bytes.
+
+		serial2: serial@1,0 {
+			#cell-index = <2>;
+			device_type = "serial";
+			compatible = "ns16550";
+			reg = <0x1 0x0 0x100>;
+			clock-frequency = <1843200>;
+			interrupts = <11 2 0 0>;
+		};
+
+		serial3: serial@2,0 {
+			#cell-index = <3>;
+			device_type = "serial";
+			compatible = "ns16550";
+			reg = <0x2 0x0 0x100>;
+			clock-frequency = <1843200>;
+			interrupts = <1 2 0 0>;
+		};

Why do you need cell-index, what connection do these values have to actual hardware (e.g. values written to a register, rather than numbers in a manual), and why did the name change to #cell-index?

+	interrupts = <9 1 0 0 >;

Whitespace

+/include/ mvme2500.dtsi

Are you going to have more than one .dts using this .dtsi? If not, why separate this part?

diff --git a/arch/powerpc/boot/dts/mvme2500.dtsi b/arch/powerpc/boot/dts/mvme2500.dtsi
new file mode 100644
index 000..6966f13
--- /dev/null
+++ b/arch/powerpc/boot/dts/mvme2500.dtsi

[snip]

+/include/ fsl/pq3-mpic-message-B.dtsi
+};

Why is this being included from a board file rather than from the SoC file?

diff --git a/arch/powerpc/configs/85xx/mvme2500_defconfig b/arch/powerpc/configs/85xx/mvme2500_defconfig
new file mode 100644
index 000..06fe629
--- /dev/null
+++ b/arch/powerpc/configs/85xx/mvme2500_defconfig

Why does this board need its own defconfig? If it's just for the address space stuff, maybe it could be a more general mpc85xx_2g_1g_1g_defconfig. xes_mpc85xx_defconfig uses the same layout (though it's SMP). Maybe other boards could share it in the future, or users of existing boards might prefer it... Better still would be
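The node-naming points Scott raises can be sketched as a small device tree fragment (hedged: the offsets and sizes below are placeholders, not the board's real values) - generic node names, usage not encoded in the name, and the unit-address after `@` matching the first `reg` cell:

```
eeprom@54 {                /* generic name; "vpd" usage is not in the name */
	compatible = "atmel,24c64";
	reg = <0x54>;
};

partition@a0000 {          /* unit-address matches reg, not a label like "dtb" */
	label = "dtb";
	reg = <0xa0000 0x20000>;   /* placeholder offset/size */
};
```

The same convention applies to the `spd@50` node (also an EEPROM) and the flash partitions quoted above.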
Re: powerpc/powernv: Fix the hmi event version check.
On Wed, 2014-11-26 at 15:56 +0530, Mahesh Jagannath Salgaonkar wrote: On 11/26/2014 09:14 AM, Michael Ellerman wrote: On Thu, 2014-20-11 at 04:14:36 UTC, Mahesh Salgaonkar wrote: From: Mahesh Salgaonkar mah...@linux.vnet.ibm.com

The current HMI event structure is an ABI and carries a version field to accommodate future changes without affecting/rearranging current structure members that are valid for previous versions. The current version check if (hmi_evt->version != OpalHMIEvt_V1) seems to consider that version will always be V1, which may not be true in future. If we start supporting HMI events > V1, this check would fail without printing anything on older kernels. This patch fixes this issue.

It's not clear what you mean when you say this check would fail without printing anything. The check will fail, and it will print something, ie. the error message. What you mean is the check will fail, and the HMI info will not be printed.

My bad, yes. I meant 'HMI info will not be printed'. Do you want me to respin the patch with the correction?

No that's fine, I've fixed it up to be:

The current HMI event structure is an ABI and carries a version field to accommodate future changes without affecting/rearranging current structure members that are valid for previous versions. The current version check if (hmi_evt->version != OpalHMIEvt_V1) doesn't accommodate the fact that the version number may change in future. If firmware starts returning an HMI event with version > 1, this check will fail and no HMI information will be printed on older kernels. This patch fixes this issue.

I'll CC this to stable unless you disagree.

Yes. This patch needs to go to stable.

Yep, I've added Cc stable to the patch. cheers
Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote: I used some 64 bit instructions when adding the 32 bit getcpu VDSO function. Fix it.

Ouch. The symptom is a SIGILL I presume? Could we catch this by forcing -m32 in the CFLAGS for vdso32? cheers
Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
On Thu, Nov 27, 2014 at 09:38:17AM +1100, Michael Ellerman wrote: On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote: I used some 64 bit instructions when adding the 32 bit getcpu VDSO function. Fix it.

Ouch. The symptom is a SIGILL I presume? Could we catch this by forcing -m32 in the CFLAGS for vdso32?

GCC has added -many to the assembler flags for over ten years now, so no, that will not work. You can use -mppc or similar with the assembler if you invoke it correctly (use $(CC) -print-prog-name=as to figure out how to call the assembler, if you want to stay sane and use the same one as the rest of the toolchain you use). Segher
Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
On Wed, Nov 26, 2014 at 05:23:18PM -0600, Segher Boessenkool wrote: GCC has added -many to the assembler flags for over ten years now, so no that will not work. You can use -mppc or similar with the assembler if you invoke it correctly (use $(CC) -print-prog-name=as to figure s/correctly/directly/ out how to call the assembler, if you want to stay sane and use the same one as the rest of the toolchain you use). Segher
Re: [PATCH REPOST 3/3] powerpc/vphn: move endianness fixing to vphn_unpack_associativity()
On Mon, 2014-11-17 at 18:42 +0100, Greg Kurz wrote: The first argument to vphn_unpack_associativity() is a const long *, but the parsing code expects __be64 values actually. This is inconsistent. We should either pass a const __be64 * or change vphn_unpack_associativity() so that it fixes endianness by itself. This patch does the latter, since the caller doesn't need to know about endianness and this allows fixing only the significant 64-bit values. Please note that the previous code was able to cope with 32-bit fields being split across two consecutive 64-bit values. Since PAPR+ doesn't say this cannot happen, the behaviour was kept. It requires extra checking to know when fixing is needed though.

While I agree with moving the endian fixing down, the patch makes me nervous. Note that I don't fully understand the format of what we are parsing here so I might be wrong but ...

#define VPHN_FIELD_UNUSED	(0x)
#define VPHN_FIELD_MSB		(0x8000)
#define VPHN_FIELD_MASK		(~VPHN_FIELD_MSB)

-	for (i = 1; i < VPHN_ASSOC_BUFSIZE; i++) {
-		if (be16_to_cpup(field) == VPHN_FIELD_UNUSED)
+	for (i = 1, j = 0, k = 0; i < VPHN_ASSOC_BUFSIZE;) {
+		u16 field;
+
+		if (j % 4 == 0) {
+			fixed.packed[k] = cpu_to_be64(packed[k]);
+			k++;
+		}

So we have essentially a bunch of 16-bit fields ... the above loads and swaps a whole 4 of them at once. However that means not only do we byteswap them individually, but we also flip the order of the fields. This is ok?

+		field = be16_to_cpu(fixed.field[j]);
+
+		if (field == VPHN_FIELD_UNUSED)
			/* All significant fields processed. */
			break;

For example, we might have USED,USED,USED,UNUSED ... after the swap, we now have UNUSED,USED,USED,USED ... and we stop parsing in the above line on the first one. Or am I missing something?
-	if (be16_to_cpup(field) & VPHN_FIELD_MSB) {
+	if (field & VPHN_FIELD_MSB) {
		/* Data is in the lower 15 bits of this field */
-		unpacked[i] = cpu_to_be32(
-			be16_to_cpup(field) & VPHN_FIELD_MASK);
-		field++;
+		unpacked[i++] = cpu_to_be32(field & VPHN_FIELD_MASK);
+		j++;
	} else {
		/* Data is in the lower 15 bits of this field
		 * concatenated with the next 16 bit field
		 */
-		unpacked[i] = *((__be32 *)field);
-		field += 2;
+		if (unlikely(j % 4 == 3)) {
+			/* The next field is to be copied from the next
+			 * 64-bit input value. We must fix it now.
+			 */
+			fixed.packed[k] = cpu_to_be64(packed[k]);
+			k++;
+		}
+
+		unpacked[i++] = *((__be32 *)&fixed.field[j]);
+		j += 2;
	}
}

@@ -1460,11 +1479,8 @@ static long hcall_vphn(unsigned long cpu, __be32 *associativity)
	long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
	u64 flags = 1;
	int hwcpu = get_hard_smp_processor_id(cpu);
-	int i;

	rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, hwcpu);
-	for (i = 0; i < VPHN_REGISTER_COUNT; i++)
-		retbuf[i] = cpu_to_be64(retbuf[i]);
	vphn_unpack_associativity(retbuf, associativity);

	return rc;
Re: [PATCH v2 3/6] pseries: Create new device hotplug entry point
On Mon, 2014-11-17 at 15:51 -0600, Nathan Fontenot wrote: For PowerVM systems, this patch creates the /sys/kernel/dlpar file that rtas hotplug events can be written to by drmgr and passed to the common entry point. There is no chance of updating how we receive hotplug requests on PowerVM systems. I'm not convinced this is the right place for that file, might be worth asking Greg KH what he thinks here. Cheers, Ben.
Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
@@ -37,8 +38,7 @@ /* * Pass requested state in r3: - * 0 - nap - * 1 - sleep + * r3 - PNV_THREAD_NAP/SLEEP/WINKLE * * To check IRQ_HAPPENED in r4 * 0 - don't check @@ -123,12 +123,62 @@ power7_enter_nap_mode: li r4,KVM_HWTHREAD_IN_NAP stb r4,HSTATE_HWTHREAD_STATE(r13) #endif - cmpwi cr0,r3,1 - beq 2f + stb r3,PACA_THREAD_IDLE_STATE(r13) + cmpwi cr1,r3,PNV_THREAD_SLEEP + bge cr1,2f IDLE_STATE_ENTER_SEQ(PPC_NAP) /* No return */ -2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP) - /* No return */ +2: + /* Sleep or winkle */ + li r7,1 + mfspr r8,SPRN_PIR + /* + * The last 3 bits of PIR represent the thread id of a cpu + * in power8. This will need adjusting for power7. + */ + andi. r8,r8,0x07 /* Get thread id into r8 */ + rotld r7,r7,r8 + + ld r14,PACA_CORE_IDLE_STATE_PTR(r13) I assume we have already saved all non-volatile registers ? Because you are clobbering one here and more below. +lwarx_loop1: + lwarx r15,0,r14 + andc r15,r15,r7 /* Clear thread bit */ + + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS + beq last_thread + + /* Not the last thread to goto sleep */ + stwcx. r15,0,r14 + bne- lwarx_loop1 + b common_enter + +last_thread: + LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) + lbz r3,0(r3) + cmpwi r3,1 + bne common_enter This looks wrong. If the workaround is 0, we don't do the stwcx. at all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It should work most of the time as long as you don't hit the fairly rare race window :) Also it would be nice to make the above a dynamically patched feature section, though that means pnv_need_fastsleep_workaround needs to turn into a CPU feature bit and that needs to be done *very* early on. Another option is to patch out manually from the pnv code the pair: andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS beq last_thread To turn them into nops by hand rather than using the feature system. + /* + * Last thread of the core entering sleep. Last thread needs to execute + * the hardware bug workaround code.
Before that, set the lock bit to + avoid the race of other threads waking up and undoing workaround + before workaround is applied. + */ + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT + stwcx. r15,0,r14 + bne- lwarx_loop1 + + /* Fast sleep workaround */ + li r3,1 + li r4,1 + li r0,OPAL_CONFIG_CPU_IDLE_STATE + bl opal_call_realmode + + /* Clear Lock bit */ It's a lock, I would add a lwsync here to be safe, and I would add an isync before the bne- above. Just to ensure that whatever is done inside that locked section remains in there. + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS + stw r15,0(r14) + +common_enter: /* common code for all the threads entering sleep */ + IDLE_STATE_ENTER_SEQ(PPC_SLEEP) _GLOBAL(power7_idle) /* Now check if user or arch enabled NAP mode */ @@ -141,49 +191,16 @@ _GLOBAL(power7_idle) _GLOBAL(power7_nap) mr r4,r3 - li r3,0 + li r3,PNV_THREAD_NAP b power7_powersave_common /* No return */ _GLOBAL(power7_sleep) - li r3,1 + li r3,PNV_THREAD_SLEEP li r4,1 b power7_powersave_common /* No return */ -/* - * Make opal call in realmode. This is a generic function to be called - * from realmode from reset vector. It handles endianness. - * - * r13 - paca pointer - * r1 - stack pointer - * r3 - opal token - */ -opal_call_realmode: - mflr r12 - std r12,_LINK(r1) - ld r2,PACATOC(r13) - /* Set opal return address */ - LOAD_REG_ADDR(r0,return_from_opal_call) - mtlr r0 - /* Handle endian-ness */ - li r0,MSR_LE - mfmsr r12 - andc r12,r12,r0 - mtspr SPRN_HSRR1,r12 - mr r0,r3 /* Move opal token to r0 */ - LOAD_REG_ADDR(r11,opal) - ld r12,8(r11) - ld r2,0(r11) - mtspr SPRN_HSRR0,r12 - hrfid - -return_from_opal_call: - FIXUP_ENDIAN - ld r0,_LINK(r1) - mtlr r0 - blr - #define CHECK_HMI_INTERRUPT \ mfspr r0,SPRN_SRR1; \ BEGIN_FTR_SECTION_NESTED(66); \ @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \ /* Invoke opal call to handle hmi */\ ld r2,PACATOC(r13);\ ld r1,PACAR1(r13);
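The lwarx/stwcx. protocol under review can be summarised in C. The sketch below is illustrative only — names, bit positions and the 8-thread count are assumptions, not the kernel's actual layout: each thread atomically clears its bit in a per-core word, and the thread that clears the last bit takes the lock bit in the same atomic update, which is exactly the race the patch comment describes.

```c
#include <assert.h>
#include <stdatomic.h>

#define IDLE_THREAD_BITS 0xffu  /* assumed: one bit per thread, 8 threads */
#define IDLE_LOCK_BIT    0x100u /* assumed lock bit position */

/* Clear our thread bit; if we are the last thread of the core to go
 * idle, also take the lock bit in the same atomic update.
 * Returns 1 when this caller must run the fastsleep workaround. */
static int core_enter_idle(atomic_uint *state, unsigned int thread_bit)
{
	unsigned int old = atomic_load(state), new;

	do {
		new = old & ~thread_bit;
		if ((new & IDLE_THREAD_BITS) == 0)
			new |= IDLE_LOCK_BIT; /* last thread: lock out wakers */
	} while (!atomic_compare_exchange_weak(state, &old, new));

	return (new & IDLE_THREAD_BITS) == 0;
}
```

Note that the store happens on every path — the bug Ben points out in the asm is precisely that the `bne common_enter` branch skipped the stwcx. when the workaround flag was clear.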
Re: [PATCH 3/3] KVM: PPC: BOOK3S: HV: Rename variable for better readability
On Mon, Oct 20, 2014 at 07:59:00PM +0530, Aneesh Kumar K.V wrote: Minor cleanup Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 78e689b066f1..2922f8d127ff 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -523,7 +523,7 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu) unsigned long *args = &vcpu->arch.gpr[4]; __be64 *hp, *hptes[4]; unsigned long tlbrb[4]; - long int i, j, k, n, found, indexes[4]; + long int i, j, k, collected_hpte, found, indexes[4]; Hmmm... I don't find it more readable, because collected_hpte sounds like it contains a HPTE value. Also I don't like using a long name for something that is just a temporary value inside a function, and n is a suitable name for a temporary variable counting the number of things we have. I would prefer just adding a comment like this: - n = 0; + n = 0; /* # values collected in tlbrb[], indexes[] etc. */ Paul.
Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus
On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote: diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S index 4673353..66874aa 100644 --- a/arch/powerpc/kernel/cpu_setup_power.S +++ b/arch/powerpc/kernel/cpu_setup_power.S @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8) beqlr li r0,0 mtspr SPRN_LPID,r0 + mtspr SPRN_WORT,r0 + mtspr SPRN_WORC,r0 mfspr r3,SPRN_LPCR ori r3, r3, LPCR_PECEDH bl __init_LPCR @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8) li r0,0 mtspr SPRN_LPID,r0 mfspr r3,SPRN_LPCR + mtspr SPRN_WORT,r0 + mtspr SPRN_WORC,r0 ori r3, r3, LPCR_PECEDH bl __init_LPCR bl __init_HFSCR Clearing WORT and WORC might not be the best thing. We know the HW folks have been trying to tune those values and we might need to preserve what the boot FW has set. Can you get in touch with them and double check what we should do here ? diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 3311c8d..c9897cb 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION cmpwi cr1,r13,2 + /* Check if last bit of HSPGR0 is set. This indicates whether we are + * waking up from winkle */ + li r3,1 + mfspr r4,SPRN_HSPRG0 + and r5,r4,r3 + cmpwi cr4,r5,1 /* Store result in cr4 for later use */ + + andc r4,r4,r3 + mtspr SPRN_HSPRG0,r4 + There is an open question here whether adding a beq cr4,+8 after the cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either measure or talk to HW folks). Also we could write directly to r13...
GET_PACA(r13) lbz r0,PACA_THREAD_IDLE_STATE(r13) cmpwi cr2,r0,PNV_THREAD_NAP diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index c1d590f..78c30b0 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -19,8 +19,22 @@ #include <asm/kvm_book3s_asm.h> #include <asm/opal.h> #include <asm/cpuidle.h> +#include <asm/mmu-hash64.h> #undef DEBUG +/* + * Use unused space in the interrupt stack to save and restore + * registers for winkle support. + */ +#define _SDR1 GPR3 +#define _RPR GPR4 +#define _SPURR GPR5 +#define _PURR GPR6 +#define _TSCR GPR7 +#define _DSCR GPR8 +#define _AMOR GPR9 +#define _PMC5 GPR10 +#define _PMC6 GPR11 WORT/WORC need saving/restoring /* Idle state entry routines */ @@ -153,32 +167,60 @@ lwarx_loop1: b common_enter last_thread: - LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) - lbz r3,0(r3) - cmpwi r3,1 - bne common_enter /* * Last thread of the core entering sleep. Last thread needs to execute * the hardware bug workaround code. Before that, set the lock bit to * avoid the race of other threads waking up and undoing workaround * before workaround is applied. */ + LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) + lbz r3,0(r3) + cmpwi r3,1 + bne common_enter + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 bne- lwarx_loop1 /* Fast sleep workaround */ + mfcr r16 /* Backup CR to a non-volatile register */ li r3,1 li r4,1 li r0,OPAL_CONFIG_CPU_IDLE_STATE bl opal_call_realmode + mtcr r16 /* Restore CR */ Why isn't the above already in the previous patch ? Also see my comment about using a non-volatile CR instead. /* Clear Lock bit */ andi.
r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) -common_enter: /* common code for all the threads entering sleep */ +common_enter: /* common code for all the threads entering sleep or winkle*/ + bgt cr1,enter_winkle IDLE_STATE_ENTER_SEQ(PPC_SLEEP) +enter_winkle: + /* + * Note all register i.e per-core, per-subcore or per-thread is saved + * here since any thread in the core might wake up first + */ + mfspr r3,SPRN_SDR1 + std r3,_SDR1(r1) + mfspr r3,SPRN_RPR + std r3,_RPR(r1) + mfspr r3,SPRN_SPURR + std r3,_SPURR(r1) + mfspr r3,SPRN_PURR + std r3,_PURR(r1) + mfspr r3,SPRN_TSCR + std r3,_TSCR(r1) + mfspr r3,SPRN_DSCR + std r3,_DSCR(r1) + mfspr r3,SPRN_AMOR + std r3,_AMOR(r1) + mfspr r3,SPRN_PMC5 + std r3,_PMC5(r1) + mfspr r3,SPRN_PMC6 + std r3,_PMC6(r1) + IDLE_STATE_ENTER_SEQ(PPC_WINKLE) _GLOBAL(power7_idle) /* Now check if user or arch enabled NAP mode */ @@ -201,6 +243,12 @@
Right location in sysfs for dlpar file
Hi Greg, So Nathan is working on a patch series to cleanup and improve our DLPAR infrastructure which is basically our hotplug mechanism when running under the PowerVM (aka pHyp) and KVM hypervisors. I'll let Nathan give you a bit more details/background and answer subsequent question you might have as this is really his area of expertise. To cut a long story short, we need a sysfs file that allows our userspace tools to notify the kernel of hotplug events coming from the management console (which talks to userspace daemons using a proprietary protocol) to initiate the hotplug operations, which in turn get dispatched internally in the kernel to the right subsystem (memory, cpu, pci, ...) based on the resource type. On IRC, Greg suggested /sys/firmware and /sys/hypervisor which both look like a reasonable option to me, probably better than dlpar... Cheers, Ben.
RE: [PATCH v2 0/3] fix a kernel panic on fsl corenet board when CONFIG_CLK_PPC_CORENET is enabled
Hello Mike, Could you please apply this patch? This patch has been acked for a while. Thanks, Yuantian -Original Message- From: Linuxppc-dev [mailto:linuxppc-dev-bounces+b29983=freescale@lists.ozlabs.org] On Behalf Of Scott Wood Sent: Friday, November 07, 2014 12:08 PM To: Kevin Hao Cc: Mike Turquette; Gerhard Sittig; Lu Jingchang-B35083; linuxppc-dev@lists.ozlabs.org Subject: Re: [PATCH v2 0/3] fix a kernel panic on fsl corenet board when CONFIG_CLK_PPC_CORENET is enabled On Sun, 2014-10-19 at 14:11 +0800, Kevin Hao wrote: Hi, I have done a boot test on p2014rdb and t4240qds boards. I don't have an access to mpc512x board, so only build test for that. v2: - Revert the commit da788acb2838 first. - Invoke of_clk_init() from a common place. v1 This tries to fix a kernel panic introduced by commit da788acb2838 (clk: ppc-corenet: Fix Section mismatch warning). Kevin Hao (3): Revert clk: ppc-corenet: Fix Section mismatch warning powerpc: call of_clk_init() from time_init() clk: ppc-corenet: fix section mismatch warning arch/powerpc/kernel/time.c| 5 arch/powerpc/platforms/512x/clock-commonclk.c | 11 --- drivers/clk/clk-ppc-corenet.c | 43 --- 3 files changed, 16 insertions(+), 43 deletions(-) Acked-by: Scott Wood scottw...@freescale.com Whose tree should this go through? -Scott
Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
On Thu, 2014-11-27 at 09:38 +1100, Michael Ellerman wrote: On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote: I used some 64 bit instructions when adding the 32 bit getcpu VDSO function. Fix it. Ouch. The symptom is a SIGILL I presume? Nope, you don't get a SIGILL when executing 64-bit instructions in 32-bit mode, so it'll happily just execute the instruction, doing a full 64-bit compare. I'm guessing that the upper 32-bits of both r3 and r4 contain zeros, so we're probably just getting lucky. Could we catch this by forcing -m32 in the CFLAGS for vdso32 ? As Segher mentioned, GCC passing -many down to the assembler means -m32 won't help. It was due to Anton disabling that gcc feature, that this was caught. Peter
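One way to see what "getting lucky" means here: a 32-bit unsigned compare (cmplw) only looks at the low halves of the registers, while the 64-bit compare (cmpld) that the VDSO accidentally used looks at all 64 bits. The sketch below models the two with hypothetical helper names (this is not kernel code); they agree only while the upper 32 bits happen to be zero:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the 32-bit comparison the 32-bit VDSO should perform (cmplw):
 * only the low 32 bits of each register participate. */
static int cmplw_lt(uint64_t ra, uint64_t rb)
{
	return (uint32_t)ra < (uint32_t)rb;
}

/* Model of the 64-bit comparison that was emitted by mistake (cmpld). */
static int cmpld_lt(uint64_t ra, uint64_t rb)
{
	return ra < rb;
}
```

With zeroed upper halves — apparently the case in practice for r3/r4 here — both return the same answer, which is why the bug went unnoticed; stale upper bits would flip the result.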
[git pull] Please pull mpe.git for-linus branch (for powerpc)
Hi Linus, Here are five fixes for you to pull please. I think these are all rc6 material, but I'm still learning so let me know if you disagree :) They're all CC'ed to stable except the Fix PE state format one which went in this release. cheers The following changes since commit d7ce4377494adfaf8afb15ecf4f07d399bbf13d9: powerpc/fsl_msi: mark the msi cascade handler IRQF_NO_THREAD (2014-11-17 22:00:30 -0600) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux.git for-linus for you to fetch changes up to 152d44a853e42952f6c8a504fb1f8eefd21fd5fd: powerpc: 32 bit getcpu VDSO function uses 64 bit instructions (2014-11-27 09:42:12 +1100) Anton Blanchard (1): powerpc: 32 bit getcpu VDSO function uses 64 bit instructions Gavin Shan (2): powerpc/eeh: Fix PE state format powerpc/powernv: Replace OPAL_DEASSERT_RESET with EEH_RESET_DEACTIVATE Laurent Dufour (1): powerpc/pseries: Fix endiannes issue in RTAS call from xmon Mahesh Salgaonkar (1): powerpc/powernv: Fix the hmi event version check. arch/powerpc/kernel/eeh_sysfs.c | 2 +- arch/powerpc/kernel/vdso32/getcpu.S | 4 ++-- arch/powerpc/platforms/powernv/opal-hmi.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 2 +- arch/powerpc/xmon/xmon.c | 6 +++--- 5 files changed, 8 insertions(+), 8 deletions(-)
Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management
Hi Ben, On Thursday 27 November 2014 06:07 AM, Benjamin Herrenschmidt wrote: @@ -37,8 +38,7 @@ /* * Pass requested state in r3: - * 0 - nap - * 1 - sleep + * r3 - PNV_THREAD_NAP/SLEEP/WINKLE * * To check IRQ_HAPPENED in r4 * 0 - don't check @@ -123,12 +123,62 @@ power7_enter_nap_mode: li r4,KVM_HWTHREAD_IN_NAP stb r4,HSTATE_HWTHREAD_STATE(r13) #endif -cmpwi cr0,r3,1 -beq 2f +stb r3,PACA_THREAD_IDLE_STATE(r13) +cmpwi cr1,r3,PNV_THREAD_SLEEP +bge cr1,2f IDLE_STATE_ENTER_SEQ(PPC_NAP) /* No return */ -2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP) -/* No return */ +2: +/* Sleep or winkle */ +li r7,1 +mfspr r8,SPRN_PIR +/* + * The last 3 bits of PIR represents the thread id of a cpu + * in power8. This will need adjusting for power7. + */ +andi. r8,r8,0x07 /* Get thread id into r8 */ +rotld r7,r7,r8 + +ld r14,PACA_CORE_IDLE_STATE_PTR(r13) I assume we have already saved all non-volatile registers ? Because you are clobbering one here and more below. Yes. At this stage the all non-volatile registers are already saved in stack. +lwarx_loop1: +lwarx r15,0,r14 +andcr15,r15,r7 /* Clear thread bit */ + +andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS +beq last_thread + +/* Not the last thread to goto sleep */ +stwcx. r15,0,r14 +bne-lwarx_loop1 +b common_enter + +last_thread: +LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) +lbz r3,0(r3) +cmpwi r3,1 +bne common_enter This looks wrong. If the workaround is 0, we don't do the stwcx. at all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It should work most of the time as long as you don't hit the fairly rare race window :) My bad. I missed the stwcx. in the pnv_need_fastsleep_workaround = 0 path. Also it would be nice to make the above a dynamically patches feature section, though that means pnv_need_fastsleep_workaround needs to turn into a CPU feature bit and that needs to be done *very* early on. Another option is to patch out manually from the pnv code the pair: andi. 
r15,r15,PNV_CORE_IDLE_THREAD_BITS beq last_thread To turn them into nops by hand rather than using the feature system. Okay. I'll see which works out best here. +/* + * Last thread of the core entering sleep. Last thread needs to execute + * the hardware bug workaround code. Before that, set the lock bit to + * avoid the race of other threads waking up and undoing workaround + * before workaround is applied. + */ +ori r15,r15,PNV_CORE_IDLE_LOCK_BIT +stwcx. r15,0,r14 +bne-lwarx_loop1 + +/* Fast sleep workaround */ +li r3,1 +li r4,1 +li r0,OPAL_CONFIG_CPU_IDLE_STATE +bl opal_call_realmode + +/* Clear Lock bit */ It's a lock, I would add a lwsync here to be safe, and I would add an isync before the bne- above. Just to ensure that whatever is done inside that locked section remains in there. Okay. Will add it. +andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS +stw r15,0(r14) + +common_enter: /* common code for all the threads entering sleep */ +IDLE_STATE_ENTER_SEQ(PPC_SLEEP) _GLOBAL(power7_idle) /* Now check if user or arch enabled NAP mode */ @@ -141,49 +191,16 @@ _GLOBAL(power7_idle) _GLOBAL(power7_nap) mr r4,r3 -li r3,0 +li r3,PNV_THREAD_NAP b power7_powersave_common /* No return */ _GLOBAL(power7_sleep) -li r3,1 +li r3,PNV_THREAD_SLEEP li r4,1 b power7_powersave_common /* No return */ -/* - * Make opal call in realmode. This is a generic function to be called - * from realmode from reset vector. It handles endianess. 
- * - * r13 - paca pointer - * r1 - stack pointer - * r3 - opal token - */ -opal_call_realmode: -mflrr12 -std r12,_LINK(r1) -ld r2,PACATOC(r13) -/* Set opal return address */ -LOAD_REG_ADDR(r0,return_from_opal_call) -mtlrr0 -/* Handle endian-ness */ -li r0,MSR_LE -mfmsr r12 -andcr12,r12,r0 -mtspr SPRN_HSRR1,r12 -mr r0,r3 /* Move opal token to r0 */ -LOAD_REG_ADDR(r11,opal) -ld r12,8(r11) -ld r2,0(r11) -mtspr SPRN_HSRR0,r12 -hrfid - -return_from_opal_call: -FIXUP_ENDIAN -ld r0,_LINK(r1) -mtlrr0 -blr - #define CHECK_HMI_INTERRUPT \ mfspr r0,SPRN_SRR1; \ BEGIN_FTR_SECTION_NESTED(66); \ @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S,
Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus
Hi Ben, On Thursday 27 November 2014 07:25 AM, Benjamin Herrenschmidt wrote: On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote: diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S index 4673353..66874aa 100644 --- a/arch/powerpc/kernel/cpu_setup_power.S +++ b/arch/powerpc/kernel/cpu_setup_power.S @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8) beqlr li r0,0 mtspr SPRN_LPID,r0 +mtspr SPRN_WORT,r0 +mtspr SPRN_WORC,r0 mfspr r3,SPRN_LPCR ori r3, r3, LPCR_PECEDH bl __init_LPCR @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8) li r0,0 mtspr SPRN_LPID,r0 mfspr r3,SPRN_LPCR +mtspr SPRN_WORT,r0 +mtspr SPRN_WORC,r0 ori r3, r3, LPCR_PECEDH bl __init_LPCR bl __init_HFSCR Clearing WORT and WORC might not be the best thing. We know the HW folks have been trying to tune those values and we might need to preserve what the boot FW has set. Can you get in touch with them and double check what we should do here ? I observed these were always 0. I'll speak to HW folks as you suggested. diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 3311c8d..c9897cb 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION cmpwi cr1,r13,2 +/* Check if last bit of HSPGR0 is set. This indicates whether we are + * waking up from winkle */ +li r3,1 +mfspr r4,SPRN_HSPRG0 +and r5,r4,r3 +cmpwi cr4,r5,1/* Store result in cr4 for later use */ + +andcr4,r4,r3 +mtspr SPRN_HSPRG0,r4 + There is an open question here whether adding a beq cr4,+8 after the cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either measure or talk to HW folks). Okay. This because mtspr is heavier op than beq? Also we could write directly to r13... You mean use mr r13,r4 instead or GET_PACA? 
GET_PACA(r13) lbz r0,PACA_THREAD_IDLE_STATE(r13) cmpwi cr2,r0,PNV_THREAD_NAP diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index c1d590f..78c30b0 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -19,8 +19,22 @@ #include asm/kvm_book3s_asm.h #include asm/opal.h #include asm/cpuidle.h +#include asm/mmu-hash64.h #undef DEBUG +/* + * Use unused space in the interrupt stack to save and restore + * registers for winkle support. + */ +#define _SDR1 GPR3 +#define _RPRGPR4 +#define _SPURR GPR5 +#define _PURR GPR6 +#define _TSCR GPR7 +#define _DSCR GPR8 +#define _AMOR GPR9 +#define _PMC5 GPR10 +#define _PMC6 GPR11 WORT/WORTC need saving restoring The reason I skipped this was because these were always 0. But since its set by FW, I'll save and restore them. /* Idle state entry routines */ @@ -153,32 +167,60 @@ lwarx_loop1: b common_enter last_thread: -LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) -lbz r3,0(r3) -cmpwi r3,1 -bne common_enter /* * Last thread of the core entering sleep. Last thread needs to execute * the hardware bug workaround code. Before that, set the lock bit to * avoid the race of other threads waking up and undoing workaround * before workaround is applied. */ +LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround) +lbz r3,0(r3) +cmpwi r3,1 +bne common_enter + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT stwcx. r15,0,r14 bne-lwarx_loop1 /* Fast sleep workaround */ +mfcrr16 /* Backup CR to a non-volatile register */ li r3,1 li r4,1 li r0,OPAL_CONFIG_CPU_IDLE_STATE bl opal_call_realmode +mtcrr16 /* Restore CR */ Why isn't the above already in the previous patch ? Also see my comment about using a non-volatile CR instead. In the previous patch I wasn't using any CR after this OPAL call. Hence I had skipped it. As you suggested I'll avoid this by using CR[234]. /* Clear Lock bit */ andi. 
r15,r15,PNV_CORE_IDLE_THREAD_BITS stw r15,0(r14) -common_enter: /* common code for all the threads entering sleep */ +common_enter: /* common code for all the threads entering sleep or winkle*/ +bgt cr1,enter_winkle IDLE_STATE_ENTER_SEQ(PPC_SLEEP) +enter_winkle: +/* + * Note all register i.e per-core, per-subcore or per-thread is saved + * here since any thread in the core might wake up first + */ +mfspr r3,SPRN_SDR1 +std r3,_SDR1(r1) +mfspr r3,SPRN_RPR +std r3,_RPR(r1) +mfspr r3,SPRN_SPURR +std r3,_SPURR(r1) +mfspr r3,SPRN_PURR +
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: But this one was giving users in field false positives. So lets try to fix those, ok? If we cant, then tough luck. Sure. I think the simplest way might be to make spinlock disable preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. As a result, userspace access will fail and caller will get a nice error. Yes, _userspace_ now sees unpredictable behaviour, instead of that the kernel emits a big loud warning to the console. Please consider this simple example: int bar(char __user *ptr) { ... if (copy_to_user(ptr, ...)) return -EFAULT; ... } SYSCALL_DEFINE1(foo, char __user *, ptr) { int rc; ... rc = bar(ptr); if (rc) goto out; ... out: return rc; } The above simple system call just works fine, with and without your change, however if somebody (incorrectly) changes sys_foo() to the code below: spin_lock(lock); rc = bar(ptr); if (rc) goto out; out: spin_unlock(lock); return rc; Broken code like above used to generate warnings. With your change we won't see any warnings anymore. Instead we get random and bad behaviour: For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see a fault, potentially schedule and potentially deadlock on lock. Without _any_ warning anymore. For CONFIG_PREEMPT if the page at ptr is mapped, everything works. However if the page is not mapped, userspace now all of the sudden will see an invalid(!) -EFAULT return code, instead of that the kernel resolved the page fault. Yes, the kernel can't resolve the fault since we hold a spinlock. But the above bogus code did give warnings to give you an idea that something probably is not correct. Who on earth is supposed to debug crap like this??? What we really want is: Code like spin_lock(lock); if (copy_to_user(...)) rc = ... spin_unlock(lock); really *should* generate warnings like it did before.
And *only* code like spin_lock(lock); page_fault_disable(); if (copy_to_user(...)) rc = ... page_fault_enable(); spin_unlock(lock); should not generate warnings, since the author hopefully knew what he did. We could achieve that by e.g. adding a couple of pagefault disabled bits within current_thread_info()->preempt_count, which would allow pagefault_disable() and pagefault_enable() to modify a different part of preempt_count than it does now, so there is a way to tell if pagefaults have been explicitly disabled or are just a side effect of preemption being disabled. This would allow might_fault() to restore its old sane behaviour for the !page_fault_disabled() case.
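Heiko's proposal — dedicating separate preempt_count bits to explicit pagefault_disable() nesting — can be sketched as below. The bit layout and helper names are made up for illustration and are not the kernel's actual implementation:

```c
#include <assert.h>

#define PREEMPT_MASK     0x0000ffffu /* assumed: preemption nesting depth  */
#define PF_DISABLE_SHIFT 16
#define PF_DISABLE_ONE   (1u << PF_DISABLE_SHIFT)
#define PF_DISABLE_MASK  0x00ff0000u /* assumed: pagefault_disable nesting */

static unsigned int preempt_count; /* per-thread in a real kernel */

static void preempt_disable(void)   { preempt_count++; } /* e.g. spin_lock() */
static void preempt_enable(void)    { preempt_count--; }
static void pagefault_disable(void) { preempt_count += PF_DISABLE_ONE; }
static void pagefault_enable(void)  { preempt_count -= PF_DISABLE_ONE; }

/* might_fault() should warn when we are atomic *without* having called
 * pagefault_disable() explicitly -- the broken-spinlock case above. */
static int might_fault_would_warn(void)
{
	return (preempt_count & PREEMPT_MASK) != 0 &&
	       (preempt_count & PF_DISABLE_MASK) == 0;
}
```

With this split, the spin_lock()+copy_to_user() bug in the example above would warn again, while the deliberate pagefault_disable() pattern stays silent.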
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Thu, Nov 27, 2014 at 08:09:19AM +0100, Heiko Carstens wrote: On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: But this one was giving users in field false positives. So lets try to fix those, ok? If we cant, then tough luck. Sure. I think the simplest way might be to make spinlock disable premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. As a result, userspace access will fail and caller will get a nice error. Yes, _userspace_ now sees unpredictable behaviour, instead of that the kernel emits a big loud warning to the console. So I don't object to adding more debugging at all. Sure, would be nice. But the fix is not an unconditional might_sleep within might_fault, this would trigger false positives. Rather, detect that you took a spinlock without disabling preemption. Please consider this simple example: int bar(char __user *ptr) { ... if (copy_to_user(ptr, ...) return -EFAULT; ... } SYSCALL_DEFINE1(foo, char __user *, ptr) { int rc; ... rc = bar(ptr); if (rc) goto out; ... out: return rc; } The above simple system call just works fine, with and without your change, however if somebody (incorrectly) changes sys_foo() to the code below: spin_lock(lock); rc = bar(ptr); if (rc) goto out; out: spin_unlock(lock); return rc; Broken code like above used to generate warnings. With your change we won't see any warnings anymore. Instead we get random and bad behaviour: For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see a fault, potentially schedule and potentially deadlock on lock. Without _any_ warning anymore. For CONFIG_PREEMPT if the page at ptr is mapped, everthing works. However if the page is not mapped, userspace now all of the sudden will see an invalid(!) -EFAULT return code, instead of that the kernel resolved the page fault. Yes, the kernel can't resolve the fault since we hold a spinlock. 
But the above bogus code did give warnings to give you an idea that something probably is not correct. Who on earth is supposed to debug crap like this??? What we really want is: Code like spin_lock(lock); if (copy_to_user(...)) rc = ... spin_unlock(lock); really *should* generate warnings like it did before. And *only* code like spin_lock(lock); page_fault_disable(); if (copy_to_user(...)) rc = ... page_fault_enable(); spin_unlock(lock); should not generate warnings, since the author hopefully knew what he did. We could achieve that by e.g. adding a couple of pagefault disabled bits within current_thread_info()->preempt_count, which would allow pagefault_disable() and pagefault_enable() to modify a different part of preempt_count than it does now, so there is a way to tell if pagefaults have been explicitly disabled or are just a side effect of preemption being disabled. This would allow might_fault() to restore its old sane behaviour for the !page_fault_disabled() case. Exactly. I agree, that would be a useful debugging tool. In fact this comment in mm/memory.c hints at this: * it would be nicer only to annotate paths which are not under * pagefault_disable, it further says * however that requires a larger audit and * providing helpers like get_user_atomic. but I think that what you outline is a better way to do this. -- MST