Re: pseries/le: Fix endiannes issue in RTAS call from xmon

2014-11-26 Thread Laurent Dufour
On 26/11/2014 04:31, Michael Ellerman wrote:
 OK. You say LPAR, by which you mean under phyp I think. I haven't seen 
 this
 under KVM, and it looks like KVM doesn't implement set-indicator so that
 would explain that.

Yes, LPAR implies phyp, and KVM doesn't implement set-indicator, so this
doesn't happen in that case.

 I'll take this as a bug fix and CC it to stable.

That's a good point.

Thanks,
Laurent.


Re: [RESEND, V3] powerpc, xmon: Enable HW instruction breakpoint on POWER8

2014-11-26 Thread Michael Ellerman
On Tue, 2014-25-11 at 10:08:48 UTC, Anshuman Khandual wrote:
 This patch enables support for hardware instruction breakpoints
 on POWER8 with the help of a new register CIABR (Completed
 Instruction Address Breakpoint Register). With this patch, a single
 hardware instruction breakpoint can be added and cleared during
 any active xmon debug session. This hardware-based instruction
 breakpoint mechanism works correctly alongside the existing TRAP-based
 instruction breakpoints available in xmon.


Hi Anshuman,

 diff --git a/arch/powerpc/include/asm/xmon.h b/arch/powerpc/include/asm/xmon.h
 index 5eb8e59..5d17aec 100644
 --- a/arch/powerpc/include/asm/xmon.h
 +++ b/arch/powerpc/include/asm/xmon.h
 @@ -29,5 +29,11 @@ static inline void xmon_register_spus(struct list_head *list) { };
  extern int cpus_are_in_xmon(void);
  #endif

This file is the exported interface *of xmon*, it's not the place to put things
that xmon needs internally.

For now just put it in xmon.c

 +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_SPLPAR)
 +#include <asm/plpar_wrappers.h>
 +#else
 +static inline long plapr_set_ciabr(unsigned long ciabr) {return 0; };
 +#endif

Also the ifdef is overly verbose, CONFIG_PPC_SPLPAR essentially depends on
CONFIG_PPC_BOOK3S_64. So you can just use #ifdef CONFIG_PPC_SPLPAR.

 diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
 index b988b5a..c2f601a 100644
 --- a/arch/powerpc/xmon/xmon.c
 +++ b/arch/powerpc/xmon/xmon.c
 @@ -271,6 +273,55 @@ static inline void cinval(void *p)
  }
  
  /*
 + * write_ciabr
 + *
 + * This function writes a value to the
 + CIABR register either directly through
 + * mtspr instruction if the kernel is in HV
 + * privilege mode or call a hypervisor function
 + * to achieve the same in case the kernel is in
 + * supervisor privilege mode.
 + */

I'm not really sure a function this small needs a documentation block.

But if you're going to add one, PLEASE make sure it's an actual kernel-doc
style comment.

You can check with:

$ ./scripts/kernel-doc -text arch/powerpc/xmon/xmon.c

Which you'll notice prints:

Warning(arch/powerpc/xmon/xmon.c): no structured comments found

You need something like:

/**
 * write_ciabr() - write the CIABR SPR
 * @ciabr: The value to write.
 *
 * This function writes a value to the CIABR register either directly through
 * mtspr instruction if the kernel is in HV privilege mode or calls a
 * hypervisor function to achieve the same in case the kernel is in supervisor
 * privilege mode.
 */



The rest of the patch is OK. But I was hoping you'd notice that we no longer
support any cpus that implement CPU_FTR_IABR. And so you can just repurpose all
the iabr logic for ciabr.

Something like this, untested:


diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index b988b5addf86..6894650bff7f 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -51,6 +51,12 @@
 #include <asm/paca.h>
 #endif
 
+#ifdef CONFIG_PPC_SPLPAR
+#include <asm/plpar_wrappers.h>
+#else
+static inline long plapr_set_ciabr(unsigned long ciabr) { return 0; };
+#endif
+
 #include "nonstdio.h"
 #include "dis-asm.h"
 
@@ -270,6 +276,31 @@ static inline void cinval(void *p)
	asm volatile ("dcbi 0,%0; icbi 0,%0" : : "r" (p));
 }
 
+static void write_ciabr(unsigned long ciabr)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_207S))
+   return;
+
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   mtspr(SPRN_CIABR, ciabr);
+   return;
+   }
+
+   plapr_set_ciabr(ciabr);
+}
+
+static void set_ciabr(unsigned long addr)
+{
+   addr &= ~CIABR_PRIV;
+
+   if (cpu_has_feature(CPU_FTR_HVMODE))
+   addr |= CIABR_PRIV_HYPER;
+   else
+   addr |= CIABR_PRIV_SUPER;
+
+   write_ciabr(addr);
+}
+
 /*
  * Disable surveillance (the service processor watchdog function)
  * while we are in xmon.
@@ -764,9 +795,9 @@ static void insert_cpu_bpts(void)
brk.len = 8;
 __set_breakpoint(&brk);
}
-   if (iabr && cpu_has_feature(CPU_FTR_IABR))
-   mtspr(SPRN_IABR, iabr->address
-| (iabr->enabled & (BP_IABR|BP_IABR_TE)));
+
+   if (iabr)
+   set_ciabr(iabr->address);
 }
 
 static void remove_bpts(void)
@@ -792,8 +823,7 @@ static void remove_bpts(void)
 static void remove_cpu_bpts(void)
 {
hw_breakpoint_disable();
-   if (cpu_has_feature(CPU_FTR_IABR))
-   mtspr(SPRN_IABR, 0);
+   write_ciabr(0);
 }
 
 /* Command interpreting routine */
@@ -1127,7 +1157,7 @@ static char *breakpoint_help_string =
 "b <addr> [cnt]   set breakpoint at given instr addr\n"
 "bc               clear all breakpoints\n"
 "bc <n/addr>      clear breakpoint number n or at addr\n"
-"bi <addr> [cnt]  set hardware instr breakpoint (POWER3/RS64 only)\n"
+"bi <addr> [cnt]  set hardware instr breakpoint (POWER8 only)\n"
 "bd <addr> [cnt]  set hardware data breakpoint\n"
 "";
 
@@ -1166,7 +1196,7 @@ 

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread David Hildenbrand
Hi Michael,

Thanks for your reply! Some general thoughts:

This change was introduced in mid-2013, but we don't have a single user relying
on this code change yet, right?

Disabling might_sleep() for functions that clearly state that they may sleep to
get some special case running feels wrong to me.

 On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
  I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
  removed/skipped all might_sleep checks for might_fault() calls when in 
  atomic.
 
 Yes.  You can add e.g. might_sleep in your code if, for some reason, it is
 illegal to call it in an atomic context.
 But faults are legal in atomic if you handle the possible
 errors, and faults do not necessarily cause the caller to sleep, so might_fault
 should not call might_sleep.

My point is that almost everywhere we use pagefault_disable, we
are using the inatomic variants. Also, the documentation of copy_to_user()
clearly states at various points that this function may sleep:

- git grep "This function may sleep" yields
   "Context: User context only.  This function may sleep." e.g. s390, x86, mips

Patching out the might_sleep() from these functions seems more like a hack to
solve another problem - not using the inatomic variants, or finding them not to
be sufficient for the task?

So calling might_sleep() in these functions seems very right to me.
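
To make the pattern I have in mind concrete, here is a minimal sketch (not code
from any particular driver; slow_path() is a made-up sleeping fallback):

#include <linux/uaccess.h>

/* Sketch: touch user memory with page faults disabled, using the inatomic
 * variant and handling the error instead of sleeping in the fault handler. */
static int copy_out_atomic(void __user *ubuf, const void *kbuf, size_t len)
{
	unsigned long left;

	pagefault_disable();
	left = __copy_to_user_inatomic(ubuf, kbuf, len);
	pagefault_enable();

	/* slow_path() is hypothetical: a fallback that may sleep, called
	 * outside the pagefault-disabled section. */
	return left ? slow_path(ubuf, kbuf, len) : 0;
}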

 
  Reason was to allow calling copy_(to|from)_user while in 
  pagefault_disabled(),
  because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
 
 That wasn't the only reason BTW.
 Andi Kleen also showed that compiler did a terrible job optimizing
 get/put user when might_sleep was called.
 See e.g. this thread:
 https://lkml.org/lkml/2013/8/14/652
 There was even an lwn.net article about it, that I don't seem to be
 able to locate.

Thanks, I'll try to look it up! But:

might_sleep() will only be called when lock debugging / sleep-in-atomic checking
is enabled, so this doesn't seem to be a problem in a production environment. Or
am I missing something?

 So might_sleep is not appropriate for lightweight operations like __get_user,
 which people literally expect to be a single instruction.

I agree that __.* variants should not call might_fault() (like I mentioned
after the table below).

 
 
 I also have a project going which handles very short packets by copying
 them into guest memory directly without waking up a thread.
 I do it by calling recvmsg on a socket, then switching to
 a thread if I get back EFAULT.
 Not yet clean enough to upstream but it does seem to cut
 the latency down quite a bit, so I'd like to have the option.
 
 
 Generally, a caller that does something like this under a spinlock:
 preempt_disable
 pagefault_disable
 error = copy_to_user

So why can't we use the inatomic variant? This seems to be an
atomic environment to me. Calling a function that states that it may sleep
doesn't feel right to me.

 pagefault_enable
 preempt_enable_no_resched
 
 is not doing anything wrong and should not get a warning,
 as long as error is handled correctly later.
 
 You can also find the discussion around the patches
 educational:
 http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
 
  However
  we have the inatomic variants of these function for this purpose.
 
 Does inatomic install fixups now?

If not, I think this would rather be the way to go.

 
 Last I checked, it didn't so it did not satisfy this purpose.
 See this comment from x86:
 
  * Copy data from kernel space to user space.  Caller must check
  * the specified block with access_ok() before calling this function.
  * The caller should also make sure he pins the user space address
  * so that we don't result in page fault and sleep.
 
 
 Also - switching to inatomic would scatter if (atomic) all
 over code. It's much better to simply call the same
 function (e.g. recvmsg) and have it fail gracefully:
 after all, we have code to handle get/put user errors
 anyway.
 
  The result of this change was that all guest access (that correctly uses
  might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP 
  is
  enabled. We lost a mighty debugging feature for user access.
 
 What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called
copy_(from|to)_user(). We experienced very random deadlocks that took some guy
almost a week to debug. The simple might_sleep() check would have shown this
error immediately.

I would have said that in 99,9% of all copy_to_user() calls we want to check
might_sleep(). That pagefault_disable() is a special case that should be
handled differently - in my opinion.

 
 If your code can handle faults, then it's safe to call
 from atomic context.
 If it can't, it must pin the page.  You can not do access_ok checks and
 then assume access won't fault.
 
  As I wasn't able to come up with any other reason why this should be
  

Re: powerpc/powernv: Fix the hmi event version check.

2014-11-26 Thread Mahesh Jagannath Salgaonkar
On 11/26/2014 09:14 AM, Michael Ellerman wrote:
 On Thu, 2014-20-11 at 04:14:36 UTC, Mahesh Salgaonkar wrote:
 From: Mahesh Salgaonkar mah...@linux.vnet.ibm.com

 The current HMI event structure is an ABI and carries a version field to
 accommodate future changes without affecting/rearranging current structure
 members that are valid for previous versions. The current version check
 if (hmi_evt->version != OpalHMIEvt_V1) seems to consider that version
 will always be V1 which may not be true in future. If we start supporting
 HMI event > V1, this check would fail without printing anything on older
 kernels. This patch fixes this issue.
 
 It's not clear what you mean when you say this check would fail without
 printing anything. The check will fail, and it will print something, ie. the
 error message.
 
 What you mean is the check will fail, and the HMI info will not be printed.

My bad, yes. I meant 'HMI info will not be printed'. Do you want me to
respin the patch with the correction?

 
 I'll CC this to stable unless you disagree.

Yes. This patch needs to go to stable.

Thanks,
-Mahesh.


Disabled LocalPlus Controller (LPC) clock on MPC512x

2014-11-26 Thread Alexander Popov

Hello.

My Freescale TWR-MPC5125 board instantly reboots if I touch
any physical address on the LocalPlus Bus (LPB) for the first time
when Linux has already booted.

This effect is reproduced by using /dev/mem or loading a kernel module
which works with any peripherals on LPB.

It took me some time to find out that such a crash is caused by
clk_disable_unused() in drivers/clk/clk.c, which disables the
LocalPlus Controller (LPC) clock if I don't touch LPB addresses in the
earlier initcalls. So starting Linux with the clk_ignore_unused boot parameter,
or inserting a dummy LPB read into some initcall, is a temporary fix.

Is it correct to gate the LPC clock? If yes, how can the mentioned
crashes be avoided properly?

There's a piece of code in arch/powerpc/platforms/512x/clock-commonclk.c
which looks doubtful to me:

/*
 * pre-enable those internal clock items which never get
 * claimed by any peripheral driver, to not have the clock
 * subsystem disable them late at startup
 */
clk_prepare_enable(clks[MPC512x_CLK_DUMMY]);
clk_prepare_enable(clks[MPC512x_CLK_E300]);	/* PowerPC CPU */
clk_prepare_enable(clks[MPC512x_CLK_DDR]);	/* DRAM */
clk_prepare_enable(clks[MPC512x_CLK_MEM]);	/* SRAM */
clk_prepare_enable(clks[MPC512x_CLK_IPS]);	/* SoC periph */
clk_prepare_enable(clks[MPC512x_CLK_LPC]);	/* boot media */

Does it mean that these clocks should be registered with the
CLK_IGNORE_UNUSED flag?
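
(For reference, registering such a gate with CLK_IGNORE_UNUSED would look
roughly like the sketch below; the parent name, register and bit index are
only illustrative, not the real clock-commonclk.c code.)

#include <linux/clk-provider.h>

/* Illustrative: a gate registered with CLK_IGNORE_UNUSED is skipped by
 * clk_disable_unused(), so it stays on even if no driver ever claims it. */
static struct clk *register_lpc_gate(void __iomem *sccr_reg, spinlock_t *lock)
{
	return clk_register_gate(NULL, "lpc", "ips", CLK_IGNORE_UNUSED,
				 sccr_reg, 30 /* made-up bit */, 0, lock);
}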

Thanks a lot.
Best regards,
Alexander

Re: [PATCH v2 07/14] of/reconfig: Always use the same structure for notifiers

2014-11-26 Thread Grant Likely
On Tue, 25 Nov 2014 21:11:58 -0600, Nathan Fontenot nf...@linux.vnet.ibm.com wrote:
 On 11/25/2014 05:07 PM, Benjamin Herrenschmidt wrote:
  On Mon, 2014-11-24 at 22:33 +, Grant Likely wrote:
  The OF_RECONFIG notifier callback uses a different structure depending
  on whether it is a node change or a property change. This is silly, and
  not very safe. Rework the code to use the same data structure regardless
  of the type of notifier.
  
  I feel pretty good about this one except...
  
  diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
  index b9d1dfdbe5bb..9fe6002c1d5a 100644
  --- a/arch/powerpc/mm/numa.c
  +++ b/arch/powerpc/mm/numa.c
  @@ -1711,12 +1711,11 @@ static void stage_topology_update(int core_id)
   static int dt_update_callback(struct notifier_block *nb,
 unsigned long action, void *data)
   {
  -  struct of_prop_reconfig *update;
  +  struct of_reconfig_data *update = data;
 int rc = NOTIFY_DONE;
   
 switch (action) {
 case OF_RECONFIG_UPDATE_PROPERTY:
  -  update = (struct of_prop_reconfig *)data;
  
  Should we assert/bug on !update->dn / update->prop?
  
  (Same for the rest of the patch)
  
  Or do you reckon it's pointless ?
  
 
 I'm not sure it's worth it, if those are NULL pointers the drivers/of
 code would have tried to use them before invoking the notifier chain.
 We won't make it this far if they're NULL.

Agreed. I'm going to merge it as-is.

g.

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
  What's the path you are trying to debug?
 
 Well, we had a problem where we held a spin_lock and called
 copy_(from|to)_user(). We experienced very random deadlocks that took some guy
 almost a week to debug. The simple might_sleep() check would have showed this
 error immediately.

This must have been a very old kernel.
A modern kernel will return an error from copy_to_user.
Which is really the point of the patch you are trying to revert.

[RFC PATCH v1 1/1] powerpc/85xx: Add support for Emerson/Artesyn MVME2500.

2014-11-26 Thread Alessio Igor Bogani
Add support for the Artesyn MVME2500 Single Board Computer.

The MVME2500 is a 6U form factor VME64 computer with:

- A single Freescale QorIQ P2010 CPU
- 1 GB of DDR3 onboard memory
- Three Gigabit Ethernets
- Five 16550 compatible UARTS
- One USB 2.0 port, one SDHC socket and one SATA connector
- One PCI/PCI eXpress Mezzanine Card (PMC/XMC) Slot
- MultiProcessor Interrupt Controller (MPIC)
- A DS1375T Real Time Clock (RTC) and 512 KB of Non-Volatile Memory
- Two 64 KB EEPROMs
- U-Boot in 16 SPI Flash

This patch is based on linux-3.18-rc6 and has been boot tested.

Signed-off-by: Alessio Igor Bogani alessio.bog...@elettra.eu
---
 arch/powerpc/boot/dts/mvme2500.dts   | 324 +++
 arch/powerpc/boot/dts/mvme2500.dtsi  |  28 +++
 arch/powerpc/configs/85xx/mvme2500_defconfig | 222 ++
 arch/powerpc/platforms/85xx/Kconfig  |   8 +
 arch/powerpc/platforms/85xx/Makefile |   1 +
 arch/powerpc/platforms/85xx/mvme2500.c   |  95 
 6 files changed, 678 insertions(+)
 create mode 100644 arch/powerpc/boot/dts/mvme2500.dts
 create mode 100644 arch/powerpc/boot/dts/mvme2500.dtsi
 create mode 100644 arch/powerpc/configs/85xx/mvme2500_defconfig
 create mode 100644 arch/powerpc/platforms/85xx/mvme2500.c

diff --git a/arch/powerpc/boot/dts/mvme2500.dts b/arch/powerpc/boot/dts/mvme2500.dts
new file mode 100644
index 000..fca05fc
--- /dev/null
+++ b/arch/powerpc/boot/dts/mvme2500.dts
@@ -0,0 +1,324 @@
+/*
+ * Device tree source for the Emerson/Artesyn MVME2500
+ *
+ * Copyright 2014 Elettra-Sincrotrone Trieste S.C.p.A.
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * Based on: P2020 DS Device Tree Source
+ * Copyright 2009 Freescale Semiconductor Inc.
+ */
+
+/include/ "fsl/p2020si-pre.dtsi"
+
+/ {
+   model = "MVME2500";
+   compatible = "artesyn,MVME2500";
+
+   aliases {
+   serial2 = &serial2;
+   serial3 = &serial3;
+   serial4 = &serial4;
+   serial5 = &serial5;
+   };
+
+   memory {
+   device_type = "memory";
+   };
+
+   board_soc: soc: soc@ffe0 {
+   ranges = <0x0 0 0xffe0 0x10>;
+
+   i2c@3000 {
+   hwmon@4c {
+   compatible = "adi,adt7461";
+   reg = <0x4c>;
+   };
+
+   rtc@68 {
+   compatible = "dallas,ds1337";
+   reg = <0x68>;
+   interrupts = <8 1 0 0>;
+   };
+
+   eeprom-vpd@54 {
+   compatible = "atmel,24c64";
+   reg = <0x54>;
+   };
+
+   eeprom@52 {
+   compatible = "atmel,24c512";
+   reg = <0x52>;
+   };
+
+   eeprom@53 {
+   compatible = "atmel,24c512";
+   reg = <0x53>;
+   };
+
+   spd@50 {
+   compatible = "atmel,24c02";
+   reg = <0x50>;
+   };
+
+   };
+
+   spi0: spi@7000 {
+   fsl,espi-num-chipselects = <2>;
+
+   flash@0 {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   compatible = "atmel,at25df641";
+   reg = <0>;
+   spi-max-frequency = <1000>;
+   partition@u-boot {
+   label = "u-boot";
+   reg = <0x 0x000A>;
+   read-only;
+   };
+   partition@dtb {
+   label = "dtb";
+   reg = <0x000A 0x0002>;
+   };
+   partition@misc {
+   label = "misc";
+   reg = <0x000C 0x0004>;
+   };
+   partition@u-boot-env {
+   label = "u-boot-env";
+   reg = <0x0010 0x0002>;
+   };
+   partition@kernel {
+   label = "kernel";
+ 

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
Hmm, you sent the same mail to me off-list, and I replied there.
Now there's a copy on the list - I'm just going to assume
it's exactly identical and paste my response here as well.
If there are more questions I missed, let me know.

On Wed, Nov 26, 2014 at 09:23:31AM +0100, David Hildenbrand wrote:
 Sorry I haven't put you on cc, must have lost you while packing my list :)
 Thanks for your answer!
 
 This change was introduced in 2013, and I haven't seen an instance making use
 of your described scenario, right?

My work is still out of tree. RHEL6 shipped some patches that use
this. I don't know whether any instances in-tree use this capability.

But it just doesn't make sense for might_fault to call might_sleep
because a fault does not imply sleep.

  On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
   I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
   removed/skipped all might_sleep checks for might_fault() calls when in 
   atomic.
  
  Yes.  You can add e.g. might_sleep in your code if, for some reason, it is
  illegal to call it in an atomic context.
  But faults are legal in atomic if you handle the possible
  errors, and faults do not necessary cause caller to sleep, so might_fault
  should not call might_sleep.
 
 I think we should use in_atomic variants for this purpose (as done in the code
 for now) - especially as pagefault_disable has been relying on the preempt
 count for a long time. But see my comment below about existing documentation.

IIUC they are not interchangeable.
The *_inatomic variants seem to require that the page is pinned.
The *_user variants do not: they install a fixup so you get an error code if
you try to access an invalid address.

Besides, this would just lead to a ton of
 if (atomic) return inatomic else return user
in code, for no good purpose.

  
   Reason was to allow calling copy_(to|from)_user while in 
   pagefault_disabled(),
   because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
  
  That wasn't the only reason BTW.
  Andi Kleen also showed that compiler did a terrible job optimizing
  get/put user when might_sleep was called.
 
 might_fault() should never call might_sleep() when lock debugging is off, so
 there should be no performance problem or am I missing something?

CONFIG_DEBUG_ATOMIC_SLEEP too, no?

  See e.g. this thread:
  https://lkml.org/lkml/2013/8/14/652
  There was even an lwn.net article about it, that I don't seem to be
  able to locate.
 
 Thanks, will see if I can find it.
 
  So might_sleep is not appropriate for lightweight operations like 
  __get_user,
  which people literally expect to be a single instruction.
 
 Yes, as discussed below, __.* variants should never call it.

So that would be even more inconsistent. They fault exactly the same as
the non __ variants.


  
  
  I also have a project going which handles very short packets by copying
  them into guest memory directly without waking up a thread.
  I do it by calling recvmsg on a socket, then switching to
  a thread if I get back EFAULT.
  Not yet clean enough to upstream but it does seem to cut
  the latency down quite a bit, so I'd like to have the option.
  
  
  Generally, a caller that does something like this under a spinlock:
  preempt_disable
  pagefault_disable
  error = copy_to_user
  pagefault_enable
  preempt_enable_no_resched
  
  is not doing anything wrong and should not get a warning,
  as long as error is handled correctly later.
 
 I think this would be a perfect fit for an inatomic variant, no?

No because inatomic does not handle faults.

 I mean even the copy_to_user documentation of e.g. s390, x86, mips
 clearly says:
   "Context: User context only.  This function may sleep."

So the comment is incomplete - I didn't get around to fixing that.
It may sleep, but not in atomic context.

 So calling it from your described scenario is wrong. And as the interface 
 says,
 it might_sleep() and therefore also call the check in might_fault().

Exactly the reverse.
There's no way for it to sleep except on a fault, and that only if
preemption is on.

  
  You can also find the discussion around the patches
  educational:
  http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
  
   However
   we have the inatomic variants of these function for this purpose.
  
  Does inatomic install fixups now?
 
 I think this varies between architectures but I am no expert. But as 99,9% of
 all pagefault_disable code uses inatomic, I would have guessed that this is
 rather the way to go than simply using the non-atomic variant that clearly
 states on the interface that it might sleep?

Let's go ahead and make the comment more exact then.

  
  Last I checked, it didn't so it did not satisfy this purpose.
  See this comment from x86:
  
   * Copy data from kernel space to user space.  Caller must check
   * the specified block with access_ok() before calling this function.
   * The caller should also make sure 

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
   What's the path you are trying to debug?
  
  Well, we had a problem where we held a spin_lock and called
  copy_(from|to)_user(). We experienced very random deadlocks that took some 
  guy
  almost a week to debug. The simple might_sleep() check would have showed 
  this
  error immediately.
 
 This must have been a very old kernel.
 A modern kernel will return an error from copy_to_user.
 Which is really the point of the patch you are trying to revert.

That's assuming you disabled preemption. If you didn't, and take
a spinlock, you have deadlocks even without userspace access.

-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Christian Borntraeger
Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin:
 On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
 What's the path you are trying to debug?

 Well, we had a problem where we held a spin_lock and called
 copy_(from|to)_user(). We experienced very random deadlocks that took some 
 guy
 almost a week to debug. The simple might_sleep() check would have showed this
 error immediately.
 

 This must have been a very old kernel.
 A modern kernel will return an error from copy_to_user.

I disagree. copy_to_user will not return while holding a spinlock, because it 
does not know! How should it?
See: spin_lock will call preempt_disable, but that's a no-op for a non-preempt 
kernel. So the mere fact that we hold a spin_lock is not known by any user 
access function. (or others). No?

Christian




Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread David Hildenbrand
 On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
What's the path you are trying to debug?
   
   Well, we had a problem where we held a spin_lock and called
   copy_(from|to)_user(). We experienced very random deadlocks that took 
   some guy
   almost a week to debug. The simple might_sleep() check would have showed 
   this
   error immediately.
  
  This must have been a very old kernel.
  A modern kernel will return an error from copy_to_user.
  Which is really the point of the patch you are trying to revert.
 
 That's assuming you disabled preemption. If you didn't, and take
 a spinlock, you have deadlocks even without userspace access.
 

(Thanks for your resend, my first email was sent directly to you ... grml)

This is what happened on our side (very recent kernel):

spin_lock(lock)
copy_to_user(...)
spin_unlock(lock)

1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
   as old value
2. we slept during copy_to_user()
3. the thread got scheduled onto another cpu
4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
   the spinlock tried to unlock it).
5. lock remained locked - deadlock

Christian came up with the following explanation:
Without preemption, spin_lock() will not touch the preempt counter.
disable_pfault() will always touch it.

Therefore, with preemption disabled, copy_to_user() has no idea that it is
running in atomic context - and will therefore try to sleep.

So copy_to_user() will on s390:
1. run as atomic while spin_lock() with preemption enabled.
2. run as not atomic while spin_lock() with preemption disabled.
3.  run as atomic while pagefault_disabled() with preemption enabled or
disabled.
4. run as not atomic when really not atomic.

And exactly nr 2. is the thing that produced the deadlock in our scenario and
the reason why I want a might_sleep() :)
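
For illustration, the broken shape and one safe rearrangement look roughly like
this (sketch only, not our actual code; demo_lock and the buffers are made up):

#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static DEFINE_SPINLOCK(demo_lock);

/* On a kernel built without CONFIG_PREEMPT, spin_lock() does not raise the
 * preempt count, so copy_to_user() cannot tell it is inside a critical
 * section: it may fault, sleep and migrate to another CPU with the lock held. */
static int broken_copy(void __user *ubuf, const void *kbuf, size_t len)
{
	int ret = 0;

	spin_lock(&demo_lock);
	if (copy_to_user(ubuf, kbuf, len))	/* may sleep on a fault */
		ret = -EFAULT;
	spin_unlock(&demo_lock);		/* may now run on another CPU */
	return ret;
}

/* Safer shape: do the user access outside the lock. */
static int fixed_copy(void __user *ubuf, const void *kbuf, size_t len,
		      void *scratch)
{
	spin_lock(&demo_lock);
	memcpy(scratch, kbuf, len);
	spin_unlock(&demo_lock);

	return copy_to_user(ubuf, scratch, len) ? -EFAULT : 0;
}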


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote:
 Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin:
  On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
  What's the path you are trying to debug?
 
  Well, we had a problem where we held a spin_lock and called
  copy_(from|to)_user(). We experienced very random deadlocks that took some 
  guy
  almost a week to debug. The simple might_sleep() check would have showed 
  this
  error immediately.
  
 
  This must have been a very old kernel.
  A modern kernel will return an error from copy_to_user.
 
 I disagree. copy_to_user will not return while holding a spinlock, because it 
 does not know! How should it?
 See: spin_lock will call preempt_disable, but thats a no-op for a non-preempt 
 kernel. So the mere fact that we hold a spin_lock is not known by any user 
 access function. (or others). No?
 
 Christian
 
 

Well might_sleep() merely checks preempt count and irqs_disabled too.
If you want debugging things to trigger, you need to enable
a bunch of config options.  That's not new.


-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
  On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
   On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
 What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called
copy_(from|to)_user(). We experienced very random deadlocks that took 
some guy
almost a week to debug. The simple might_sleep() check would have 
showed this
error immediately.
   
   This must have been a very old kernel.
   A modern kernel will return an error from copy_to_user.
   Which is really the point of the patch you are trying to revert.
  
  That's assuming you disabled preemption. If you didn't, and take
  a spinlock, you have deadlocks even without userspace access.
  
 
 (Thanks for your resent, my first email was sent directly to you ... grml)
 
 This is what happened on our side (very recent kernel):
 
 spin_lock(lock)
 copy_to_user(...)
 spin_unlock(lock)

That's a deadlock even without copy_to_user - it's
enough for the thread to be preempted and another one
to try taking the lock.


 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_
as old value
 2. we slept during copy_to_user()
 3. the thread got scheduled onto another cpu
 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
the spinlock tried to unlocked it).
 5. lock remained locked - deadlock
 
 Christian came up with the following explanation:
 Without preemption, spin_lock() will not touch the preempt counter.
 disable_pfault() will always touch it.
 
 Therefore, with preemption disabled, copy_to_user() has no idea that it is
 running in atomic context - and will therefore try to sleep.

 So copy_to_user() will on s390:
 1. run as atomic while spin_lock() with preemption enabled.
 2. run as not atomic while spin_lock() with preemption disabled.
 3.  run as atomic while pagefault_disabled() with preemption enabled or
 disabled.
 4. run as not atomic when really not atomic.
 
 And exactly nr 2. is the thing that produced the deadlock in our scenario and
 the reason why I want a might_sleep() :)

IMHO it's not copy to user that causes the problem.
It's the misuse of spinlocks with preemption on.

So might_sleep would make you think copy_to_user is
the problem, and e.g. let you paper over it by
moving copy_to_user out.

Enable the lock prover and you will see what the real
issue is, which is that you didn't disable preemption.
And if you did, copy_to_user would be okay.

-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread David Hildenbrand
  This is what happened on our side (very recent kernel):
  
  spin_lock(lock)
  copy_to_user(...)
  spin_unlock(lock)
 
 That's a deadlock even without copy_to_user - it's
 enough for the thread to be preempted and another one
 to try taking the lock.
 
 
  1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu 
  id_
 as old value
  2. we slept during copy_to_user()
  3. the thread got scheduled onto another cpu
  4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
 the spinlock tried to unlocked it).
  5. lock remained locked - deadlock
  
  Christian came up with the following explanation:
  Without preemption, spin_lock() will not touch the preempt counter.
  disable_pfault() will always touch it.
  
  Therefore, with preemption disabled, copy_to_user() has no idea that it is
  running in atomic context - and will therefore try to sleep.
 
  So copy_to_user() will on s390:
  1. run as atomic while spin_lock() with preemption enabled.
  2. run as not atomic while spin_lock() with preemption disabled.
  3.  run as atomic while pagefault_disabled() with preemption enabled or
  disabled.
  4. run as not atomic when really not atomic.

should have been more clear at that point: 
preemption enabled == kernel compiled with preemption support
preemption disabled == kernel compiled without preemption support

  
  And exactly nr 2. is the thing that produced the deadlock in our scenario 
  and
  the reason why I want a might_sleep() :)
 
 IMHO it's not copy to user that causes the problem.
 It's the misuse of spinlocks with preemption on.

As I said, preemption was off.

 
 So might_sleep would make you think copy_to_user is
 the problem, and e.g. let you paper over it by
 moving copy_to_user out.

Actually, implementing a different way of locking easily fixed the problem for us.
The old might_sleep() checks would have flagged the problem within a few
seconds (I tested it).

 
 Enable lock prover and you will see what the real
 issue is, which is you didn't disable preempt.
 and if you did, copy_to_user would be okay.
 

Our kernel is compiled without preemption and we turned on all lock/atomic
sleep debugging aids. No problem was detected.


But the question is if we shouldn't rather provide a:

  copy_to_user_nosleep() implementation that can be called from
pagefault_disable() because it won't sleep.
and a
  copy_to_user_sleep() implementation that cannot be called from
pagefault_disable().

Another way to fix it would be a reworked pagefault_disable() function that
somehow sets a flag, so copy_to_user() knows that it is in fact called from a
valid context, not just from some atomic context. So we could trigger
might_sleep() when detecting a !pagefault_disable context.
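
A very rough sketch of that flag idea (everything here is made up, nothing like
this exists in the tree; pagefault_disabled is a hypothetical per-task counter):

static inline void pagefault_disable(void)
{
	preempt_count_inc();
	current->pagefault_disabled++;	/* hypothetical marker */
	/* make sure the marker is visible before any user access */
	barrier();
}

static inline void might_fault(void)
{
	/* explicit opt-in: user access with faults disabled is fine */
	if (current->pagefault_disabled)
		return;

	/* every other atomic context still gets the warning */
	might_sleep();
}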


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Christian Borntraeger
Am 26.11.2014 um 16:37 schrieb Michael S. Tsirkin:
 On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote:
 Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin:
 On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
 What's the path you are trying to debug?

 Well, we had a problem where we held a spin_lock and called
 copy_(from|to)_user(). We experienced very random deadlocks that took some 
 guy
 almost a week to debug. The simple might_sleep() check would have showed 
 this
 error immediately.


 This must have been a very old kernel.
 A modern kernel will return an error from copy_to_user.

 I disagree. copy_to_user will not return while holding a spinlock, because 
 it does not know! How should it?
 See: spin_lock will call preempt_disable, but thats a no-op for a 
 non-preempt kernel. So the mere fact that we hold a spin_lock is not known 
 by any user access function. (or others). No?

 Christian


 
 Well might_sleep() merely checks preempt count and irqs_disabled too.
 If you want debugging things to trigger, you need to enable
 a bunch of config options.  That's not new.

You miss the point of the whole thread: The problem is that even with debug 
options enabled, holding a spinlock would not trigger a bug on copy_to_user. So 
the problem is not the good path, the problem is that a debugging aid for 
detecting a broken case was lost. Even with all kernel debugging enabled.

That is because CONFIG_DEBUG_ATOMIC_SLEEP selects PREEMPT_COUNT. That means: 
spin_lock will then be considered as in_atomic and no message comes. Without 
CONFIG_DEBUG_ATOMIC_SLEEP spin_lock will not touch the preempt_count, but we 
also don't see a message because might_fault is now a no-op.

I understand that you don't like David's changes due to other side effects that 
you have mentioned. So let's focus on how we can fix the debug option. OK?

Christian


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Christian Borntraeger
Am 26.11.2014 um 16:47 schrieb Michael S. Tsirkin:
 On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
 On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
 What's the path you are trying to debug?

 Well, we had a problem where we held a spin_lock and called
 copy_(from|to)_user(). We experienced very random deadlocks that took 
 some guy
 almost a week to debug. The simple might_sleep() check would have showed 
 this
 error immediately.

 This must have been a very old kernel.
 A modern kernel will return an error from copy_to_user.
 Which is really the point of the patch you are trying to revert.

 That's assuming you disabled preemption. If you didn't, and take
 a spinlock, you have deadlocks even without userspace access.


 (Thanks for your resent, my first email was sent directly to you ... grml)

 This is what happened on our side (very recent kernel):

 spin_lock(lock)
 copy_to_user(...)
 spin_unlock(lock)
 
 That's a deadlock even without copy_to_user - it's
 enough for the thread to be preempted and another one
 to try taking the lock.

Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = 
server anyway).

But please: One step back. The problem is not the good path. The problem is 
that we lost a debugging aid for a case known to be broken. In other words: Our 
code had a bug. Older kernels detected that kind of bug. With your change we no 
longer saw the "sleeping while atomic" warning. That's it. See my other mail.

Christian


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
   This is what happened on our side (very recent kernel):
   
   spin_lock(lock)
   copy_to_user(...)
   spin_unlock(lock)
  
  That's a deadlock even without copy_to_user - it's
  enough for the thread to be preempted and another one
  to try taking the lock.
  
  
   1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu 
   id_
  as old value
   2. we slept during copy_to_user()
   3. the thread got scheduled onto another cpu
   4. spin_unlock failed as the _cpu id_ didn't match (another cpu that 
   locked
  the spinlock tried to unlocked it).
   5. lock remained locked - deadlock
   
   Christian came up with the following explanation:
   Without preemption, spin_lock() will not touch the preempt counter.
   disable_pfault() will always touch it.
   
   Therefore, with preemption disabled, copy_to_user() has no idea that it is
   running in atomic context - and will therefore try to sleep.
  
   So copy_to_user() will on s390:
   1. run as atomic while spin_lock() with preemption enabled.
   2. run as not atomic while spin_lock() with preemption disabled.
   3.  run as atomic while pagefault_disabled() with preemption enabled or
   disabled.
   4. run as not atomic when really not atomic.
 
 should have been more clear at that point: 
 preemption enabled == kernel compiled with preemption support
 preemption disabled == kernel compiled without preemption support
 
   
   And exactly nr 2. is the thing that produced the deadlock in our scenario 
   and
   the reason why I want a might_sleep() :)
  
  IMHO it's not copy to user that causes the problem.
  It's the misuse of spinlocks with preemption on.
 
 As I said, preemption was off.

off - disabled at compile time?

But the code is broken for people that do enable it.


  
  So might_sleep would make you think copy_to_user is
  the problem, and e.g. let you paper over it by
  moving copy_to_user out.
 
 Actually implementing different way of locking easily fixed the problem for 
 us.
 The old might_sleep() checks would have given us the problem within a few
 seconds (I tested it).


Or enable CONFIG_PREEMPT, with the same effect (copy_to_user will report
an error).

Do you check the return code from copy_to_user?
If not, then you have another bug ...

  
  Enable lock prover and you will see what the real
  issue is, which is you didn't disable preempt.
  and if you did, copy_to_user would be okay.
  
 
 Our kernel is compiled without preemption and we turned on all lock/atomic
 sleep debugging aid. No problem was detected.

But your code is still buggy with preemption on, isn't it?


 
 But the question is if we shouldn't rather provide a:
 
   copy_to_user_nosleep() implementation that can be called from
 pagefault_disable() because it won't sleep.
 and a
   copy_to_user_sleep() implementation that cannot be called from
 pagefault_disable().
 
 Another way to fix it would be a reworked pagefault_disable() function that
 somehow sets a flag, so copy_to_user() knows that it is in fact called from 
 a
 valid context, not just from some atomic context. So we could trigger
 might_sleep() when detecting a !pagefault_disable contex

I think all this is just directing people to paper over the
problem. You should normally disable preemption if you take
spinlocks.
Yes, it might happen to work if preempt is compiled out
and you don't trigger the scheduler, but Linux might
add scheduler calls at any point without notice;
code must be preempt-safe.

Maybe add a debug option warning about spinlocks taken
with preempt on.

That would make sense I think.


-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Christian Borntraeger
Am 26.11.2014 um 17:19 schrieb Michael S. Tsirkin:
 On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
 This is what happened on our side (very recent kernel):

 spin_lock(lock)
 copy_to_user(...)
 spin_unlock(lock)

 That's a deadlock even without copy_to_user - it's
 enough for the thread to be preempted and another one
 to try taking the lock.


 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu 
 id_
as old value
 2. we slept during copy_to_user()
 3. the thread got scheduled onto another cpu
 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked
the spinlock tried to unlocked it).
 5. lock remained locked - deadlock

 Christian came up with the following explanation:
 Without preemption, spin_lock() will not touch the preempt counter.
 disable_pfault() will always touch it.

 Therefore, with preemption disabled, copy_to_user() has no idea that it is
 running in atomic context - and will therefore try to sleep.

 So copy_to_user() will on s390:
 1. run as atomic while spin_lock() with preemption enabled.
 2. run as not atomic while spin_lock() with preemption disabled.
 3.  run as atomic while pagefault_disabled() with preemption enabled or
 disabled.
 4. run as not atomic when really not atomic.

 should have been more clear at that point: 
 preemption enabled == kernel compiled with preemption support
 preemption disabled == kernel compiled without preemption support


 And exactly nr 2. is the thing that produced the deadlock in our scenario 
 and
 the reason why I want a might_sleep() :)

 IMHO it's not copy to user that causes the problem.
 It's the misuse of spinlocks with preemption on.

 As I said, preemption was off.
 
 off - disabled at compile time?
 
 But the code is broken for people that do enable it.
[...]
 You should normally disable preemption if you take
 spinlocks.

You are telling me that any sequence of
spin_lock
...
spin_unlock

is broken with CONFIG_PREEMPT?
Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just 
fine. 

Only sequences like
spin_lock
...
schedule
...
spin_unlock
are broken.

But as I said. That is not the problem that we are discussing here.

Christian


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 05:07:13PM +0100, Christian Borntraeger wrote:
 Am 26.11.2014 um 16:47 schrieb Michael S. Tsirkin:
  On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote:
  On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote:
  What's the path you are trying to debug?
 
  Well, we had a problem where we held a spin_lock and called
  copy_(from|to)_user(). We experienced very random deadlocks that took 
  some guy
  almost a week to debug. The simple might_sleep() check would have 
  showed this
  error immediately.
 
  This must have been a very old kernel.
  A modern kernel will return an error from copy_to_user.
  Which is really the point of the patch you are trying to revert.
 
  That's assuming you disabled preemption. If you didn't, and take
  a spinlock, you have deadlocks even without userspace access.
 
 
  (Thanks for your resent, my first email was sent directly to you ... grml)
 
  This is what happened on our side (very recent kernel):
 
  spin_lock(lock)
  copy_to_user(...)
  spin_unlock(lock)
  
  That's a deadlock even without copy_to_user - it's
  enough for the thread to be preempted and another one
  to try taking the lock.
 
 Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = 
 server anyway).

Are you sure? Can you point me to where it does this, please?

 But please: One step back. The problem is not the good path. The problem is 
 that we lost a debugging aid for a known to be broken case. In other words: 
 Our code had a bug. Older kernels detected that kind of bug. With your change 
 we no longer saw the sleeping while atomic. Thats it. See my other mail.
 
 Christian

You want to add more debugging tools, fine. But this one was
giving users in the field false positives.

The point is that *_user is safe with preempt off.
It returns an error gracefully.
It does not sleep.
It does not trigger the scheduler in that context.


David's patch makes it say it does, so it's wrong.



-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 05:30:35PM +0100, Christian Borntraeger wrote:
 Am 26.11.2014 um 17:19 schrieb Michael S. Tsirkin:
  On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote:
  This is what happened on our side (very recent kernel):
 
  spin_lock(lock)
  copy_to_user(...)
  spin_unlock(lock)
 
  That's a deadlock even without copy_to_user - it's
  enough for the thread to be preempted and another one
  to try taking the lock.
 
 
  1. s390 locks/unlocks a spin lock with a compare and swap, using the 
  _cpu id_
 as old value
  2. we slept during copy_to_user()
  3. the thread got scheduled onto another cpu
  4. spin_unlock failed as the _cpu id_ didn't match (another cpu that 
  locked
 the spinlock tried to unlocked it).
  5. lock remained locked - deadlock
 
  Christian came up with the following explanation:
  Without preemption, spin_lock() will not touch the preempt counter.
  disable_pfault() will always touch it.
 
  Therefore, with preemption disabled, copy_to_user() has no idea that it 
  is
  running in atomic context - and will therefore try to sleep.
 
  So copy_to_user() will on s390:
  1. run as atomic while spin_lock() with preemption enabled.
  2. run as not atomic while spin_lock() with preemption disabled.
  3.  run as atomic while pagefault_disabled() with preemption enabled or
  disabled.
  4. run as not atomic when really not atomic.
 
  should have been more clear at that point: 
  preemption enabled == kernel compiled with preemption support
  preemption disabled == kernel compiled without preemption support
 
 
  And exactly nr 2. is the thing that produced the deadlock in our 
  scenario and
  the reason why I want a might_sleep() :)
 
  IMHO it's not copy to user that causes the problem.
  It's the misuse of spinlocks with preemption on.
 
  As I said, preemption was off.
  
  off - disabled at compile time?
  
  But the code is broken for people that do enable it.
 [...]
  You should normally disable preemption if you take
  spinlocks.
 
 Your are telling that any sequence of
 spin_lock
 ...
 spin_unlock
 
 is broken with CONFIG_PREEMPT?
 Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just 
 fine. 
 
 Only sequences like
 spin_lock
 ...
 schedule
 ...
 spin_unlock
 are broken.
 
 But as I said. That is not the problem that we are discussing here.
 
 Christian

I'm saying spin_lock without _irqsave is often a bug.


I am also saying this code in mm/fault.c:
__do_page_fault
...
/*
 * If we're in an interrupt, have no user context or are running
 * in an atomic region then we must not take the fault:
 */
if (unlikely(in_atomic() || !mm)) {
bad_area_nosemaphore(regs, error_code, address);
return;
}

means that a fault won't cause sleep if called in atomic context.

And a bunch of code relies on this.

This is why might_fault does:

 * it would be nicer only to annotate paths which are not under
 * pagefault_disable, however that requires a larger audit and
 * providing helpers like get_user_atomic.
 */
if (in_atomic())
return;

__might_sleep(__FILE__, __LINE__, 0);


If you see this violated, let's figure out why.

-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Christian Borntraeger
Am 26.11.2014 um 17:32 schrieb Michael S. Tsirkin:
[...]
 This is what happened on our side (very recent kernel):

 spin_lock(lock)
 copy_to_user(...)
 spin_unlock(lock)

 That's a deadlock even without copy_to_user - it's
 enough for the thread to be preempted and another one
 to try taking the lock.

 Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt 
 = server anyway).
 
 Are you sure? Can you point me where it does this please?

spin_lock --> raw_spin_lock --> _raw_spin_lock --> __raw_spin_lock

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
   preempt_disable();   <-----
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}


Michael, please be serious. The whole kernel would be broken if spin_lock did
not disable preemption.


 
 But please: One step back. The problem is not the good path. The problem is 
 that we lost a debugging aid for a known to be broken case. In other words: 
 Our code had a bug. Older kernels detected that kind of bug. With your 
 change we no longer saw the sleeping while atomic. Thats it. See my other 
 mail.

 Christian
 
 You want to add more debugging tools, fine.

We don't want to add one, we want to fix something that used to work.

 But this one was  giving users in field false positives.

So let's try to fix those, OK? If we can't, then tough luck. But coming up with 
wrong statements is not helpful.

 
 The point is that *_user is safe with preempt off.
 It returns an error gracefully.
 It does not sleep.
 It does not trigger the scheduler in that context.

There are special cases where your statement is true. But it's not true in general.
copy_to_user might fault and that fault might sleep and reschedule. For example, 
handle_mm_fault might go down to pud_alloc, pmd_alloc etc. and all these 
functions could do a GFP_KERNEL allocation. Which might sleep. Which will 
schedule.


 
 
 David's patch makes it say it does, so it's wrong.
 
 
 


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
  But this one was  giving users in field false positives.
 
 So lets try to fix those, ok? If we cant, then tough luck.

Sure.
I think the simplest way might be to make spinlock disable
preemption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.

As a result, userspace access will fail and the caller will
get a nice error.



 But coming up with wrong statements is not helpful.

True. Sorry that I did that.

  
  The point is that *_user is safe with preempt off.
  It returns an error gracefully.
  It does not sleep.
  It does not trigger the scheduler in that context.
 
 There are special cases where your statement is true. But its not in general.
 copy_to_user might fault and that fault might sleep and reschedule.

Yes. But not if called inatomic.



 For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and all 
 these functions could do an GFP_KERNEL allocation. Which might sleep. Which 
 will schedule.
 
 
  
  
  David's patch makes it say it does, so it's wrong.
  
  
  

Absolutely.
I think you can already debug your case easily, by enabling CONFIG_PREEMPT.
This seems counter-intuitive, and distro debug kernels don't seem to do this.

-- 
MST

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
   But this one was  giving users in field false positives.
  
  So lets try to fix those, ok? If we cant, then tough luck.
 
 Sure.
 I think the simplest way might be to make spinlock disable
 premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.

Specifically maybe DEBUG_ATOMIC_SLEEP should select PREEMPT_COUNT?
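
(The Kconfig wiring being suggested would look roughly like this; as Christian
noted earlier in the thread, lib/Kconfig.debug already has DEBUG_ATOMIC_SLEEP
select PREEMPT_COUNT, so the stanza below is only for illustration.)

config DEBUG_ATOMIC_SLEEP
	bool "Sleep inside atomic section checking"
	select PREEMPT_COUNT
	depends on DEBUG_KERNEL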


 As a result, userspace access will fail and caller will
 get a nice error.
 
 
 
  But coming up with wrong statements is not helpful.
 
 True. Sorry that I did that.
 
   
   The point is that *_user is safe with preempt off.
   It returns an error gracefully.
   It does not sleep.
   It does not trigger the scheduler in that context.
  
  There are special cases where your statement is true. But its not in 
  general.
  copy_to_user might fault and that fault might sleep and reschedule.
 
 Yes. But not if called inatomic.
 
 
 
  For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and 
  all these functions could do an GFP_KERNEL allocation. Which might sleep. 
  Which will schedule.
  
  
   
   
   David's patch makes it say it does, so it's wrong.
   
   
   
 
 Absolutely.
 I think you can already debug your case easily, by enabling CONFIG_PREEMPT.
 This seems counter-intuitive, and distro debug kernels don't seem to do this.
 
 -- 
 MST

[PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

2014-11-26 Thread Anton Blanchard
I used some 64 bit instructions when adding the 32 bit getcpu VDSO
function. Fix it.

Fixes: 18ad51dd342a ("powerpc: Add VDSO version of getcpu")
Cc: sta...@vger.kernel.org
Signed-off-by: Anton Blanchard an...@samba.org
---
 arch/powerpc/kernel/vdso32/getcpu.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/getcpu.S 
b/arch/powerpc/kernel/vdso32/getcpu.S
index 23eb9a9..c62be60 100644
--- a/arch/powerpc/kernel/vdso32/getcpu.S
+++ b/arch/powerpc/kernel/vdso32/getcpu.S
@@ -30,8 +30,8 @@
 V_FUNCTION_BEGIN(__kernel_getcpu)
   .cfi_startproc
mfspr   r5,SPRN_SPRG_VDSO_READ
-   cmpdi   cr0,r3,0
-   cmpdi   cr1,r4,0
+   cmpwi   cr0,r3,0
+   cmpwi   cr1,r4,0
clrlwi  r6,r5,16
rlwinm  r7,r5,16,31-15,31-0
beq cr0,1f
-- 
2.1.0


Re: [RFC PATCH v1 1/1] powerpc/85xx: Add support for Emerson/Artesyn MVME2500.

2014-11-26 Thread Scott Wood
On Wed, 2014-11-26 at 15:17 +0100, Alessio Igor Bogani wrote:
 + board_soc: soc: soc@ffe0 {

There's no need for two labels on the same node.

 + ranges = 0x0 0 0xffe0 0x10;
 +
 + i2c@3000 {
 + hwmon@4c {
 + compatible = adi,adt7461;
 + reg = 0x4c;
 + };
 +
 + rtc@68 {
 + compatible = dallas,ds1337;
 + reg = 0x68;
 + interrupts = 8 1 0 0;
 + };
 +
 + eeprom-vpd@54 {
 + compatible = atmel,24c64;
 + reg = 0x54;
 + };

eeprom-vpd?

Node name isn't the right place to put the intended usage of the
contents of the EEPROM.

 +
 + eeprom@52 {
 + compatible = atmel,24c512;
 + reg = 0x52;
 + };
 +
 + eeprom@53 {
 + compatible = atmel,24c512;
 + reg = 0x53;
 + };
 +
 + spd@50 {
 + compatible = atmel,24c02;
 + reg = 0x50;
 + };

Likewise, I suspect this is also an eeprom.

 + };
 +
 + spi0: spi@7000 {
 + fsl,espi-num-chipselects = 2;
 +
 + flash@0 {
 + #address-cells = 1;
 + #size-cells = 1;
 + compatible = atmel,at25df641;
 + reg = 0;
 + spi-max-frequency = 1000;
 + partition@u-boot {
 + label = u-boot;
 + reg = 0x 0x000A;
 + read-only;
 + };
 + partition@dtb {
 + label = dtb;
 + reg = 0x000A 0x0002;
 + };

Unfortunately you seem to have copied a bad example here...  After the @
should be a number that matches reg.

Better yet, don't put partition information in the dts at all -- it's
not hardware description.  Use the mtdparts command line.
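
For illustration, the same layout could be handed to the kernel on the
command line along these lines (the mtd-id is board-specific and only a
placeholder here; 0xA0000 is 640K and 0x20000 is 128K):

    mtdparts=<mtd-id>:640k(u-boot)ro,128k(dtb),...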

 + lbc: localbus@ffe05000 {
 + reg = 0 0xffe05000 0 0x1000;
 +
 + ranges = 0x0 0x0 0x0 0xfff0 0x0008
 +   0x1 0x0 0x0 0xffc4 0x0001
 +   0x2 0x0 0x0 0xffc5 0x0001
 +   0x3 0x0 0x0 0xffc6 0x0001
 +   0x4 0x0 0x0 0xffc7 0x0001
 +   0x6 0x0 0x0 0xffc8 0x0001
 +   0x5 0x0 0x0 0xffdf 0x1000;

It's not possible to program the LBC with a window of only 0x1000 bytes.

 +
 + serial2: serial@1,0 {
 + #cell-index = 2;
 + device_type = serial;
 + compatible = ns16550;
 + reg = 0x1 0x0 0x100;
 + clock-frequency = 1843200;
 + interrupts = 11 2 0 0;
 + };
 +
 + serial3: serial@2,0 {
 + #cell-index = 3;
 + device_type = serial;
 + compatible = ns16550;
 + reg = 0x2 0x0 0x100;
 + clock-frequency = 1843200;
 + interrupts = 1 2 0 0;
 + };

Why do you need cell-index, what connection do these values have to
actual hardware (e.g. values written to a register, rather than numbers
in a manual), and why did the name change to #cell-index?

 + interrupts = 9 1 0 0 ;

Whitespace

 +/include/ mvme2500.dtsi

Are you going to have more than one .dts using this .dtsi?  If not, why
separate this part?


 diff --git a/arch/powerpc/boot/dts/mvme2500.dtsi 
 b/arch/powerpc/boot/dts/mvme2500.dtsi
 new file mode 100644
 index 000..6966f13
 --- /dev/null
 +++ b/arch/powerpc/boot/dts/mvme2500.dtsi
[snip]
 +/include/ fsl/pq3-mpic-message-B.dtsi
 +};

Why is this being included from a board file rather than from the SoC
file?

 diff --git a/arch/powerpc/configs/85xx/mvme2500_defconfig 
 b/arch/powerpc/configs/85xx/mvme2500_defconfig
 new file mode 100644
 index 000..06fe629
 --- /dev/null
 +++ b/arch/powerpc/configs/85xx/mvme2500_defconfig

Why does this board need its own defconfig?

If it's just for the address space stuff, maybe it could be a more
general mpc85xx_2g_1g_1g_defconfig.  xes_mpc85xx_defconfig uses the same
layout (though it's SMP).  Maybe other boards could share it in the
future, or users of existing boards might prefer it...

Better still would be 

Re: powerpc/powernv: Fix the hmi event version check.

2014-11-26 Thread Michael Ellerman
On Wed, 2014-11-26 at 15:56 +0530, Mahesh Jagannath Salgaonkar wrote:
 On 11/26/2014 09:14 AM, Michael Ellerman wrote:
  On Thu, 2014-20-11 at 04:14:36 UTC, Mahesh Salgaonkar wrote:
  From: Mahesh Salgaonkar mah...@linux.vnet.ibm.com
 
  The current HMI event structure is an ABI and carries a version field to
  accommodate future changes without affecting/rearranging current structure
  members that are valid for previous versions. The current version check
  if (hmi_evt->version != OpalHMIEvt_V1) seems to consider that version
  will always be V1 which may not be true in future. If we start supporting
  HMI event > V1, this check would fail without printing anything on older
  kernels. This patch fixes this issue.
  
  It's not clear what you mean when you say this check would fail without
  printing anything. The check will fail, and it will print something, ie. 
  the
  error message.
  
  What you mean is the check will fail, and the HMI info will not be 
  printed.
 
 My Bad, Yes. I meant 'HMI info will not be printed'. Do you want me to
 re spin the patch with correction.

No that's fine, I've fixed it up to be:

The current HMI event structure is an ABI and carries a version field to
accommodate future changes without affecting/rearranging current structure
members that are valid for previous versions.

The current version check if (hmi_evt->version != OpalHMIEvt_V1)
doesn't accommodate the fact that the version number may change in
future.

If firmware starts returning an HMI event with version > 1, this check
will fail and no HMI information will be printed on older kernels.

This patch fixes this issue.
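
For illustration, a forward-compatible check would look something like the
sketch below (a sketch, not the literal patch): only bail out on versions we
cannot parse at all, and keep decoding the fields that have been valid since V1.

    if (hmi_evt->version < OpalHMIEvt_V1) {
        pr_err("HMI Interrupt: unsupported event version %d\n",
               hmi_evt->version);
        return;
    }
    /* Fields defined for V1 remain valid for any later version. */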


  I'll CC this to stable unless you disagree.
 
 Yes. This patch needs to go to stable.

Yep I've added Cc stable to the patch.

cheers



Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

2014-11-26 Thread Michael Ellerman
On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote:
 I used some 64 bit instructions when adding the 32 bit getcpu VDSO
 function. Fix it.

Ouch. The symptom is a SIGILL I presume?

Could we catch this by forcing -m32 in the CFLAGS for vdso32 ?

cheers



Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

2014-11-26 Thread Segher Boessenkool
On Thu, Nov 27, 2014 at 09:38:17AM +1100, Michael Ellerman wrote:
 On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote:
  I used some 64 bit instructions when adding the 32 bit getcpu VDSO
  function. Fix it.
 
 Ouch. The symptom is a SIGILL I presume?
 
 Could we catch this by forcing -m32 in the CFLAGS for vdso32 ?

GCC has added -many to the assembler flags for over ten years now, so
no, that will not work.  You can use -mppc or similar with the assembler
if you invoke it correctly (use  $(CC) -print-prog-name=as  to figure
out how to call the assembler, if you want to stay sane and use the
same one as the rest of the toolchain you use).


Segher

Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

2014-11-26 Thread Segher Boessenkool
On Wed, Nov 26, 2014 at 05:23:18PM -0600, Segher Boessenkool wrote:
 GCC has added -many to the assembler flags for over ten years now, so
 no that will not work.  You can use -mppc or similar with the assembler
 if you invoke it correctly (use  $(CC) -print-prog-name=as  to figure
s/correctly/directly/
 out how to call the assembler, if you want to stay sane and use the
 same one as the rest of the toolchain you use).
 
 
 Segher

Re: [PATCH REPOST 3/3] powerpc/vphn: move endianness fixing to vphn_unpack_associativity()

2014-11-26 Thread Benjamin Herrenschmidt
On Mon, 2014-11-17 at 18:42 +0100, Greg Kurz wrote:
 The first argument to vphn_unpack_associativity() is a const long *, but the
 parsing code expects __be64 values actually. This is inconsistent. We should
 either pass a const __be64 * or change vphn_unpack_associativity() so that
 it fixes endianness by itself.
 
 This patch does the latter, since the caller doesn't need to know about
 endianness and this allows fixing only the significant 64-bit values. Please
 note that the previous code was able to cope with 32-bit fields being split
 across two consecutive 64-bit values. Since PAPR+ doesn't say this cannot
 happen, the behaviour was kept. It requires extra checking to know when fixing
 is needed though.

While I agree with moving the endian fixing down, the patch makes me
nervous. Note that I don't fully understand the format of what we are
parsing here so I might be wrong but ...

  
  #define VPHN_FIELD_UNUSED (0xffff)
  #define VPHN_FIELD_MSB   (0x8000)
  #define VPHN_FIELD_MASK  (~VPHN_FIELD_MSB)
  
 - for (i = 1; i  VPHN_ASSOC_BUFSIZE; i++) {
 - if (be16_to_cpup(field) == VPHN_FIELD_UNUSED)
 + for (i = 1, j = 0, k = 0; i  VPHN_ASSOC_BUFSIZE;) {
 + u16 field;
 +
 + if (j % 4 == 0) {
 + fixed.packed[k] = cpu_to_be64(packed[k]);
 + k++;
 + }

So we have essentially a bunch of 16-bit fields ... the above loads and
swaps a whole 4 of them at once. However that means not only do we byteswap
them individually, but we also flip the order of the fields. This is
ok ?

 + field = be16_to_cpu(fixed.field[j]);
 +
 + if (field == VPHN_FIELD_UNUSED)
   /* All significant fields processed.
*/
   break;

For example, we might have USED,USED,USED,UNUSED ... after the swap, we
now have UNUSED,USED,USED,USED ... and we stop parsing in the above
line on the first one. Or am I missing something ? 
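
For what it's worth, here is a small user-space sketch of the layout the
union relies on, assuming the first 16-bit field sits in the most
significant bits of each 64-bit register value; whether that assumption
matches what the hcall actually returns is exactly the question above.

    #include <arpa/inet.h>  /* ntohs() as a stand-in for be16_to_cpu() */
    #include <stdint.h>
    #include <stdio.h>

    /* Byte-reverse, standing in for cpu_to_be64() on a little-endian host. */
    static uint64_t to_be64(uint64_t v)
    {
        uint64_t r = 0;
        int i;

        for (i = 0; i < 8; i++)
            r = (r << 8) | ((v >> (8 * i)) & 0xff);
        return r;
    }

    int main(void)
    {
        union {
            uint64_t packed;
            uint16_t field[4];
        } fixed;
        int i;

        /* Fields 1,2,3,4 with field 1 in the most significant 16 bits. */
        fixed.packed = to_be64(0x0001000200030004ULL);

        for (i = 0; i < 4; i++)
            printf("field[%d] = %u\n", i, (unsigned)ntohs(fixed.field[i]));
        /* Prints 1, 2, 3, 4 on a little-endian host, i.e. the field order
         * is preserved even though the bytes inside each field are swapped
         * back by the 16-bit conversion. */
        return 0;
    }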

 - if (be16_to_cpup(field)  VPHN_FIELD_MSB) {
 + if (field  VPHN_FIELD_MSB) {
   /* Data is in the lower 15 bits of this field */
 - unpacked[i] = cpu_to_be32(
 - be16_to_cpup(field)  VPHN_FIELD_MASK);
 - field++;
 + unpacked[i++] = cpu_to_be32(field  VPHN_FIELD_MASK);
 + j++;
   } else {
   /* Data is in the lower 15 bits of this field
* concatenated with the next 16 bit field
*/
 - unpacked[i] = *((__be32 *)field);
 - field += 2;
 + if (unlikely(j % 4 == 3)) {
 + /* The next field is to be copied from the next
 +  * 64-bit input value. We must fix it now.
 +  */
 + fixed.packed[k] = cpu_to_be64(packed[k]);
 + k++;
 + }
 +
 + unpacked[i++] = *((__be32 *)fixed.field[j]);
 + j += 2;
   }
   }
  
 @@ -1460,11 +1479,8 @@ static long hcall_vphn(unsigned long cpu, __be32 
 *associativity)
   long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
   u64 flags = 1;
   int hwcpu = get_hard_smp_processor_id(cpu);
 - int i;
  
   rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, hwcpu);
 - for (i = 0; i  VPHN_REGISTER_COUNT; i++)
 - retbuf[i] = cpu_to_be64(retbuf[i]);
   vphn_unpack_associativity(retbuf, associativity);
  
   return rc;



Re: [PATCH v2 3/6] pseries: Create new device hotplug entry point

2014-11-26 Thread Benjamin Herrenschmidt
On Mon, 2014-11-17 at 15:51 -0600, Nathan Fontenot wrote:
 
 For PowerVM systems, this patch creates the /sys/kernel/dlpar file that rtas
 hotplug events can be written to by drmgr and passed to the common entry 
 point.
 There is no chance of updating how we receive hotplug requests on PowerVM
 systems.

I'm not convinced this is the right place for that file, might be worth asking
Greg KH what he thinks here.

Cheers,
Ben.



Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management

2014-11-26 Thread Benjamin Herrenschmidt

  
 @@ -37,8 +38,7 @@
  
  /*
   * Pass requested state in r3:
 - *   0 - nap
 - *   1 - sleep
 + *   r3 - PNV_THREAD_NAP/SLEEP/WINKLE
   *
   * To check IRQ_HAPPENED in r4
   *   0 - don't check
 @@ -123,12 +123,62 @@ power7_enter_nap_mode:
   li  r4,KVM_HWTHREAD_IN_NAP
   stb r4,HSTATE_HWTHREAD_STATE(r13)
  #endif
 - cmpwi   cr0,r3,1
 - beq 2f
 + stb r3,PACA_THREAD_IDLE_STATE(r13)
 + cmpwi   cr1,r3,PNV_THREAD_SLEEP
 + bge cr1,2f
   IDLE_STATE_ENTER_SEQ(PPC_NAP)
   /* No return */
 -2:   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
 - /* No return */
 +2:
 + /* Sleep or winkle */
 + li  r7,1
 + mfspr   r8,SPRN_PIR
 + /*
 +  * The last 3 bits of PIR represents the thread id of a cpu
 +  * in power8. This will need adjusting for power7.
 +  */
 + andi.   r8,r8,0x07  /* Get thread id into r8 */
 + rotld   r7,r7,r8
 +
 + ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)

I assume we have already saved all non-volatile registers ? Because you
are clobbering one here and more below.

 +lwarx_loop1:
 + lwarx   r15,0,r14
 + andcr15,r15,r7  /* Clear thread bit */
 +
 + andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
 + beq last_thread
 +
 + /* Not the last thread to goto sleep */
 + stwcx.  r15,0,r14
 + bne-lwarx_loop1
 + b   common_enter
 +
 +last_thread:
 + LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 + lbz r3,0(r3)
 + cmpwi   r3,1
 + bne common_enter

This looks wrong. If the workaround is 0, we don't do the stwcx. at
all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It
should work most of the time as long as you don't hit the fairly
rare race window :)

Also it would be nice to make the above a dynamically patched feature
section, though that means pnv_need_fastsleep_workaround needs to turn
into a CPU feature bit and that needs to be done *very* early on.

Another option is to patch out manually from the pnv code the pair:

andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
beq last_thread

To turn them into nops by hand rather than using the feature system.

 + /*
 +  * Last thread of the core entering sleep. Last thread needs to execute
 +  * the hardware bug workaround code. Before that, set the lock bit to
 +  * avoid the race of other threads waking up and undoing workaround
 +  * before workaround is applied.
 +  */
 + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
 + stwcx.  r15,0,r14
 + bne-lwarx_loop1
 +
 + /* Fast sleep workaround */
 + li  r3,1
 + li  r4,1
 + li  r0,OPAL_CONFIG_CPU_IDLE_STATE
 + bl  opal_call_realmode
 +
 + /* Clear Lock bit */

It's a lock, I would add a lwsync here to be safe, and I would add an
isync before the bne- above. Just to ensure that whatever is done
inside that locked section remains in there.

 + andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
 + stw r15,0(r14)
 +
 +common_enter: /* common code for all the threads entering sleep */
 + IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
  
  _GLOBAL(power7_idle)
   /* Now check if user or arch enabled NAP mode */
 @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
  
  _GLOBAL(power7_nap)
   mr  r4,r3
 - li  r3,0
 + li  r3,PNV_THREAD_NAP
   b   power7_powersave_common
   /* No return */
  
  _GLOBAL(power7_sleep)
 - li  r3,1
 + li  r3,PNV_THREAD_SLEEP
   li  r4,1
   b   power7_powersave_common
   /* No return */
  
 -/*
 - * Make opal call in realmode. This is a generic function to be called
 - * from realmode from reset vector. It handles endianess.
 - *
 - * r13 - paca pointer
 - * r1  - stack pointer
 - * r3  - opal token
 - */
 -opal_call_realmode:
 - mflrr12
 - std r12,_LINK(r1)
 - ld  r2,PACATOC(r13)
 - /* Set opal return address */
 - LOAD_REG_ADDR(r0,return_from_opal_call)
 - mtlrr0
 - /* Handle endian-ness */
 - li  r0,MSR_LE
 - mfmsr   r12
 - andcr12,r12,r0
 - mtspr   SPRN_HSRR1,r12
 - mr  r0,r3   /* Move opal token to r0 */
 - LOAD_REG_ADDR(r11,opal)
 - ld  r12,8(r11)
 - ld  r2,0(r11)
 - mtspr   SPRN_HSRR0,r12
 - hrfid
 -
 -return_from_opal_call:
 - FIXUP_ENDIAN
 - ld  r0,_LINK(r1)
 - mtlrr0
 - blr
 -
  #define CHECK_HMI_INTERRUPT  \
   mfspr   r0,SPRN_SRR1;   \
  BEGIN_FTR_SECTION_NESTED(66);
 \
 @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); 
 \
   /* Invoke opal call to handle hmi */\
   ld  r2,PACATOC(r13);\
   ld  r1,PACAR1(r13);   

Re: [PATCH 3/3] KVM: PPC: BOOK3S: HV: Rename variable for better readability

2014-11-26 Thread Paul Mackerras
On Mon, Oct 20, 2014 at 07:59:00PM +0530, Aneesh Kumar K.V wrote:
 Minor cleanup
 
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 25 +
  1 file changed, 13 insertions(+), 12 deletions(-)
 
 diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
 b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 index 78e689b066f1..2922f8d127ff 100644
 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 @@ -523,7 +523,7 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
   unsigned long *args = vcpu-arch.gpr[4];
   __be64 *hp, *hptes[4];
   unsigned long tlbrb[4];
 - long int i, j, k, n, found, indexes[4];
 + long int i, j, k, collected_hpte, found, indexes[4];

Hmmm... I don't find it more readable, because collected_hpte sounds
like it contains a HPTE value.  Also I don't like using a long name
for something that is just a temporary value inside a function, and
n is a suitable name for a temporary variable counting the number of
things we have.  I would prefer just adding a comment like this:

-   n = 0;
+   n = 0;  /* # values collected in tlbrb[], indexes[] etc. */

Paul.

Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus

2014-11-26 Thread Benjamin Herrenschmidt
On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote:

 diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
 b/arch/powerpc/kernel/cpu_setup_power.S
 index 4673353..66874aa 100644
 --- a/arch/powerpc/kernel/cpu_setup_power.S
 +++ b/arch/powerpc/kernel/cpu_setup_power.S
 @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8)
   beqlr
   li  r0,0
   mtspr   SPRN_LPID,r0
 + mtspr   SPRN_WORT,r0
 + mtspr   SPRN_WORC,r0
   mfspr   r3,SPRN_LPCR
   ori r3, r3, LPCR_PECEDH
   bl  __init_LPCR
 @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8)
   li  r0,0
   mtspr   SPRN_LPID,r0
   mfspr   r3,SPRN_LPCR
 + mtspr   SPRN_WORT,r0
 + mtspr   SPRN_WORC,r0
   ori r3, r3, LPCR_PECEDH
   bl  __init_LPCR
   bl  __init_HFSCR

Clearing WORT and WORC might not be the best thing. We know the HW folks
have been trying to tune those values and we might need to preserve what
the boot FW has set.

Can you get in touch with them and double check what we should do here ?

 diff --git a/arch/powerpc/kernel/exceptions-64s.S 
 b/arch/powerpc/kernel/exceptions-64s.S
 index 3311c8d..c9897cb 100644
 --- a/arch/powerpc/kernel/exceptions-64s.S
 +++ b/arch/powerpc/kernel/exceptions-64s.S
 @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION
  
   cmpwi   cr1,r13,2
  
 + /* Check if last bit of HSPGR0 is set. This indicates whether we are
 +  * waking up from winkle */
 + li  r3,1
 + mfspr   r4,SPRN_HSPRG0
 + and r5,r4,r3
 + cmpwi   cr4,r5,1/* Store result in cr4 for later use */
 +
 + andcr4,r4,r3
 + mtspr   SPRN_HSPRG0,r4
 +

There is an open question here whether adding a beq cr4,+8 after the
cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either
measure or talk to HW folks). Also we could write directly to r13...

   GET_PACA(r13)
   lbz r0,PACA_THREAD_IDLE_STATE(r13)
   cmpwi   cr2,r0,PNV_THREAD_NAP
 diff --git a/arch/powerpc/kernel/idle_power7.S 
 b/arch/powerpc/kernel/idle_power7.S
 index c1d590f..78c30b0 100644
 --- a/arch/powerpc/kernel/idle_power7.S
 +++ b/arch/powerpc/kernel/idle_power7.S
 @@ -19,8 +19,22 @@
  #include asm/kvm_book3s_asm.h
  #include asm/opal.h
  #include asm/cpuidle.h
 +#include asm/mmu-hash64.h
  
  #undef DEBUG
 +/*
 + * Use unused space in the interrupt stack to save and restore
 + * registers for winkle support.
 + */
 +#define _SDR1GPR3
 +#define _RPR GPR4
 +#define _SPURR   GPR5
 +#define _PURRGPR6
 +#define _TSCRGPR7
 +#define _DSCRGPR8
 +#define _AMORGPR9
 +#define _PMC5GPR10
 +#define _PMC6GPR11

WORT/WORC need saving/restoring

  /* Idle state entry routines */
  
 @@ -153,32 +167,60 @@ lwarx_loop1:
   b   common_enter
  
  last_thread:
 - LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 - lbz r3,0(r3)
 - cmpwi   r3,1
 - bne common_enter
   /*
* Last thread of the core entering sleep. Last thread needs to execute
* the hardware bug workaround code. Before that, set the lock bit to
* avoid the race of other threads waking up and undoing workaround
* before workaround is applied.
*/
 + LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 + lbz r3,0(r3)
 + cmpwi   r3,1
 + bne common_enter
 +
   ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
   stwcx.  r15,0,r14
   bne-lwarx_loop1
  
   /* Fast sleep workaround */
 + mfcrr16 /* Backup CR to a non-volatile register */
   li  r3,1
   li  r4,1
   li  r0,OPAL_CONFIG_CPU_IDLE_STATE
   bl  opal_call_realmode
 + mtcrr16 /* Restore CR */

Why isn't the above already in the previous patch ? Also see my comment
about using a non-volatile CR instead.

   /* Clear Lock bit */
   andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
   stw r15,0(r14)
  
 -common_enter: /* common code for all the threads entering sleep */
 +common_enter: /* common code for all the threads entering sleep or winkle*/
 + bgt cr1,enter_winkle
   IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
 +enter_winkle:
 + /*
 +  * Note all register i.e per-core, per-subcore or per-thread is saved
 +  * here since any thread in the core might wake up first
 +  */
 + mfspr   r3,SPRN_SDR1
 + std r3,_SDR1(r1)
 + mfspr   r3,SPRN_RPR
 + std r3,_RPR(r1)
 + mfspr   r3,SPRN_SPURR
 + std r3,_SPURR(r1)
 + mfspr   r3,SPRN_PURR
 + std r3,_PURR(r1)
 + mfspr   r3,SPRN_TSCR
 + std r3,_TSCR(r1)
 + mfspr   r3,SPRN_DSCR
 + std r3,_DSCR(r1)
 + mfspr   r3,SPRN_AMOR
 + std r3,_AMOR(r1)
 + mfspr   r3,SPRN_PMC5
 + std r3,_PMC5(r1)
 + mfspr   r3,SPRN_PMC6
 + std r3,_PMC6(r1)
 + IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
  
  _GLOBAL(power7_idle)
   /* Now check if user or arch enabled NAP mode */
 @@ -201,6 +243,12 @@ 

Right location in sysfs for dlpar file

2014-11-26 Thread Benjamin Herrenschmidt
Hi Greg,

So Nathan is working on a patch series to clean up and improve our
DLPAR infrastructure, which is basically our hotplug mechanism when
running under the PowerVM (aka pHyp) and KVM hypervisors.

I'll let Nathan give you a bit more details/background and answer any
subsequent questions you might have as this is really his area of
expertise.

To cut a long story short, we need a sysfs file that allows our
userspace tools to notify the kernel of hotplug events coming from
the management console (which talks to userspace daemons using a
proprietary protocol) to initiate the hotplug operations, which in
turn get dispatched internally in the kernel to the right subsystem
(memory, cpu, pci, ...) based on the resource type.
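
To make the shape of that interface concrete, here is a minimal sketch of
such an entry point; all names are illustrative rather than the actual
patch, and hanging it off /sys/firmware or /sys/hypervisor would just mean
using a different kobject.

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /* Stand-in for the common dispatcher that routes the request to the
     * right subsystem (memory, cpu, pci, ...) based on the resource type. */
    static int handle_hotplug_request(const char *buf, size_t count)
    {
        return 0;
    }

    static ssize_t dlpar_store(struct kobject *kobj, struct kobj_attribute *attr,
                               const char *buf, size_t count)
    {
        int rc = handle_hotplug_request(buf, count);

        return rc ? rc : count;
    }

    static struct kobj_attribute dlpar_attr = __ATTR(dlpar, 0200, NULL, dlpar_store);

    static int __init dlpar_sysfs_init(void)
    {
        /* Creates /sys/kernel/dlpar, writable by root only. */
        return sysfs_create_file(kernel_kobj, &dlpar_attr.attr);
    }
    late_initcall(dlpar_sysfs_init);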

On IRC, Greg suggested /sys/firmware and /sys/hypervisor which both
look like a reasonable option to me, probably better than dlpar...

Cheers,
Ben.



RE: [PATCH v2 0/3] fix a kernel panic on fsl corenet board when CONFIG_CLK_PPC_CORENET is enabled

2014-11-26 Thread Yuantian Tang
Hello Mike,

Could you please apply this patch?
This patch has been acked for a while.

Thanks,
Yuantian

 -Original Message-
 From: Linuxppc-dev
 [mailto:linuxppc-dev-bounces+b29983=freescale@lists.ozlabs.org] On
 Behalf Of Scott Wood
 Sent: Friday, November 07, 2014 12:08 PM
 To: Kevin Hao
 Cc: Mike Turquette; Gerhard Sittig; Lu Jingchang-B35083;
 linuxppc-dev@lists.ozlabs.org
 Subject: Re: [PATCH v2 0/3] fix a kernel panic on fsl corenet board when
 CONFIG_CLK_PPC_CORENET is enabled
 
 On Sun, 2014-10-19 at 14:11 +0800, Kevin Hao wrote:
  Hi,
 
  I have done a boot test on p2014rdb and t4240qds boards. I don't have
  an access to mpc512x board, so only build test for that.
 
  v2:
   - Revert the commit da788acb2838 first.
   - Invoke of_clk_init() from a common place.
 
  v1
  This tries to fix a kernel panic introduced by commit da788acb2838
  (clk: ppc-corenet: Fix Section mismatch warning).
 
  Kevin Hao (3):
Revert clk: ppc-corenet: Fix Section mismatch warning
powerpc: call of_clk_init() from time_init()
clk: ppc-corenet: fix section mismatch warning
 
   arch/powerpc/kernel/time.c|  5 
   arch/powerpc/platforms/512x/clock-commonclk.c | 11 ---
   drivers/clk/clk-ppc-corenet.c | 43 
  ---
   3 files changed, 16 insertions(+), 43 deletions(-)
 
 
 Acked-by: Scott Wood scottw...@freescale.com
 
 Whose tree should this go through?
 
 -Scott
 
 

Re: [PATCH] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

2014-11-26 Thread Peter Bergner
On Thu, 2014-11-27 at 09:38 +1100, Michael Ellerman wrote:
 On Thu, 2014-11-27 at 08:11 +1100, Anton Blanchard wrote:
  I used some 64 bit instructions when adding the 32 bit getcpu VDSO
  function. Fix it.
 
 Ouch. The symptom is a SIGILL I presume?

Nope, you don't get a SIGILL when executing 64-bit instructions in
32-bit mode, so it'll happily just execute the instruction, doing
a full 64-bit compare.  I'm guessing that the upper 32-bits of both
r3 and r4 contain zeros, so we're probably just getting lucky.
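
To illustrate why that is only luck, a small C analogy (not the vDSO asm
itself): when the caller only guarantees the low 32 bits, a full 64-bit
compare can be fooled by stale upper bits.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* A 32-bit task only defines the low word; pretend the upper
         * word still holds leftover junk from earlier 64-bit code. */
        uint64_t r3 = 0xdeadbeef00000000ULL;

        /* cmpdi-style check, looks at all 64 bits: wrongly "not NULL" */
        printf("64-bit compare: %s\n", r3 == 0 ? "NULL" : "not NULL");

        /* cmpwi-style check, looks at the low 32 bits only: "NULL" */
        printf("32-bit compare: %s\n", (uint32_t)r3 == 0 ? "NULL" : "not NULL");

        return 0;
    }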


 Could we catch this by forcing -m32 in the CFLAGS for vdso32 ?

As Segher mentioned, GCC passing -many down to the assembler means
-m32 won't help.  It was due to Anton disabling that gcc feature
that this was caught.

Peter




[git pull] Please pull mpe.git for-linus branch (for powerpc)

2014-11-26 Thread Michael Ellerman
Hi Linus,

Here are five fixes for you to pull please.

I think these are all rc6 material, but I'm still learning so let me know if
you disagree :)

They're all CC'ed to stable except the Fix PE state format one which went in
this release.

cheers


The following changes since commit d7ce4377494adfaf8afb15ecf4f07d399bbf13d9:

  powerpc/fsl_msi: mark the msi cascade handler IRQF_NO_THREAD (2014-11-17 
22:00:30 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux.git for-linus

for you to fetch changes up to 152d44a853e42952f6c8a504fb1f8eefd21fd5fd:

  powerpc: 32 bit getcpu VDSO function uses 64 bit instructions (2014-11-27 
09:42:12 +1100)


Anton Blanchard (1):
  powerpc: 32 bit getcpu VDSO function uses 64 bit instructions

Gavin Shan (2):
  powerpc/eeh: Fix PE state format
  powerpc/powernv: Replace OPAL_DEASSERT_RESET with EEH_RESET_DEACTIVATE

Laurent Dufour (1):
  powerpc/pseries: Fix endiannes issue in RTAS call from xmon

Mahesh Salgaonkar (1):
  powerpc/powernv: Fix the hmi event version check.

 arch/powerpc/kernel/eeh_sysfs.c   | 2 +-
 arch/powerpc/kernel/vdso32/getcpu.S   | 4 ++--
 arch/powerpc/platforms/powernv/opal-hmi.c | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 arch/powerpc/xmon/xmon.c  | 6 +++---
 5 files changed, 8 insertions(+), 8 deletions(-)






Re: [PATCH v2 3/4] powernv: cpuidle: Redesign idle states management

2014-11-26 Thread Shreyas B Prabhu
Hi Ben,

On Thursday 27 November 2014 06:07 AM, Benjamin Herrenschmidt wrote:
 
  
 @@ -37,8 +38,7 @@
  
  /*
   * Pass requested state in r3:
 - *  0 - nap
 - *  1 - sleep
 + *  r3 - PNV_THREAD_NAP/SLEEP/WINKLE
   *
   * To check IRQ_HAPPENED in r4
   *  0 - don't check
 @@ -123,12 +123,62 @@ power7_enter_nap_mode:
  li  r4,KVM_HWTHREAD_IN_NAP
  stb r4,HSTATE_HWTHREAD_STATE(r13)
  #endif
 -cmpwi   cr0,r3,1
 -beq 2f
 +stb r3,PACA_THREAD_IDLE_STATE(r13)
 +cmpwi   cr1,r3,PNV_THREAD_SLEEP
 +bge cr1,2f
  IDLE_STATE_ENTER_SEQ(PPC_NAP)
  /* No return */
 -2:  IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
 -/* No return */
 +2:
 +/* Sleep or winkle */
 +li  r7,1
 +mfspr   r8,SPRN_PIR
 +/*
 + * The last 3 bits of PIR represents the thread id of a cpu
 + * in power8. This will need adjusting for power7.
 + */
 +andi.   r8,r8,0x07  /* Get thread id into r8 */
 +rotld   r7,r7,r8
 +
 +ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
 
 I assume we have already saved all non-volatile registers ? Because you
 are clobbering one here and more below.

Yes. At this stage all the non-volatile registers are already saved on the
stack.
 
 +lwarx_loop1:
 +lwarx   r15,0,r14
 +andcr15,r15,r7  /* Clear thread bit */
 +
 +andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
 +beq last_thread
 +
 +/* Not the last thread to goto sleep */
 +stwcx.  r15,0,r14
 +bne-lwarx_loop1
 +b   common_enter
 +
 +last_thread:
 +LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 +lbz r3,0(r3)
 +cmpwi   r3,1
 +bne common_enter
 
 This looks wrong. If the workaround is 0, we don't do the stwcx. at
 all... Did you try with pnv_need_fastsleep_workaround set to 0 ? It
 should work most of the time as long as you don't hit the fairly
 rare race window :)
 

My bad. I missed the stwcx. in the pnv_need_fastsleep_workaround = 0 path.
 Also it would be nice to make the above a dynamically patches feature
 section, though that means pnv_need_fastsleep_workaround needs to turn
 into a CPU feature bit and that needs to be done *very* early on.
 
 Another option is to patch out manually from the pnv code the pair:
 
   andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
   beq last_thread
 
 To turn them into nops by hand rather than using the feature system.
 

Okay. I'll see which works out best here.
 +/*
 + * Last thread of the core entering sleep. Last thread needs to execute
 + * the hardware bug workaround code. Before that, set the lock bit to
 + * avoid the race of other threads waking up and undoing workaround
 + * before workaround is applied.
 + */
 +ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
 +stwcx.  r15,0,r14
 +bne-lwarx_loop1
 +
 +/* Fast sleep workaround */
 +li  r3,1
 +li  r4,1
 +li  r0,OPAL_CONFIG_CPU_IDLE_STATE
 +bl  opal_call_realmode
 +
 +/* Clear Lock bit */
 
 It's a lock, I would add a lwsync here to be safe, and I would add an
 isync before the bne- above. Just to ensure that whatever is done
 inside that locked section remains in there.


Okay. Will add it.
 +andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
 +stw r15,0(r14)
 +
 +common_enter: /* common code for all the threads entering sleep */
 +IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
  
  _GLOBAL(power7_idle)
  /* Now check if user or arch enabled NAP mode */
 @@ -141,49 +191,16 @@ _GLOBAL(power7_idle)
  
  _GLOBAL(power7_nap)
  mr  r4,r3
 -li  r3,0
 +li  r3,PNV_THREAD_NAP
  b   power7_powersave_common
  /* No return */
  
  _GLOBAL(power7_sleep)
 -li  r3,1
 +li  r3,PNV_THREAD_SLEEP
  li  r4,1
  b   power7_powersave_common
  /* No return */
  
 -/*
 - * Make opal call in realmode. This is a generic function to be called
 - * from realmode from reset vector. It handles endianess.
 - *
 - * r13 - paca pointer
 - * r1  - stack pointer
 - * r3  - opal token
 - */
 -opal_call_realmode:
 -mflrr12
 -std r12,_LINK(r1)
 -ld  r2,PACATOC(r13)
 -/* Set opal return address */
 -LOAD_REG_ADDR(r0,return_from_opal_call)
 -mtlrr0
 -/* Handle endian-ness */
 -li  r0,MSR_LE
 -mfmsr   r12
 -andcr12,r12,r0
 -mtspr   SPRN_HSRR1,r12
 -mr  r0,r3   /* Move opal token to r0 */
 -LOAD_REG_ADDR(r11,opal)
 -ld  r12,8(r11)
 -ld  r2,0(r11)
 -mtspr   SPRN_HSRR0,r12
 -hrfid
 -
 -return_from_opal_call:
 -FIXUP_ENDIAN
 -ld  r0,_LINK(r1)
 -mtlrr0
 -blr
 -
  #define CHECK_HMI_INTERRUPT \
  mfspr   r0,SPRN_SRR1;   \
  BEGIN_FTR_SECTION_NESTED(66);   
 \
 @@ -196,10 +213,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 

Re: [PATCH v2 4/4] powernv: powerpc: Add winkle support for offline cpus

2014-11-26 Thread Shreyas B Prabhu
Hi Ben,

On Thursday 27 November 2014 07:25 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2014-11-25 at 16:47 +0530, Shreyas B. Prabhu wrote:
 
 diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
 b/arch/powerpc/kernel/cpu_setup_power.S
 index 4673353..66874aa 100644
 --- a/arch/powerpc/kernel/cpu_setup_power.S
 +++ b/arch/powerpc/kernel/cpu_setup_power.S
 @@ -55,6 +55,8 @@ _GLOBAL(__setup_cpu_power8)
  beqlr
  li  r0,0
  mtspr   SPRN_LPID,r0
 +mtspr   SPRN_WORT,r0
 +mtspr   SPRN_WORC,r0
  mfspr   r3,SPRN_LPCR
  ori r3, r3, LPCR_PECEDH
  bl  __init_LPCR
 @@ -75,6 +77,8 @@ _GLOBAL(__restore_cpu_power8)
  li  r0,0
  mtspr   SPRN_LPID,r0
  mfspr   r3,SPRN_LPCR
 +mtspr   SPRN_WORT,r0
 +mtspr   SPRN_WORC,r0
  ori r3, r3, LPCR_PECEDH
  bl  __init_LPCR
  bl  __init_HFSCR
 
 Clearing WORT and WORC might not be the best thing. We know the HW folks
 have been trying to tune those values and we might need to preserve what
 the boot FW has set.
 
 Can you get in touch with them and double check what we should do here ?
 

I observed these were always 0. I'll speak to HW folks as you suggested.

 diff --git a/arch/powerpc/kernel/exceptions-64s.S 
 b/arch/powerpc/kernel/exceptions-64s.S
 index 3311c8d..c9897cb 100644
 --- a/arch/powerpc/kernel/exceptions-64s.S
 +++ b/arch/powerpc/kernel/exceptions-64s.S
 @@ -112,6 +112,16 @@ BEGIN_FTR_SECTION
  
  cmpwi   cr1,r13,2
  
 +/* Check if last bit of HSPGR0 is set. This indicates whether we are
 + * waking up from winkle */
 +li  r3,1
 +mfspr   r4,SPRN_HSPRG0
 +and r5,r4,r3
 +cmpwi   cr4,r5,1/* Store result in cr4 for later use */
 +
 +andcr4,r4,r3
 +mtspr   SPRN_HSPRG0,r4
 +
 
 There is an open question here whether adding a beq cr4,+8 after the
 cmpwi (or a +4 after the andc) is worthwhile. Can you check ? (either
 measure or talk to HW folks). 

Okay. Is this because mtspr is a heavier op than beq?
 Also we could write directly to r13...

You mean use mr r13,r4 instead of GET_PACA?

  GET_PACA(r13)
  lbz r0,PACA_THREAD_IDLE_STATE(r13)
  cmpwi   cr2,r0,PNV_THREAD_NAP
 diff --git a/arch/powerpc/kernel/idle_power7.S 
 b/arch/powerpc/kernel/idle_power7.S
 index c1d590f..78c30b0 100644
 --- a/arch/powerpc/kernel/idle_power7.S
 +++ b/arch/powerpc/kernel/idle_power7.S
 @@ -19,8 +19,22 @@
  #include asm/kvm_book3s_asm.h
  #include asm/opal.h
  #include asm/cpuidle.h
 +#include asm/mmu-hash64.h
  
  #undef DEBUG
 +/*
 + * Use unused space in the interrupt stack to save and restore
 + * registers for winkle support.
 + */
 +#define _SDR1   GPR3
 +#define _RPRGPR4
 +#define _SPURR  GPR5
 +#define _PURR   GPR6
 +#define _TSCR   GPR7
 +#define _DSCR   GPR8
 +#define _AMOR   GPR9
 +#define _PMC5   GPR10
 +#define _PMC6   GPR11
 
 WORT/WORC need saving/restoring

The reason I skipped these was that they were always 0. But since they are
set by FW, I'll save and restore them.

 
  /* Idle state entry routines */
  
 @@ -153,32 +167,60 @@ lwarx_loop1:
  b   common_enter
  
  last_thread:
 -LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 -lbz r3,0(r3)
 -cmpwi   r3,1
 -bne common_enter
  /*
   * Last thread of the core entering sleep. Last thread needs to execute
   * the hardware bug workaround code. Before that, set the lock bit to
   * avoid the race of other threads waking up and undoing workaround
   * before workaround is applied.
   */
 +LOAD_REG_ADDR(r3, pnv_need_fastsleep_workaround)
 +lbz r3,0(r3)
 +cmpwi   r3,1
 +bne common_enter
 +
  ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
  stwcx.  r15,0,r14
  bne-lwarx_loop1
  
  /* Fast sleep workaround */
 +mfcrr16 /* Backup CR to a non-volatile register */
  li  r3,1
  li  r4,1
  li  r0,OPAL_CONFIG_CPU_IDLE_STATE
  bl  opal_call_realmode
 +mtcrr16 /* Restore CR */
 
 Why isn't the above already in the previous patch ? Also see my comment
 about using a non-volatile CR instead.

In the previous patch I wasn't using any CR after this OPAL call. Hence
I had skipped it. As you suggested I'll avoid this by using CR[234].
 
  /* Clear Lock bit */
  andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
  stw r15,0(r14)
  
 -common_enter: /* common code for all the threads entering sleep */
 +common_enter: /* common code for all the threads entering sleep or winkle*/
 +bgt cr1,enter_winkle
  IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
 +enter_winkle:
 +/*
 + * Note all register i.e per-core, per-subcore or per-thread is saved
 + * here since any thread in the core might wake up first
 + */
 +mfspr   r3,SPRN_SDR1
 +std r3,_SDR1(r1)
 +mfspr   r3,SPRN_RPR
 +std r3,_RPR(r1)
 +mfspr   r3,SPRN_SPURR
 +std r3,_SPURR(r1)
 +mfspr   r3,SPRN_PURR
 +

Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Heiko Carstens
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
   But this one was  giving users in field false positives.
  
  So lets try to fix those, ok? If we cant, then tough luck.
 
 Sure.
 I think the simplest way might be to make spinlock disable
 premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
 
 As a result, userspace access will fail and caller will
 get a nice error.

Yes, _userspace_ now sees unpredictable behaviour, instead of the
kernel emitting a big loud warning to the console.

Please consider this simple example:

int bar(char __user *ptr)
{
...
if (copy_to_user(ptr, ...)
return -EFAULT;
...
}

SYSCALL_DEFINE1(foo, char __user *, ptr)
{
int rc;

...
rc = bar(ptr);
if (rc)
goto out;
...
out:
return rc;  
}

The above simple system call just works fine, with and without your change,
however if somebody (incorrectly) changes sys_foo() to the code below:

spin_lock(lock);
rc = bar(ptr);
if (rc)
goto out;
out:
spin_unlock(lock);
return rc;  

Broken code like above used to generate warnings. With your change we won't
see any warnings anymore. Instead we get random and bad behaviour:

For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see
a fault, potentially schedule and potentially deadlock on lock.
Without _any_ warning anymore.

For CONFIG_PREEMPT if the page at ptr is mapped, everything works. However if
the page is not mapped, userspace now all of a sudden will see an invalid(!)
-EFAULT return code, instead of the kernel resolving the page fault.
Yes, the kernel can't resolve the fault since we hold a spinlock. But the
above bogus code did give warnings to give you an idea that something probably
is not correct.

Who on earth is supposed to debug crap like this???

What we really want is:

Code like
spin_lock(lock);
if (copy_to_user(...))
rc = ...
spin_unlock(lock);
really *should* generate warnings like it did before.

And *only* code like
spin_lock(lock);
page_fault_disable();
if (copy_to_user(...))
rc = ...
page_fault_enable();
spin_unlock(lock);
should not generate warnings, since the author hopefully knew what he did.

We could achieve that by e.g. adding a couple of pagefault disabled bits
within current_thread_info()-preempt_count, which would allow
pagefault_disable() and pagefault_enable() to modify a different part of
preempt_count than it does now, so there is a way to tell if pagefaults have
been explicitly disabled or are just a side effect of preemption being
disabled.
This would allow might_fault() to restore its old sane behaviour for the
!page_fault_disabled() case.
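
A rough sketch of that idea (the bit position and helpers are illustrative
only, not an existing API):

    /* Reserve a few preempt_count bits for explicit pagefault_disable(),
     * separate from the ordinary preemption-disable count. */
    #define PAGEFAULT_SHIFT     24                      /* illustrative */
    #define PAGEFAULT_OFFSET    (1UL << PAGEFAULT_SHIFT)
    #define PAGEFAULT_MASK      (0xfUL << PAGEFAULT_SHIFT)

    static inline void pagefault_disable(void)
    {
        preempt_count_add(PAGEFAULT_OFFSET);
        barrier();
    }

    static inline void pagefault_enable(void)
    {
        barrier();
        preempt_count_sub(PAGEFAULT_OFFSET);
    }

    static inline bool pagefault_disabled(void)
    {
        return preempt_count() & PAGEFAULT_MASK;
    }

    /* might_fault() could then do
     *     if (!pagefault_disabled())
     *             might_sleep();
     * so the spin_lock() + copy_to_user() case above warns again, while the
     * pagefault_disable() + copy_to_user() case stays silent. */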


Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

2014-11-26 Thread Michael S. Tsirkin
On Thu, Nov 27, 2014 at 08:09:19AM +0100, Heiko Carstens wrote:
 On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote:
But this one was  giving users in field false positives.
   
   So lets try to fix those, ok? If we cant, then tough luck.
  
  Sure.
  I think the simplest way might be to make spinlock disable
  premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
  
  As a result, userspace access will fail and caller will
  get a nice error.
 
 Yes, _userspace_ now sees unpredictable behaviour, instead of that the
 kernel emits a big loud warning to the console.

So I don't object to adding more debugging at all.
Sure, would be nice.

But the fix is not an unconditional might_sleep
within might_fault, as this would trigger false positives.

Rather, detect that you took a spinlock
without disabling preemption.


 Please consider this simple example:
 
 int bar(char __user *ptr)
 {
   ...
   if (copy_to_user(ptr, ...)
   return -EFAULT;
   ...
 }
 
 SYSCALL_DEFINE1(foo, char __user *, ptr)
 {
   int rc;
 
   ...
   rc = bar(ptr);
   if (rc)
   goto out;
   ...
 out:
   return rc;  
 }
 
 The above simple system call just works fine, with and without your change,
 however if somebody (incorrectly) changes sys_foo() to the code below:
 
   spin_lock(lock);
   rc = bar(ptr);
   if (rc)
   goto out;
 out:
   spin_unlock(lock);
   return rc;  
 
 Broken code like above used to generate warnings. With your change we won't
 see any warnings anymore. Instead we get random and bad behaviour:
 
 For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see
 a fault, potentially schedule and potentially deadlock on lock.
 Without _any_ warning anymore.
 
 For CONFIG_PREEMPT if the page at ptr is mapped, everthing works. However if
 the page is not mapped, userspace now all of the sudden will see an invalid(!)
 -EFAULT return code, instead of that the kernel resolved the page fault.
 Yes, the kernel can't resolve the fault since we hold a spinlock. But the
 above bogus code did give warnings to give you an idea that something probably
 is not correct.
 
 Who on earth is supposed to debug crap like this???
 
 What we really want is:
 
 Code like
   spin_lock(lock);
   if (copy_to_user(...))
   rc = ...
   spin_unlock(lock);
 really *should* generate warnings like it did before.
 
 And *only* code like
   spin_lock(lock);
   page_fault_disable();
   if (copy_to_user(...))
   rc = ...
   page_fault_enable();
   spin_unlock(lock);
 should not generate warnings, since the author hopefully knew what he did.
 
 We could achieve that by e.g. adding a couple of pagefault disabled bits
 within current_thread_info()-preempt_count, which would allow
 pagefault_disable() and pagefault_enable() to modify a different part of
 preempt_count than it does now, so there is a way to tell if pagefaults have
 been explicitly disabled or are just a side effect of preemption being
 disabled.
 This would allow might_fault() to restore its old sane behaviour for the
 !page_fault_disabled() case.

Exactly. I agree, that would be a useful debugging tool.

In fact this comment in mm/memory.c hints at this:
 * it would be nicer only to annotate paths which are not under
 * pagefault_disable,

it further says
 * however that requires a larger audit and
 * providing helpers like get_user_atomic.

but I think that what you outline is a better way to do this.

-- 
MST