[powerpc:next 90/96] warning: (PPC_C2K && ..) selects NOT_COHERENT_CACHE which has unmet direct dependencies (4xx || ..)

2017-08-11 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
head:   df4c7983189491302a6000b2dcb14d8093f8fddf
commit: 968159c0031ac1e07ab4426397e786c9c483f068 [90/96] powerpc/8xx: Getting rid of remaining use of CONFIG_8xx
config: powerpc-c2k_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout 968159c0031ac1e07ab4426397e786c9c483f068
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All warnings (new ones prefixed by >>):

warning: (PPC_C2K && AMIGAONE) selects NOT_COHERENT_CACHE which has unmet direct dependencies (4xx || PPC_8xx || E200 || PPC_MPC512x || GAMECUBE_COMMON)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [bug report] powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range

2017-08-11 Thread Dan Carpenter
On Fri, Aug 11, 2017 at 03:16:08PM -0700, Tyrel Datwyler wrote:
> On 08/11/2017 01:15 PM, Dan Carpenter wrote:
> > Hello Benjamin Herrenschmidt,
> > 
> > This is a semi-automatic email about new static checker warnings.
> > 
> > The patch 424de9c6e3f8: "powerpc/mm/radix: Avoid flushing the PWC on 
> > every flush_tlb_range" from Jul 19, 2017, leads to the following 
> > Smatch complaint:
> > 
> > arch/powerpc/mm/tlb-radix.c:368 radix__flush_tlb_collapsed_pmd()
> >  error: we previously assumed 'mm' could be null (see line 362)
> > 
> > arch/powerpc/mm/tlb-radix.c
> >    361  
> >    362  pid = mm ? mm->context.id : 0;
> >                    ^^
> > Check for NULL.
> > 
> >    363  if (unlikely(pid == MMU_NO_CONTEXT))
> >    364  goto no_context;
> >    365  
> >    366  /* 4k page size, just blow the world */
> >    367  if (PAGE_SIZE == 0x1000) {
> >    368  radix__flush_all_mm(mm);
> >                             ^^
> > Unchecked dereference.
> 
> Appears to be a false positive. MMU_NO_CONTEXT I believe is defined as
> "0". So, maybe it would be clearer that we take the goto branch if this
> line read:
> 
> 362   pid = mm ? mm->context.id : MMU_NO_CONTEXT;
> 

Ah...  This is because I'm compiling code for other arches in a "best
effort" way.  It doesn't pull in the right headers so it doesn't know
the value of MMU_NO_CONTEXT.  Otherwise it would read the code correctly
and not complain.

Sorry for that.

regards,
dan carpenter
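
To spell out why the checker's complaint is harmless here: with MMU_NO_CONTEXT
defined as 0, the NULL case can never reach the dereference. A minimal sketch
of the control flow (illustrative only, not code from the patch):

	unsigned long pid = mm ? mm->context.id : 0;	/* 0 == MMU_NO_CONTEXT */

	if (unlikely(pid == MMU_NO_CONTEXT))
		goto no_context;	/* a NULL mm always takes this branch */

	radix__flush_all_mm(mm);	/* reached only when mm != NULL */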


Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

2017-08-11 Thread Benjamin Herrenschmidt
On Fri, 2017-08-11 at 21:06 +1000, Nicholas Piggin wrote:
> Other than that your series seems good to me if you repost it you
> can add
> 
> Reviewed-by: Nicholas Piggin 
> 
> This one out of the series is the bugfix so it should go to stable
> as well, right?

Yup.

Ben.



Re: [bug report] powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range

2017-08-11 Thread Tyrel Datwyler
On 08/11/2017 01:15 PM, Dan Carpenter wrote:
> Hello Benjamin Herrenschmidt,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch 424de9c6e3f8: "powerpc/mm/radix: Avoid flushing the PWC on 
> every flush_tlb_range" from Jul 19, 2017, leads to the following 
> Smatch complaint:
> 
> arch/powerpc/mm/tlb-radix.c:368 radix__flush_tlb_collapsed_pmd()
>error: we previously assumed 'mm' could be null (see line 362)
> 
> arch/powerpc/mm/tlb-radix.c
>    361
>    362  pid = mm ? mm->context.id : 0;
>                    ^^
> Check for NULL.
> 
>    363  if (unlikely(pid == MMU_NO_CONTEXT))
>    364  goto no_context;
>    365
>    366  /* 4k page size, just blow the world */
>    367  if (PAGE_SIZE == 0x1000) {
>    368  radix__flush_all_mm(mm);
>                             ^^
> Unchecked dereference.

Appears to be a false positive. MMU_NO_CONTEXT I believe is defined as
"0". So, maybe it would be clearer that we take the goto branch if this
line read:

362     pid = mm ? mm->context.id : MMU_NO_CONTEXT;

-Tyrel

> 
>    369  return;
>    370  }
> 
> regards,
> dan carpenter
> 



Re: [PATCH 0/4] Allow non-legacy cards to be vgaarb default

2017-08-11 Thread Bjorn Helgaas
On Tue, Jul 25, 2017 at 03:56:20PM +, Gabriele Paoloni wrote:
> > Having practically zero background in gfx development (either kernel or
> > Xorg), I think the problem is that vga_default_device() /
> > vga_set_default_device(), which -- apparently -- "boot_vga" is based
> > upon, come from "drivers/gpu/vga/vgaarb.c". Namely, the concept of
> > "primary / boot display device" is tied to the VGA arbiter, plus only a
> > PCI device can currently be marked as primary/boot display device.
> > 
> > Can these concepts be split from each other? (I can fully imagine that
> > this would result in a userspace visible interface change (or
> > addition),
> > so that e.g. "/sys/devices/**/boot_gpu" would have to be consulted by
> > display servers.)
> > 
> > (Sorry if I'm totally wrong.)
> > 
> > ... Hm, reading the thread starter at
> >  > d...@lists.ozlabs.org/msg120851.html>,
> > and the references within... It looks like this work is motivated by
> > hardware that is supposed to be PCI, but actually breaks the specs. Is
> > that correct? If so, then I don't think I can suggest anything useful.
> 
> My understanding is that the current PCIe HW is specs compliant but the
> vgaarb, in order to make a VGA device the default one, requires all the
> bridges on top of such device to have the "VGA Enable" bit set (optional
> bit in the PCI Express™ to PCI/PCI-X Bridge Spec). I.e. all the bridges
> on top have to support legacy VGA devices; and this is not mandatory
> from the specs...right?

Per the PCIe-to-PCI Bridge spec r1.0, sec 5.1.2.13, the VGA Enable bit
is optional, as you say.  The PCI-to-PCI Bridge spec r1.2, sec
3.2.5.18, doesn't say VGA Enable is optional, *but* sec 4.5 says
bridges need not support VGA.  I naively assume one would discover
that by finding VGA Enable to be RO zero.

Of course, in any case, I also assume that (a) there exist VGA cards
that require legacy VGA resources, e.g., memory 0xa0000-0xbffff, and
(b) such cards will not work behind bridges without VGA support.

I have no idea what if anything the VGA arbiter should do about
bridges like this or VGA devices behind them, but it does sound like
the arbiter might need to become smarter.

Bjorn
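
A sketch of the discovery probe Bjorn describes, writing VGA Enable and
reading it back to see whether the bit is read-only zero. This is
illustrative only (not from any posted patch), and the struct pci_dev
*bridge for the PCI-to-PCI bridge is an assumption:

	u16 saved, ctl;
	bool vga_en_writable;

	pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &saved);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
			      saved | PCI_BRIDGE_CTL_VGA);
	pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctl);
	vga_en_writable = !!(ctl & PCI_BRIDGE_CTL_VGA);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, saved); /* restore */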


Re: [PATCH 1/3] powerpc: simplify and fix VGA default device behaviour

2017-08-11 Thread Bjorn Helgaas
On Fri, Aug 04, 2017 at 08:20:31PM +1000, Daniel Axtens wrote:
> Some powerpc devices provide a PCI display that isn't picked up by
> the VGA arbiter, presumably because it doesn't support the PCI
> legacy VGA ranges.
> 
> Commit c2e1d84523ad ("powerpc: Set default VGA device") introduced
> an arch quirk to mark these devices as default to fix X autoconfig.
> 
> The commit message stated that the patch:
> 
> Ensures a default VGA is always set if a graphics adapter is present,
> even if firmware did not initialize it. If more than one graphics
> adapter is present, ensure the one initialized by firmware is set
> as the default VGA device.
> 
> The patch used the following test to decide whether or not to mark
> a device as default:
> 
>   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
>   if ((cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) || !vga_default_device())
>   vga_set_default_device(pdev);
> 
> This doesn't seem like it works quite as intended. Because of the
> logical OR, the default device will be set in 2 cases:
> 
>  1) if there is no default device
> OR
>  2) if this device has normal memory/IO decoding turned on
> 
> This will work as intended if there is only one device, but if
> there are multiple devices, we may override the device the VGA
> arbiter picked.

This quirk only runs on VGA class devices.  If there's more than one
VGA device in the system, and we assume that firmware only enables
PCI_COMMAND_IO or PCI_COMMAND_MEMORY on "the one initialized by
firmware", which seems reasonable to me, I think the existing code
does match the commit message.

We set the first VGA device we find to be the default.  Then, if we
find another VGA device that's enabled, we make *it* the default
instead.

> Instead, set a device as default if there is no default device AND
> this device decodes.
> 
> This will not change behaviour on single-headed systems.

If there is no enabled VGA device on the system, your new code means
there will be no default VGA device.

It's not clear from this changelog what problem this patch solves.
Maybe it's the "some displays not being picked up by the VGA arbiter"
you mentioned, but there's not enough detail to connect it with the
patch, especially since the patch means we'll set the default device
in fewer cases than we did before.

With the patch, we only set the default if we find an enabled VGA
device.  Previously we also set the default if we found a VGA device
that had not been enabled.

> Cc: Brian King 
> Signed-off-by: Daniel Axtens 
> 
> ---
> 
> Tested in TCG (the card provided by qemu doesn't automatically
> register with vgaarb, so the relevant code path has been tested)
> but I would appreciate any tests on real hardware.
> 
> Informal benh ack: https://patchwork.kernel.org/patch/9850235/
> ---
>  arch/powerpc/kernel/pci-common.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 341a7469cab8..c95fdda3a2dc 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -1746,8 +1746,11 @@ static void fixup_vga(struct pci_dev *pdev)
>  {
>   u16 cmd;
>  
> + if (vga_default_device())
> + return;
> +
>   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
> - if ((cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) || !vga_default_device())
> + if (cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))
>   vga_set_default_device(pdev);
>  
>  }
> -- 
> 2.11.0
> 


Re: [PATCH net-next] fsl/fman: implement several errata workarounds

2017-08-11 Thread David Miller
From: Florinel Iordache 
Date: Thu, 10 Aug 2017 16:47:04 +0300

> Implemented workarounds for the following dTSEC errata:
> A002, A004, A0012, A0014, A004839 on several operations
> that involve MAC CFG register changes: adjust link,
> rx pause frames, modify MAC address.
> 
> Signed-off-by: Florinel Iordache 

Applied, thanks.


Re: [PATCH v2 2/3] livepatch: send a fake signal to all blocking tasks

2017-08-11 Thread Josh Poimboeuf
On Thu, Aug 10, 2017 at 12:48:14PM +0200, Miroslav Benes wrote:
> Last, sending the fake signal is not automatic. It is done only when
> admin requests it by writing 1 to force sysfs attribute in livepatch
> sysfs directory.

'writing 1' -> 'writing "signal"'

(unless you take my suggestion to change to two separate sysfs files)

> @@ -468,7 +468,12 @@ static ssize_t force_store(struct kobject *kobj, struct kobj_attribute *attr,
>   return -EINVAL;
>   }
>  
> - return -EINVAL;
> + if (!memcmp("signal", buf, min(sizeof("signal")-1, count)))
> + klp_force_signals();

Any reason why you can't just do a strcmp()?
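
For reference, a strcmp-style alternative could use the kernel's
sysfs_streq() helper, which also tolerates the trailing newline that sysfs
writes usually carry. A sketch, not from the patch:

	if (sysfs_streq(buf, "signal"))
		klp_force_signals();
	else
		return -EINVAL;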

> +++ b/kernel/livepatch/transition.c
> @@ -577,3 +577,43 @@ void klp_copy_process(struct task_struct *child)
>  
>   /* TIF_PATCH_PENDING gets copied in setup_thread_stack() */
>  }
> +
> +/*
> + * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
> + * Kthreads with TIF_PATCH_PENDING set are woken up. Only admin can request 
> this
> + * action currently.
> + */
> +void klp_force_signals(void)
> +{
> + struct task_struct *g, *task;
> +
> + pr_notice("signalling remaining tasks\n");

As a native US speaker with possible OCD spelling tendencies, it bothers
me to see "signalling" with two l's instead of one.  According to
Google, the UK spelling of the word has two l's, so maybe it's not a
typo.  I'll forgive you if you don't fix it :-)

> +
> + read_lock(&tasklist_lock);
> + for_each_process_thread(g, task) {
> + if (!klp_patch_pending(task))
> + continue;
> +
> + /*
> +  * There is a small race here. We could see TIF_PATCH_PENDING
> +  * set and decide to wake up a kthread or send a fake signal.
> +  * Meanwhile the task could migrate itself and the action
> +  * would be meaningless. It is not serious though.
> +  */
> + if (task->flags & PF_KTHREAD) {
> + /*
> +  * Wake up a kthread which still has not been migrated.
> +  */
> + wake_up_process(task);
> + } else {
> + /*
> +  * Send fake signal to all non-kthread tasks which are
> +  * still not migrated.
> +  */
> + spin_lock_irq(&task->sighand->siglock);
> + signal_wake_up(task, 0);
> + spin_unlock_irq(&task->sighand->siglock);
> + }
> + }
> + read_unlock(&tasklist_lock);

I can't remember if we talked about this before, is it possible to also
signal/wake the idle tasks?

-- 
Josh


[bug report] powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range

2017-08-11 Thread Dan Carpenter
Hello Benjamin Herrenschmidt,

This is a semi-automatic email about new static checker warnings.

The patch 424de9c6e3f8: "powerpc/mm/radix: Avoid flushing the PWC on 
every flush_tlb_range" from Jul 19, 2017, leads to the following 
Smatch complaint:

arch/powerpc/mm/tlb-radix.c:368 radix__flush_tlb_collapsed_pmd()
 error: we previously assumed 'mm' could be null (see line 362)

arch/powerpc/mm/tlb-radix.c
   361  
   362  pid = mm ? mm->context.id : 0;
                   ^^
Check for NULL.

   363  if (unlikely(pid == MMU_NO_CONTEXT))
   364  goto no_context;
   365  
   366  /* 4k page size, just blow the world */
   367  if (PAGE_SIZE == 0x1000) {
   368  radix__flush_all_mm(mm);
                            ^^
Unchecked dereference.

   369  return;
   370  }

regards,
dan carpenter


[PATCH] powerpc/perf: double unlock bug in imc_common_cpuhp_mem_free()

2017-08-11 Thread Dan Carpenter
There is a typo so we call unlock instead of lock.

Fixes: 885dcd709ba9 ("powerpc/perf: Add nest IMC PMU support")
Signed-off-by: Dan Carpenter 
---
I also don't understand how the _imc_refc[node_id].lock works.  Why
can't we use ref->lock everywhere?  They seem equivalent, and my static
checker complains if we call the same lock different names.

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 46cd912af060..52017f6eafd9 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1124,7 +1124,7 @@ static void cleanup_all_thread_imc_memory(void)
 static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
 {
if (pmu_ptr->domain == IMC_DOMAIN_NEST) {
-   mutex_unlock(&nest_init_lock);
+   mutex_lock(&nest_init_lock);
if (nest_pmus == 1) {

cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

Hi Michal,

This suggestion won't work, because there are arches without memblock 
support: tile, sh...


So, I would still need to have:

#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs
in the nobootmem header file. In either case it would become messier than
what it is right now.


Pasha
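
For illustration, the stub arrangement mentioned above would look roughly
like this (a sketch; the header placement is an assumption):

#ifdef CONFIG_MEMBLOCK
void memblock_discard(void);
#else
static inline void memblock_discard(void) {}
#endif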


I have just one nit below
Acked-by: Michal Hocko 

[...]

diff --git a/mm/memblock.c b/mm/memblock.c
index 2cb25fe4452c..bf14aea6ab70 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -285,31 +285,27 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, unsigned long r)
 	}
 
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK

pull this ifdef inside memblock_discard and you do not have another
one in page_alloc_init_late

[...]

+/**
+ * Discard memory and reserved arrays if they were allocated
+ */
+void __init memblock_discard(void)
 {

here

-	if (memblock.memory.regions == memblock_memory_init_regions)
-		return 0;
+	phys_addr_t addr, size;
 
-	*addr = __pa(memblock.memory.regions);
+	if (memblock.reserved.regions != memblock_reserved_init_regions) {
+		addr = __pa(memblock.reserved.regions);
+		size = PAGE_ALIGN(sizeof(struct memblock_region) *
+				  memblock.reserved.max);
+		__memblock_free_late(addr, size);
+	}
 
-	return PAGE_ALIGN(sizeof(struct memblock_region) *
-			  memblock.memory.max);
+	if (memblock.memory.regions == memblock_memory_init_regions) {
+		addr = __pa(memblock.memory.regions);
+		size = PAGE_ALIGN(sizeof(struct memblock_region) *
+				  memblock.memory.max);
+		__memblock_free_late(addr, size);
+	}
 }
-
 #endif


[PATCH] drivers/macintosh: make wf_control_ops and wf_pid_param const

2017-08-11 Thread Bhumika Goyal
Make wf_control_ops const as they are only stored in the ops field of a
wf_control structure, which is const.
Make wf_pid_param const as they are only used during a copy operation.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
Cross compiled windfarm_smu_controls.o and windfarm_rm31.o for powerpc.

 drivers/macintosh/windfarm_cpufreq_clamp.c | 2 +-
 drivers/macintosh/windfarm_rm31.c  | 4 ++--
 drivers/macintosh/windfarm_smu_controls.c  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/macintosh/windfarm_cpufreq_clamp.c b/drivers/macintosh/windfarm_cpufreq_clamp.c
index 72d1fdf..2626990 100644
--- a/drivers/macintosh/windfarm_cpufreq_clamp.c
+++ b/drivers/macintosh/windfarm_cpufreq_clamp.c
@@ -63,7 +63,7 @@ static s32 clamp_max(struct wf_control *ct)
return 1;
 }
 
-static struct wf_control_ops clamp_ops = {
+static const struct wf_control_ops clamp_ops = {
.set_value  = clamp_set,
.get_value  = clamp_get,
.get_min= clamp_min,
diff --git a/drivers/macintosh/windfarm_rm31.c b/drivers/macintosh/windfarm_rm31.c
index bdfcb8a..a0cd9c7 100644
--- a/drivers/macintosh/windfarm_rm31.c
+++ b/drivers/macintosh/windfarm_rm31.c
@@ -338,7 +338,7 @@ static int cpu_setup_pid(int cpu)
 }
 
 /* Backside/U3 fan */
-static struct wf_pid_param backside_param = {
+static const struct wf_pid_param backside_param = {
.interval   = 1,
.history_len= 2,
.gd = 0x0050,
@@ -351,7 +351,7 @@ static int cpu_setup_pid(int cpu)
 };
 
 /* DIMMs temperature (clamp the backside fan) */
-static struct wf_pid_param dimms_param = {
+static const struct wf_pid_param dimms_param = {
.interval   = 1,
.history_len= 20,
.gd = 0,
diff --git a/drivers/macintosh/windfarm_smu_controls.c b/drivers/macintosh/windfarm_smu_controls.c
index c155a54..d174c74 100644
--- a/drivers/macintosh/windfarm_smu_controls.c
+++ b/drivers/macintosh/windfarm_smu_controls.c
@@ -145,7 +145,7 @@ static s32 smu_fan_max(struct wf_control *ct)
return fct->max;
 }
 
-static struct wf_control_ops smu_fan_ops = {
+static const struct wf_control_ops smu_fan_ops = {
.set_value  = smu_fan_set,
.get_value  = smu_fan_get,
.get_min= smu_fan_min,
-- 
1.9.1



[RFC v7 26/25] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-08-11 Thread Thiago Jung Bauermann
Expose useful information for programs using memory protection keys.
Provide implementation for powerpc and x86.

On a powerpc system with pkeys support, here is what is shown:

$ head /sys/kernel/mm/protection_keys/*
==> /sys/kernel/mm/protection_keys/disable_execute_supported <==
true

==> /sys/kernel/mm/protection_keys/total_keys <==
32

==> /sys/kernel/mm/protection_keys/usable_keys <==
30

And on an x86 without pkeys support:

$ head /sys/kernel/mm/protection_keys/*
==> /sys/kernel/mm/protection_keys/disable_execute_supported <==
false

==> /sys/kernel/mm/protection_keys/total_keys <==
1

==> /sys/kernel/mm/protection_keys/usable_keys <==
0

Signed-off-by: Thiago Jung Bauermann 
---

Ram asked me to add a sysfs interface for the memory protection keys
feature. Here it is.

If you have suggestions on what should be exposed, please let me know.

 arch/powerpc/include/asm/pkeys.h   |  2 ++
 arch/powerpc/mm/pkeys.c| 12 
 arch/x86/include/asm/mmu_context.h | 34 +++---
 arch/x86/include/asm/pkeys.h   |  1 +
 arch/x86/mm/pkeys.c|  5 
 mm/mprotect.c  | 58 ++
 6 files changed, 96 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index e61ed6c332db..bbc5a34cc6d6 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -215,6 +215,8 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
+unsigned int arch_usable_pkeys(void);
+
 static inline bool arch_pkeys_enabled(void)
 {
return pkey_inited;
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 1424c79f45f6..54efbb133049 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -272,3 +272,15 @@ bool arch_vma_access_permitted(struct vm_area_struct *vma,
 
return pkey_access_permitted(pkey, write, execute);
 }
+
+unsigned int arch_usable_pkeys(void)
+{
+   unsigned int reserved;
+
+   if (!pkey_inited)
+   return 0;
+
+   reserved = hweight32(initial_allocation_mask);
+
+   return (pkeys_total > reserved) ? pkeys_total - reserved : 0;
+}
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 68b329d77b3a..d2eabedd583a 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -105,13 +105,30 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 #endif
 }
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define PKEY_INITIAL_ALLOCATION_MAP	1
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+   unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+ VM_PKEY_BIT2 | VM_PKEY_BIT3;
+
+   return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+}
+#else
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+   return 0;
+}
+#endif
+
 static inline int init_new_context(struct task_struct *tsk,
   struct mm_struct *mm)
 {
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
/* pkey 0 is the default and always allocated */
-   mm->context.pkey_allocation_map = 0x1;
+   mm->context.pkey_allocation_map = PKEY_INITIAL_ALLOCATION_MAP;
/* -1 means unallocated or invalid */
mm->context.execute_only_pkey = -1;
}
@@ -205,21 +222,6 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
mpx_notify_unmap(mm, vma, start, end);
 }
 
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
-static inline int vma_pkey(struct vm_area_struct *vma)
-{
-   unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
- VM_PKEY_BIT2 | VM_PKEY_BIT3;
-
-   return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
-}
-#else
-static inline int vma_pkey(struct vm_area_struct *vma)
-{
-   return 0;
-}
-#endif
-
 static inline bool __pkru_allows_pkey(u16 pkey, bool write)
 {
u32 pkru = read_pkru();
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index fa8279972ddf..e1b25aa60530 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -105,5 +105,6 @@ extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val);
 extern void copy_init_pkru_to_fpregs(void);
+extern unsigned int arch_usable_pkeys(void);
 
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 2dab69a706ec..a3acca15ff83 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -123,6 +123,11 @@ 

[PATCH v2 14/14] powerpc/64s: idle ESL=0 stop can avoid MSR and save/restore overhead

2017-08-11 Thread Nicholas Piggin
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt).
So all the save/restore handling can be avoided completely. In
particular NV GPRs do not have to be saved, and MSR does not have
to be switched back to kernel MSR.

So move the test for "lite" sleep states out to power9_idle_stop.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/idle_book3s.S | 40 ++-
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index fc5145339277..0d8dd9823bd3 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -264,31 +264,8 @@ enter_winkle:
 /*
  * r3 - PSSCR value corresponding to the requested stop state.
  */
-power_enter_stop:
-/*
- * Check if we are executing the lite variant with ESL=EC=0
- */
-   andis.   r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
+power_enter_stop_esl:
clrldi   r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
-   bne  .Lhandle_esl_ec_set
-   PPC_STOP
-   li  r3,0  /* Since we didn't lose state, return 0 */
-
-   /*
-* pnv_wakeup_noloss() expects r12 to contain the SRR1 value so
-* it can determine if the wakeup reason is an HMI in
-* CHECK_HMI_INTERRUPT.
-*
-* However, when we wakeup with ESL=0, SRR1 will not contain the wakeup
-* reason, so there is no point setting r12 to SRR1.
-*
-* Further, we clear r12 here, so that we don't accidentally enter the
-* HMI in pnv_wakeup_noloss() if the value of r12[42:45] == WAKE_HMI.
-*/
-   li  r12, 0
-   b   pnv_wakeup_noloss
-
-.Lhandle_esl_ec_set:
/*
 * POWER9 DD2 can incorrectly set PMAO when waking up after a
 * state-loss idle. Saving and restoring MMCR0 over idle is a
@@ -361,9 +338,20 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);	\
  * r3 contains desired PSSCR register value.
  */
 _GLOBAL(power9_idle_stop)
-   std r3, PACA_REQ_PSSCR(r13)
+   /*
+* Check if we are executing the lite variant with ESL=EC=0
+* This case resumes execution after the stop instruction without
+* losing any state, so nothing has to be saved.
+*/
mtspr   SPRN_PSSCR,r3
-   LOAD_REG_ADDR(r4,power_enter_stop)
+   andis.  r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
+   bne 1f
+   PPC_STOP
+   li  r3,0  /* Since we didn't lose state, return 0 */
+   blr
+1: /* state-loss idle */
+   std r3, PACA_REQ_PSSCR(r13)
+   LOAD_REG_ADDR(r4,power_enter_stop_esl)
b   pnv_powersave_common
/* No return */
 
-- 
2.13.3



[PATCH v2 13/14] powerpc/64s: idle POWER9 can execute stop in virtual mode

2017-08-11 Thread Nicholas Piggin
The hardware can execute stop in any context, and KVM does not
require real mode because siblings do not share MMU state. This
saves a switch to real-mode when going idle.

Acked-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/idle_book3s.S | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 3b701b1a5e87..fc5145339277 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -141,7 +141,16 @@ pnv_powersave_common:
std r5,_CCR(r1)
std r1,PACAR1(r13)
 
+BEGIN_FTR_SECTION
+   /*
+* POWER9 does not require real mode to stop, and does not set
+* hwthread_state for KVM (threads don't share MMU context), so
+* we can remain in virtual mode for this.
+*/
+   bctr
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
/*
+* POWER8
 * Go to real mode to do the nap, as required by the architecture.
 * Also, we need to be in real mode before setting hwthread_state,
 * because as soon as we do that, another thread can switch
-- 
2.13.3



[PATCH v2 12/14] KVM: PPC: Book3S HV: POWER9 can execute stop without a sync sequence

2017-08-11 Thread Nicholas Piggin
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 3e024fd71fe8..edb47738a686 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2527,7 +2527,17 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
 kvm_nap_sequence:  /* desired LPCR value in r5 */
-BEGIN_FTR_SECTION
+BEGIN_FTR_SECTION  /* nap sequence */
+   mtspr   SPRN_LPCR,r5
+   isync
+   li  r0, 0
+   std r0, HSTATE_SCRATCH0(r13)
+   ptesync
+   ld  r0, HSTATE_SCRATCH0(r13)
+1:	cmpd	r0, r0
+   bne 1b
+   nap
+FTR_SECTION_ELSE   /* stop sequence */
/*
 * PSSCR bits:  exit criterion = 1 (wakeup based on LPCR at sreset)
 *  enable state loss = 1 (allow SMT mode switch)
@@ -2539,18 +2549,8 @@ BEGIN_FTR_SECTION
li  r4, LPCR_PECE_HVEE@higher
	sldi	r4, r4, 32
or  r5, r5, r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
mtspr   SPRN_LPCR,r5
-   isync
-   li  r0, 0
-   std r0, HSTATE_SCRATCH0(r13)
-   ptesync
-   ld  r0, HSTATE_SCRATCH0(r13)
-1:	cmpd	r0, r0
-   bne 1b
-BEGIN_FTR_SECTION
-   nap
-FTR_SECTION_ELSE
+
PPC_STOP
 ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
b   .
-- 
2.13.3



[PATCH v2 11/14] powerpc/64s: idle POWER9 can execute stop without a sync sequence

2017-08-11 Thread Nicholas Piggin
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cpuidle.h | 16 
 arch/powerpc/kernel/idle_book3s.S  | 26 --
 2 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 52586f9956bb..6853a3741338 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -90,20 +90,4 @@ static inline void report_invalid_psscr_val(u64 psscr_val, int err)
 
 #endif
 
-/* Idle state entry routines */
-#ifdef CONFIG_PPC_P7_NAP
-#define IDLE_STATE_ENTER_SEQ(IDLE_INST) \
-   /* Magic NAP/SLEEP/WINKLE mode enter sequence */\
-   std r0,0(r1);   \
-   ptesync;\
-   ld  r0,0(r1);   \
-236:	cmpd	cr0,r0,r0;	\
-   bne 236b;   \
-   IDLE_INST;  \
-
-#defineIDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)   \
-   IDLE_STATE_ENTER_SEQ(IDLE_INST) \
-   b   .
-#endif /* CONFIG_PPC_P7_NAP */
-
 #endif
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 9a9a28f0758d..3b701b1a5e87 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -151,6 +151,19 @@ pnv_powersave_common:
mtmsrd  r7,0
bctr
 
+/*
+ * This is the sequence required to execute idle instructions, as
+ * specified in ISA v2.07. MSR[IR] and MSR[DR] must be 0.
+ */
+#define ARCH207_IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)  \
+   /* Magic NAP/SLEEP/WINKLE mode enter sequence */\
+   std r0,0(r1);   \
+   ptesync;\
+   ld  r0,0(r1);   \
+236:	cmpd	cr0,r0,r0;	\
+   bne 236b;   \
+   IDLE_INST;
+
.globl pnv_enter_arch207_idle_mode
 pnv_enter_arch207_idle_mode:
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -176,7 +189,7 @@ pnv_enter_arch207_idle_mode:
stb r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi   cr3,r3,PNV_THREAD_SLEEP
bge cr3,2f
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
+   ARCH207_IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
/* No return */
 2:
/* Sleep or winkle */
@@ -215,7 +228,7 @@ pnv_fastsleep_workaround_at_entry:
 
 common_enter: /* common code for all the threads entering sleep or winkle */
bgt cr3,enter_winkle
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
+   ARCH207_IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 
 fastsleep_workaround_at_entry:
	oris	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
@@ -237,7 +250,7 @@ fastsleep_workaround_at_entry:
 enter_winkle:
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
+   ARCH207_IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 
 /*
  * r3 - PSSCR value corresponding to the requested stop state.
@@ -249,7 +262,7 @@ power_enter_stop:
andis.   r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
clrldi   r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
bne  .Lhandle_esl_ec_set
-   IDLE_STATE_ENTER_SEQ(PPC_STOP)
+   PPC_STOP
li  r3,0  /* Since we didn't lose state, return 0 */
 
/*
@@ -282,7 +295,8 @@ power_enter_stop:
ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
cmpdr3,r4
bge .Lhandle_deep_stop
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
+   PPC_STOP/* Does not return (system reset interrupt) */
+
 .Lhandle_deep_stop:
 /*
  * Entering deep idle state.
@@ -304,7 +318,7 @@ lwarx_loop_stop:
 
bl  save_sprs_to_stack
 
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_STOP)
+   PPC_STOP/* Does not return (system reset interrupt) */
 
 /*
  * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
-- 
2.13.3



[PATCH v2 10/14] KVM: PPC: Book3S HV: POWER9 does not require secondary thread management

2017-08-11 Thread Nicholas Piggin
POWER9 CPUs have independent MMU contexts per thread, so KVM does not
need to quiesce secondary threads, so the hwthread_req/hwthread_state
protocol does not have to be used. So patch it away on POWER9, and patch
away the branch from the Linux idle wakeup to kvm_start_guest that is
never used.

Add a warning and error out of kvmppc_grab_hwthread in case it is ever
called on POWER9.

This avoids a hwsync in the idle wakeup path on POWER9.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |  4 
 arch/powerpc/kernel/idle_book3s.S | 30 +++---
 arch/powerpc/kvm/book3s_hv.c  | 14 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  8 
 4 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 7cea76f11c26..83596f32f50b 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -104,6 +104,10 @@ struct kvmppc_host_state {
u8 napping;
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   /*
+* hwthread_req/hwthread_state pair is used to pull sibling threads
+* out of guest on pre-ISAv3.0B CPUs where threads share MMU.
+*/
u8 hwthread_req;
u8 hwthread_state;
u8 host_ipi;
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index e6252c5a57a4..9a9a28f0758d 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -243,12 +243,6 @@ enter_winkle:
  * r3 - PSSCR value corresponding to the requested stop state.
  */
 power_enter_stop:
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   /* Tell KVM we're entering idle */
-   li  r4,KVM_HWTHREAD_IN_IDLE
-   /* DO THIS IN REAL MODE!  See comment above. */
-   stb r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
 /*
  * Check if we are executing the lite variant with ESL=EC=0
  */
@@ -411,6 +405,18 @@ pnv_powersave_wakeup_mce:
 
b   pnv_powersave_wakeup
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+kvm_start_guest_check:
+   li  r0,KVM_HWTHREAD_IN_KERNEL
+   stb r0,HSTATE_HWTHREAD_STATE(r13)
+   /* Order setting hwthread_state vs. testing hwthread_req */
+   sync
+   lbz r0,HSTATE_HWTHREAD_REQ(r13)
+   cmpwi   r0,0
+   beqlr
+   b   kvm_start_guest
+#endif
+
 /*
  * Called from reset vector for powersave wakeups.
  * cr3 - set to gt if waking up with partial/complete hypervisor state loss
@@ -435,15 +441,9 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
mr  r3,r12
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   li  r0,KVM_HWTHREAD_IN_KERNEL
-   stb r0,HSTATE_HWTHREAD_STATE(r13)
-   /* Order setting hwthread_state vs. testing hwthread_req */
-   sync
-   lbz r0,HSTATE_HWTHREAD_REQ(r13)
-   cmpwi   r0,0
-   beq 1f
-   b   kvm_start_guest
-1:
+BEGIN_FTR_SECTION
+   bl  kvm_start_guest_check
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 #endif
 
/* Return SRR1 from power7_nap() */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 359c79cdf0cc..e34cd6fb947b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2111,6 +2111,16 @@ static int kvmppc_grab_hwthread(int cpu)
struct paca_struct *tpaca;
long timeout = 1;
 
+   /*
+* ISA v3.0 idle routines do not set hwthread_state or test
+* hwthread_req, so they can not grab idle threads.
+*/
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   WARN_ON(1);
+   pr_err("KVM: can not control sibling threads\n");
+   return -EBUSY;
+   }
+
	tpaca = &paca[cpu];
 
/* Ensure the thread won't go into the kernel if it wakes */
@@ -2145,10 +2155,12 @@ static void kvmppc_release_hwthread(int cpu)
struct paca_struct *tpaca;
 
	tpaca = &paca[cpu];
-   tpaca->kvm_hstate.hwthread_req = 0;
tpaca->kvm_hstate.kvm_vcpu = NULL;
tpaca->kvm_hstate.kvm_vcore = NULL;
tpaca->kvm_hstate.kvm_split_mode = NULL;
+   if (!cpu_has_feature(CPU_FTR_ARCH_300))
+   tpaca->kvm_hstate.hwthread_req = 0;
+
 }
 
 static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c52184a8efdf..3e024fd71fe8 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -149,9 +149,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
subfr4, r4, r3
mtspr   SPRN_DEC, r4
 
+BEGIN_FTR_SECTION
/* hwthread_req may have got set by cede or no vcpu, so clear it */
li  r0, 0
stb r0, HSTATE_HWTHREAD_REQ(r13)
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 
/*
 * For external interrupts 

[PATCH v2 09/14] powerpc/64: runlatch CTRL[RUN] set optimisation

2017-08-11 Thread Nicholas Piggin
The CTRL register is read-only except bit 63 which is the run latch
control. This means it can be updated with a mtspr rather than
mfspr/mtspr.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/process.c | 35 +++
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9f3e2c932dcc..75306b6e1812 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1994,11 +1994,25 @@ void show_stack(struct task_struct *tsk, unsigned long *stack)
 void notrace __ppc64_runlatch_on(void)
 {
struct thread_info *ti = current_thread_info();
-   unsigned long ctrl;
 
-   ctrl = mfspr(SPRN_CTRLF);
-   ctrl |= CTRL_RUNLATCH;
-   mtspr(SPRN_CTRLT, ctrl);
+   if (cpu_has_feature(CPU_FTR_ARCH_206)) {
+   /*
+* Least significant bit (RUN) is the only writable bit of
+* the CTRL register, so we can avoid mfspr. 2.06 is not the
+* earliest ISA where this is the case, but it's convenient.
+*/
+   mtspr(SPRN_CTRLT, CTRL_RUNLATCH);
+   } else {
+   unsigned long ctrl;
+
+   /*
+* Some architectures (e.g., Cell) have writable fields other
+* than RUN, so do the read-modify-write.
+*/
+   ctrl = mfspr(SPRN_CTRLF);
+   ctrl |= CTRL_RUNLATCH;
+   mtspr(SPRN_CTRLT, ctrl);
+   }
 
ti->local_flags |= _TLF_RUNLATCH;
 }
@@ -2007,13 +2021,18 @@ void notrace __ppc64_runlatch_on(void)
 void notrace __ppc64_runlatch_off(void)
 {
struct thread_info *ti = current_thread_info();
-   unsigned long ctrl;
 
ti->local_flags &= ~_TLF_RUNLATCH;
 
-   ctrl = mfspr(SPRN_CTRLF);
-   ctrl &= ~CTRL_RUNLATCH;
-   mtspr(SPRN_CTRLT, ctrl);
+   if (cpu_has_feature(CPU_FTR_ARCH_206)) {
+   mtspr(SPRN_CTRLT, 0);
+   } else {
+   unsigned long ctrl;
+
+   ctrl = mfspr(SPRN_CTRLF);
+   ctrl &= ~CTRL_RUNLATCH;
+   mtspr(SPRN_CTRLT, ctrl);
+   }
 }
 #endif /* CONFIG_PPC64 */
 
-- 
2.13.3



[PATCH v2 08/14] powerpc/64s: irq replay remove spurious irq reason

2017-08-11 Thread Nicholas Piggin
HVI interrupts have always used 0x500, so remove the dead branch.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 29253cecf713..566cf126c13b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1680,8 +1680,6 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | 
CPU_FTR_ARCH_300)
 BEGIN_FTR_SECTION
cmpwi   r3,0xa00
beq h_doorbell_common_msgclr
-   cmpwi   r3,0xea0
-   beq h_virt_irq_common
cmpwi   r3,0xe60
beq hmi_exception_common
 FTR_SECTION_ELSE
-- 
2.13.3



[PATCH v2 07/14] powerpc/64: remove redundant instruction in interrupt replay

2017-08-11 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index ec67f67dafab..3f2666d24a7e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -995,7 +995,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
bne 1f
	addi	r3,r1,STACK_FRAME_OVERHEAD;
bl  doorbell_exception
-   b   ret_from_except
 #endif /* CONFIG_PPC_DOORBELL */
 1: b   ret_from_except /* What else to do here ? */
  
-- 
2.13.3



[PATCH v2 06/14] powerpc/64s: irq replay external use the HV handler in HV mode on POWER9

2017-08-11 Thread Nicholas Piggin
POWER9 host external interrupts use the h_virt_irq_common handler, so
use that to replay them rather than using the hardware_interrupt_common
handler. Both call do_IRQ, but using the correct handler reduces i-cache
footprint.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f9d0796fb2c9..29253cecf713 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1672,7 +1672,11 @@ _GLOBAL(__replay_interrupt)
cmpwi   r3,0x900
beq decrementer_common
cmpwi   r3,0x500
+BEGIN_FTR_SECTION
+   beq h_virt_irq_common
+FTR_SECTION_ELSE
beq hardware_interrupt_common
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_300)
 BEGIN_FTR_SECTION
cmpwi   r3,0xa00
beq h_doorbell_common_msgclr
-- 
2.13.3



[PATCH v2 05/14] powerpc/64s: irq replay merge HV and non-HV paths for doorbell replay

2017-08-11 Thread Nicholas Piggin
This results in smaller code, and fewer branches.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S   | 6 +-
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 arch/powerpc/kernel/irq.c| 2 --
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 49d8422767b4..ec67f67dafab 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -990,11 +990,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 #ifdef CONFIG_PPC_BOOK3E
cmpwi   cr0,r3,0x280
 #else
-   BEGIN_FTR_SECTION
-   cmpwi   cr0,r3,0xe80
-   FTR_SECTION_ELSE
-   cmpwi   cr0,r3,0xa00
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
+   cmpwi   cr0,r3,0xa00
 #endif /* CONFIG_PPC_BOOK3E */
bne 1f
	addi	r3,r1,STACK_FRAME_OVERHEAD;
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 67321be3122c..f9d0796fb2c9 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1674,7 +1674,7 @@ _GLOBAL(__replay_interrupt)
cmpwi   r3,0x500
beq hardware_interrupt_common
 BEGIN_FTR_SECTION
-   cmpwi   r3,0xe80
+   cmpwi   r3,0xa00
beq h_doorbell_common_msgclr
cmpwi   r3,0xea0
beq h_virt_irq_common
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 7c46e0cce054..60ee6d7251b8 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -207,8 +207,6 @@ notrace unsigned int __check_irq_replay(void)
 #else
if (happened & PACA_IRQ_DBELL) {
local_paca->irq_happened &= ~PACA_IRQ_DBELL;
-   if (cpu_has_feature(CPU_FTR_HVMODE))
-   return 0xe80;
return 0xa00;
}
 #endif /* CONFIG_PPC_BOOK3E */
-- 
2.13.3



[PATCH v2 04/14] powerpc/64: cleanup __check_irq_replay

2017-08-11 Thread Nicholas Piggin
Move the clearing of irq_happened bits into the condition where
they were found to be set. This reduces instruction count slightly,
and reduces stores into irq_happened.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 45 +++--
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index f291f7826abc..7c46e0cce054 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -143,9 +143,10 @@ notrace unsigned int __check_irq_replay(void)
 */
unsigned char happened = local_paca->irq_happened;
 
-   /* Clear bit 0 which we wouldn't clear otherwise */
-   local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
if (happened & PACA_IRQ_HARD_DIS) {
+   /* Clear bit 0 which we wouldn't clear otherwise */
+   local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
+
/*
 * We may have missed a decrementer interrupt if hard disabled.
 * Check the decrementer register in case we had a rollover
@@ -173,39 +174,39 @@ notrace unsigned int __check_irq_replay(void)
 * This is a higher priority interrupt than the others, so
 * replay it first.
 */
-   local_paca->irq_happened &= ~PACA_IRQ_HMI;
-   if (happened & PACA_IRQ_HMI)
+   if (happened & PACA_IRQ_HMI) {
+   local_paca->irq_happened &= ~PACA_IRQ_HMI;
return 0xe60;
+   }
 
-   /*
-* We may have missed a decrementer interrupt. We check the
-* decrementer itself rather than the paca irq_happened field
-* in case we also had a rollover while hard disabled
-*/
-   local_paca->irq_happened &= ~PACA_IRQ_DEC;
-   if (happened & PACA_IRQ_DEC)
+   if (happened & PACA_IRQ_DEC) {
+   local_paca->irq_happened &= ~PACA_IRQ_DEC;
return 0x900;
+   }
 
-   /* Finally check if an external interrupt happened */
-   local_paca->irq_happened &= ~PACA_IRQ_EE;
-   if (happened & PACA_IRQ_EE)
+   if (happened & PACA_IRQ_EE) {
+   local_paca->irq_happened &= ~PACA_IRQ_EE;
return 0x500;
+   }
 
 #ifdef CONFIG_PPC_BOOK3E
-   /* Finally check if an EPR external interrupt happened
-* this bit is typically set if we need to handle another
-* "edge" interrupt from within the MPIC "EPR" handler
+   /*
+* Check if an EPR external interrupt happened this bit is typically
+* set if we need to handle another "edge" interrupt from within the
+* MPIC "EPR" handler.
 */
-   local_paca->irq_happened &= ~PACA_IRQ_EE_EDGE;
-   if (happened & PACA_IRQ_EE_EDGE)
+   if (happened & PACA_IRQ_EE_EDGE) {
+   local_paca->irq_happened &= ~PACA_IRQ_EE_EDGE;
return 0x500;
+   }
 
-   local_paca->irq_happened &= ~PACA_IRQ_DBELL;
-   if (happened & PACA_IRQ_DBELL)
+   if (happened & PACA_IRQ_DBELL) {
+   local_paca->irq_happened &= ~PACA_IRQ_DBELL;
return 0x280;
+   }
 #else
-   local_paca->irq_happened &= ~PACA_IRQ_DBELL;
if (happened & PACA_IRQ_DBELL) {
+   local_paca->irq_happened &= ~PACA_IRQ_DBELL;
if (cpu_has_feature(CPU_FTR_HVMODE))
return 0xe80;
return 0xa00;
-- 
2.13.3



[PATCH v2 03/14] powerpc/64s: masked interrupt returns to kernel so avoid r13 restore

2017-08-11 Thread Nicholas Piggin
Places in the kernel where r13 is not the PACA pointer must have
maskable interrupts disabled, so r13 does not have to be restored
when returning from a soft-masked interrupt.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index c4f50a9e2ab5..67321be3122c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1379,7 +1379,7 @@ masked_##_H##interrupt:					\
ld  r9,PACA_EXGEN+EX_R9(r13);   \
ld  r10,PACA_EXGEN+EX_R10(r13); \
ld  r11,PACA_EXGEN+EX_R11(r13); \
-   GET_SCRATCH0(r13);  \
+   /* returns to kernel where r13 must be set up, so don't restore it */ \
##_H##rfid; \
b   .;  \
MASKED_DEC_HANDLER(_H)
-- 
2.13.3



[PATCH v2 02/14] powerpc/64s: masked interrupt avoid instruction

2017-08-11 Thread Nicholas Piggin
EE is always enabled in SRR1 for masked interrupts, so clearing
it can use xor.
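
A worked illustration of the trick (not in the original mail): the EE bit in
the saved SRR1 is known to be 1 on this path, and flipping a set bit clears
it, so a single xori replaces the rldicl/rotldi pair. In C terms:

	static inline unsigned long srr1_clear_ee(unsigned long srr1)
	{
		return srr1 ^ MSR_EE;	/* EE == 1 here, and 1 ^ 1 == 0 */
	}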

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f8ad3f0eb383..c4f50a9e2ab5 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1373,8 +1373,7 @@ masked_##_H##interrupt:					\
 1: andi.   r10,r10,(PACA_IRQ_DBELL|PACA_IRQ_HMI);  \
bne 2f; \
mfspr   r10,SPRN_##_H##SRR1;\
-   rldicl  r10,r10,48,1; /* clear MSR_EE */\
-   rotldi  r10,r10,16; \
+	xori	r10,r10,MSR_EE; /* clear MSR_EE */	\
mtspr   SPRN_##_H##SRR1,r10;\
 2: mtcrf   0x80,r9;\
ld  r9,PACA_EXGEN+EX_R9(r13);   \
-- 
2.13.3



[PATCH v2 01/14] powerpc/64s: masked interrupt avoid branch

2017-08-11 Thread Nicholas Piggin
Interrupts which do not require EE to be cleared can all
be tested with a single bitwise test.
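
An illustration of why one mask test is enough (helper name hypothetical):
each PACA_IRQ_* flag is a distinct single bit and r10 holds exactly one
reason, so the two equality compares collapse into one andi.:

	static inline bool irq_reason_skips_ee_clear(unsigned long reason)
	{
		/* replaces: reason == PACA_IRQ_DBELL || reason == PACA_IRQ_HMI */
		return reason & (PACA_IRQ_DBELL | PACA_IRQ_HMI);
	}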

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f14f3c04ec7e..f8ad3f0eb383 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1370,10 +1370,8 @@ masked_##_H##interrupt:					\
	ori	r10,r10,0xffff;	\
mtspr   SPRN_DEC,r10;   \
b   MASKED_DEC_HANDLER_LABEL;   \
-1: cmpwi   r10,PACA_IRQ_DBELL; \
-   beq 2f; \
-   cmpwi   r10,PACA_IRQ_HMI;   \
-   beq 2f; \
+1: andi.   r10,r10,(PACA_IRQ_DBELL|PACA_IRQ_HMI);  \
+   bne 2f; \
mfspr   r10,SPRN_##_H##SRR1;\
rldicl  r10,r10,48,1; /* clear MSR_EE */\
rotldi  r10,r10,16; \
-- 
2.13.3



[PATCH v2 00/14] idle and soft-irq improvements and POWER9 idle optimisation

2017-08-11 Thread Nicholas Piggin
Since last time:

- Split out KVM parts. Technically they don't actually have
  dependencies with the Linux patches I suppose, so they could be
  merged via different trees. Logically I think they are better to
  stay together.

- Fix and simplify the KVM secondary thread management patch, thanks
  Gautham.

- Fixed a bug in the ESL=0 avoid overhead patch.

- Retested on P8 and P9 with qemu.

- Mambo simulator still doesn't handle EC=0 wakeups properly, but I
  have reported that, and managed to write a patch here to fix the
  simulator bug and test there too.

Nicholas Piggin (14):
  powerpc/64s: masked interrupt avoid branch
  powerpc/64s: masked interrupt avoid instruction
  powerpc/64s: masked interrupt returns to kernel so avoid r13 restore
  powerpc/64: cleanup __check_irq_replay
  powerpc/64s: irq replay merge HV and non-HV paths for doorbell replay
  powerpc/64s: irq replay external use the HV handler in HV mode on
POWER9
  powerpc/64: remove redundant instruction in interrupt replay
  powerpc/64s: irq replay remove spurious irq reason
  powerpc/64: runlatch CTRL[RUN] set optimisation
  KVM: PPC: Book3S HV: POWER9 does not require secondary thread
management
  powerpc/64s: idle POWER9 can execute stop without a sync sequence
  KVM: PPC: Book3S HV: POWER9 can execute stop without a sync sequence
  powerpc/64s: idle POWER9 can execute stop in virtual mode
  powerpc/64s: idle ESL=0 stop can avoid MSR and save/restore overhead

 arch/powerpc/include/asm/cpuidle.h|  16 -
 arch/powerpc/include/asm/kvm_book3s_asm.h |   4 ++
 arch/powerpc/kernel/entry_64.S|   7 +-
 arch/powerpc/kernel/exceptions-64s.S  |  19 +++---
 arch/powerpc/kernel/idle_book3s.S | 103 +-
 arch/powerpc/kernel/irq.c |  47 +++---
 arch/powerpc/kernel/process.c |  35 +++---
 arch/powerpc/kvm/book3s_hv.c  |  14 +++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  32 ++
 9 files changed, 154 insertions(+), 123 deletions(-)

-- 
2.13.3



Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-11 Thread Reza Arbab

On Fri, Aug 11, 2017 at 02:07:51PM +0530, Aneesh Kumar K.V wrote:

Reza Arbab  writes:


On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f830562..24ecf53 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
size = 0x80000000ul - base;
}
memblock_add(base, size);
+   memblock_mark_hotplug(base, size);
} while (--rngs);
}
memblock_dump_all();


Doing this has the effect of putting all the affected memory into
ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no
kernel allocations can occur there. Is that okay?



So the thinking here is that any memory identified via ibm,dynamic-memory
can be hot removed later. Hence the need to add it in LMB-sized chunks,
because our hotplug framework removes it in LMB-sized chunks. If we want
to support hotunplug, then we will have to make sure kernel allocation
doesn't happen in that region, right?


Yes, the net result is that this memory can now be hotremoved. I just 
wanted to point out that the patch doesn't only change the granularity 
of addition--it also causes the memory to end up in a different zone 
(when using movable_node).



With the above, I would consider not marking it hotplug to have been a
bug before?


Sure, that's reasonable.

--
Reza Arbab



Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Pasha Tatashin

Sure, I could do this, but as I understood from earlier Dave Miller's
comments, we should do one logical change at a time. Hence, introduce API in
one patch use it in another. So, this is how I tried to organize this patch
set. Is this assumption incorrect?


Well, it really depends. If the patch is really small then adding a new
API along with users is easier to review and backport because you have a
clear view of the usage. I believe this is the case here. But if others
feel otherwise I will not object.


I will merge them.

Thank you,
Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

I will address your comment, and send out a new patch. Should I send it out
separately from the series or should I keep it inside?


I would post it separatelly. It doesn't depend on the rest.


OK, I will post it separately. No, it does not depend on the rest, but
the rest depends on this. So, I am not sure how to enforce that this
comes before the rest.





Also, before I send out a new patch, I will need to root cause and resolve
the problem found by the kernel test robot, and bisected
down to this patch.

[  156.659400] BUG: Bad page state in process swapper  pfn:03147
[  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping:
(null) index:0x1
[  156.660917] flags: 0x0()
[  156.661198] raw:   0001
ff80
[  156.662006] raw: 88001f4a8120 88001ed85ce0 

[  156.662811] page dumped because: nonzero mapcount
[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
4.13.0-rc3-00220-g1aad694 #1
[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.9.3-20161025_171302-gandalf 04/01/2014
[  156.665129] Call Trace:
[  156.665422]  dump_stack+0x1e/0x20
[  156.665802]  bad_page+0x122/0x148


Was the report related to this patch?


Yes, they said that the problem was bisected down to this patch. Do you 
know if there is a way to submit a patch to this test robot?


Thank you,
Pasha


Re: [v6 15/15] mm: debug for raw alloctor

2017-08-11 Thread Pasha Tatashin

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
places expect zeroed memory.
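
A sketch of what that ones-fill can look like at the tail of the raw
allocator (the patch body is not quoted in this digest, so the placement
and the memblock_virt_alloc_internal() call below are illustrative):

	ptr = memblock_virt_alloc_internal(size, align, min_addr, max_addr, nid);
#ifdef CONFIG_DEBUG_VM
	if (ptr && size > 0)
		memset(ptr, 0xff, size);  /* poison: catch callers that expect zeroed memory */
#endif
	return ptr;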


Please fold this into the patch which introduces
memblock_virt_alloc_try_nid_raw.


OK

I am not sure CONFIG_DEBUG_VM is the best config because that tends to
be enabled quite often. Maybe CONFIG_MEMBLOCK_DEBUG? Or even make it a
kernel command line parameter?



Initially, I did not want to make it CONFIG_MEMBLOCK_DEBUG because we 
really benefit from this debugging code when VM debug is enabled, and 
especially struct page debugging asserts which also depend on 
CONFIG_DEBUG_VM.


However, now thinking about it, I will change it to 
CONFIG_MEMBLOCK_DEBUG, and let users decide what other debugging configs 
need to be enabled, as this is also OK.


Thank you,
Pasha


Re: [v6 14/15] mm: optimize early system hash allocations

2017-08-11 Thread Pasha Tatashin

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


OK, but as mentioned in the previous patch add memblock_virt_alloc_raw
in this patch.

Acked-by: Michal Hocko 


Ok I will merge them.

Thank you,
Pasha
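
For context, the call-site shape of the HASH_ZERO distinction, following
the fs/dcache.c user; this is a sketch rather than a quote from this series:

	dentry_hashtable =
		alloc_large_system_hash("Dentry cache",
					sizeof(struct hlist_bl_head),
					dhash_entries,
					13,
					HASH_ZERO,	/* caller wants zeroed buckets */
					&d_hash_shift,
					&d_hash_mask,
					0,
					0);

Callers that fully initialize every bucket themselves omit HASH_ZERO and,
with this series, get the non-zeroing memblock_virt_alloc_raw() path.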


Re: [v6 13/15] mm: stop zeroing memory during allocation in vmemmap

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 09:04 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:47, Pavel Tatashin wrote:

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.


First of all this should probably be merged with the previous patch. Then
I think vmemmap_alloc_block would be better to split up into
__vmemmap_alloc_block, which doesn't zero, and vmemmap_alloc_block, which
does zero; this would reduce the memset callsites and it would make for a
slightly more robust interface.


Ok, I will add: vmemmap_alloc_block_zero() call, and merge this and the 
previous patches together.
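
A sketch of the split being discussed, assuming the existing
vmemmap_alloc_block(unsigned long size, int node) signature; the helper
name is whatever the reposted patch settles on (Michal suggested
__vmemmap_alloc_block, Pasha proposes vmemmap_alloc_block_zero):

	/* non-zeroing allocation; callers that initialize in parallel use this */
	void * __meminit vmemmap_alloc_block(unsigned long size, int node);

	/* zeroing wrapper keeps the memset at a single call site */
	static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
	{
		void *p = vmemmap_alloc_block(size, node);

		if (p)
			memset(p, 0, size);
		return p;
	}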


Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Michal Hocko
On Fri 11-08-17 11:58:46, Pasha Tatashin wrote:
> On 08/11/2017 08:39 AM, Michal Hocko wrote:
> >On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:
> >>A new variant of memblock_virt_alloc_* allocations:
> >>memblock_virt_alloc_try_nid_raw()
> >> - Does not zero the allocated memory
> >> - Does not panic if request cannot be satisfied
> >
> >OK, this looks good but I would not introduce memblock_virt_alloc_raw
> >here because we do not have any users. Please move that to "mm: optimize
> >early system hash allocations" which actually uses the API. It would be
> >easier to review it that way.
> >
> >>Signed-off-by: Pavel Tatashin 
> >>Reviewed-by: Steven Sistare 
> >>Reviewed-by: Daniel Jordan 
> >>Reviewed-by: Bob Picco 
> >
> >other than that
> >Acked-by: Michal Hocko 
> 
> Sure, I could do this, but as I understood from earlier Dave Miller's
> comments, we should do one logical change at a time. Hence, introduce API in
> one patch use it in another. So, this is how I tried to organize this patch
> set. Is this assumption incorrect?

Well, it really depends. If the patch is really small then adding a new
API along with users is easier to review and backport because you have a
clear view of the usage. I believe this is the case here. But if others
feel otherwise I will not object.

-- 
Michal Hocko
SUSE Labs


Re: [v6 09/15] sparc64: optimized struct page zeroing

2017-08-11 Thread Pasha Tatashin

Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, based on the size of
struct page. The compiler optimizes out the conditions of the switch() statement.


Again, this doesn't explain why we need this. You have mentioned those
reasons in some previous emails but be explicit here please.



I will add performance data to this patch as well.

Thank you,
Pasha


Re: [v6 08/15] mm: zero struct pages during initialization

2017-08-11 Thread Pasha Tatashin

I believe this deserves much more detailed explanation why this is safe.
What actually prevents any pfn walker from seeing an uninitialized
struct page? Please make your assumptions explicit in the commit log so
that we can check them independently.


There is nothing that prevents pfn walkers from walking over any struct 
pages, deferred and non-deferred. However, during boot, before deferred 
pages are initialized, we have just a few places that do that, and all of 
those cases are fixed in this patchset.
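
To make that concrete, the guard an early pfn walker needs while
deferred pages are still uninitialized looks roughly like this (my
sketch, mirroring the existing early_pfn checks, not code from the
series):

	static inline bool __init early_page_initialized(unsigned long pfn, int nid)
	{
	#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
		/* struct pages at or above first_deferred_pfn are not valid
		 * until deferred_init_memmap() has run for this node.
		 */
		if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
			return false;
	#endif
		return true;
	}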



Also, this is done with some purpose, which is performance, right? You
have mentioned that in the cover letter, but if somebody is going to read
through git logs this wouldn't be obvious from the specific commit.
So add that information here as well. Especially numbers will be
interesting.


I will add more performance data to this patch comment.


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Michal Hocko
On Fri 11-08-17 11:49:15, Pasha Tatashin wrote:
> >I guess this goes all the way down to
> >Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
> 
> I will add this to the patch.
> 
> >>Signed-off-by: Pavel Tatashin 
> >>Reviewed-by: Steven Sistare 
> >>Reviewed-by: Daniel Jordan 
> >>Reviewed-by: Bob Picco 
> >
> >Considering that some HW might behave strangely and this would be rather
> >hard to debug I would be tempted to mark this for stable. It should also
> >be merged separately from the rest of the series.
> >
> >I have just one nit below
> >Acked-by: Michal Hocko 
> 
> I will address your comment, and send out a new patch. Should I send it out
> separately from the series or should I keep it inside?

I would post it separately. It doesn't depend on the rest.

> Also, before I send out a new patch, I will need to root cause and resolve
> problem found by kernel test robot , and bisected
> down to this patch.
> 
> [  156.659400] BUG: Bad page state in process swapper  pfn:03147
> [  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping: (null) index:0x1
> [  156.660917] flags: 0x0()
> [  156.661198] raw:   0001 ff80
> [  156.662006] raw: 88001f4a8120 88001ed85ce0 
> [  156.662811] page dumped because: nonzero mapcount
> [  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-rc3-00220-g1aad694 #1
> [  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
> [  156.665129] Call Trace:
> [  156.665422]  dump_stack+0x1e/0x20
> [  156.665802]  bad_page+0x122/0x148

Was the report related to this patch?
-- 
Michal Hocko
SUSE Labs


Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 08:39 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:

A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
 - Does not zero the allocated memory
 - Does not panic if request cannot be satisfied


OK, this looks good but I would not introduce memblock_virt_alloc_raw
here because we do not have any users. Please move that to "mm: optimize
early system hash allocations" which actually uses the API. It would be
easier to review it that way.


Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


other than that
Acked-by: Michal Hocko 


Sure, I could do this, but as I understood from earlier Dave Miller's 
comments, we should do one logical change at a time. Hence, introduce 
the API in one patch, use it in another. So, this is how I tried to 
organize this patch set. Is this assumption incorrect?


Re: [PATCH] PCI: Convert to using %pOF instead of full_name

2017-08-11 Thread Bjorn Helgaas
[+cc Tyrel]

On Wed, Aug 09, 2017 at 05:04:43PM -0500, Rob Herring wrote:
> On Wed, Aug 2, 2017 at 5:39 PM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > On Tue, Jul 18, 2017 at 04:43:21PM -0500, Rob Herring wrote:
> >> Now that we have a custom printf format specifier, convert users of
> >> full_name to use %pOF instead. This is preparation to remove storing
> >> of the full path string for each node.
> >>
> >> Signed-off-by: Rob Herring <r...@kernel.org>
> >> Cc: Thomas Petazzoni <thomas.petazz...@free-electrons.com>
> >> Cc: Jason Cooper <ja...@lakedaemon.net>
> >> Cc: Bjorn Helgaas <bhelg...@google.com>
> >> Cc: Thierry Reding <thierry.red...@gmail.com>
> >> Cc: Jonathan Hunter <jonath...@nvidia.com>
> >> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> >> Cc: Paul Mackerras <pau...@samba.org>
> >> Cc: Michael Ellerman <m...@ellerman.id.au>
> >> Cc: linux-...@vger.kernel.org
> >> Cc: linux-arm-ker...@lists.infradead.org
> >> Cc: linux-te...@vger.kernel.org
> >> Cc: linuxppc-dev@lists.ozlabs.org
> >
> > Applied to pci/misc for v4.14, thanks!
> 
> This hasn't shown up in -next.

Thanks, it should be in next-20170811.

I updated it to add Tyrel's reviewed-by, but that's not in -next yet.


Re: [v6 05/15] mm: don't access uninitialized struct pages

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 05:37 AM, Michal Hocko wrote:

On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:

In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

 if (page->flags) {
 VM_BUG_ON(page_zone(page) != zone);
 goto free_range;
 }

This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed, and
the only way flags are not zero is if it went through __init_single_page()
before.  But, once we change the current behavior and won't zero the memory
in memblock allocator, we cannot trust anything inside "struct page"es
until they are initialized. This patch fixes this.

This patch defines a new accessor memblock_get_reserved_pfn_range()
which returns successive ranges of reserved PFNs.  deferred_init_memmap()
calls it to determine if a PFN and its struct page has already been
initialized.


Why don't we simply check the pfn against pgdat->first_deferred_pfn?


Because we are initializing deferred pages, and all of them have a pfn 
greater than pgdat->first_deferred_pfn. However, some of the deferred 
pages were already initialized if they were reserved, via this path:


mem_init()
 free_all_bootmem()
  free_low_memory_core_early()
   for_each_reserved_mem_region()
reserve_bootmem_region()
 init_reserved_page() <- if this is deferred reserved page
  __init_single_pfn()
   __init_single_page()

So, currently, we are using the value of page->flags to figure out if 
this page has been initialized while being part of a deferred range, but 
this is not going to work for this project: since we do not zero the 
memory that is backing the struct pages, the value of page->flags can 
be anything.
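
So the point of the new accessor is that deferred_init_memmap() can ask
memblock directly instead of trusting page->flags; roughly (a sketch,
assuming the accessor returns the reserved pfn range containing or
following the given pfn):

	unsigned long spfn, epfn;

	/* Hypothetical usage of memblock_get_reserved_pfn_range(): if pfn
	 * falls inside a reserved range, its struct page already went
	 * through __init_single_page() via reserve_bootmem_region().
	 */
	memblock_get_reserved_pfn_range(pfn, &spfn, &epfn);
	if (pfn >= spfn && pfn < epfn) {
		VM_BUG_ON(page_zone(page) != zone);
		goto free_range;
	}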


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Pasha Tatashin

I guess this goes all the way down to
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")


I will add this to the patch.


Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


Considering that some HW might behave strangely and this would be rather
hard to debug I would be tempted to mark this for stable. It should also
be merged separately from the rest of the series.

I have just one nit below
Acked-by: Michal Hocko 


I will address your comment, and send out a new patch. Should I send it 
out separately from the series or should I keep it inside?


Also, before I send out a new patch, I will need to root cause and 
resolve problem found by kernel test robot , and 
bisected down to this patch.


[  156.659400] BUG: Bad page state in process swapper  pfn:03147
[  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping: (null) index:0x1
[  156.660917] flags: 0x0()
[  156.661198] raw:   0001 ff80
[  156.662006] raw: 88001f4a8120 88001ed85ce0  
[  156.662811] page dumped because: nonzero mapcount
[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-rc3-00220-g1aad694 #1
[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[  156.665129] Call Trace:
[  156.665422]  dump_stack+0x1e/0x20
[  156.665802]  bad_page+0x122/0x148

I was not able to reproduce this problem, even though I used their qemu 
script and config. But I am getting the following panic with both the 
base and the fix:


[  115.763259] VFS: Cannot open root device "ram0" or unknown-block(0,0): error -6
[  115.764511] Please append a correct "root=" boot option; here are the available partitions:
[  115.765816] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[  115.767124] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-rc4_pt_memset6-00033-g7e65200b1473 #7
[  115.768506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014

[  115.770368] Call Trace:
[  115.770771]  dump_stack+0x1e/0x20
[  115.771310]  panic+0xf8/0x2bc
[  115.771792]  mount_block_root+0x3bb/0x441
[  115.772437]  ? do_early_param+0xc5/0xc5
[  115.773051]  ? do_early_param+0xc5/0xc5
[  115.773683]  mount_root+0x7c/0x7f
[  115.774243]  prepare_namespace+0x194/0x1d1
[  115.774898]  kernel_init_freeable+0x1c8/0x1df
[  115.775575]  ? rest_init+0x13f/0x13f
[  115.776153]  kernel_init+0x14/0x142
[  115.776711]  ? rest_init+0x13f/0x13f
[  115.777285]  ret_from_fork+0x2a/0x40
[  115.777864] Kernel Offset: disabled

Their config has CONFIG_BLK_DEV_RAM disabled, but the qemu script has 
root=/dev/ram0, so I enabled it; I am still getting a panic when root 
is mounted, with both the base and the fix.


Pasha


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-11 Thread Pasha Tatashin

AFAIU register_page_bootmem_info_node is only about struct pages backing
pgdat, usemap and memmap. Those should be in reserved memblocks and we
do not initialize those at later times, they are not relevant to the
deferred initialization as your changelog suggests so the ordering with
get_page_bootmem shouldn't matter. Or am I missing something here?


The pages for pgdat, usemap, and memmap are part of the reserved 
memblocks, and thus get initialized when free_all_bootmem() is called.


So, we have something like this in mem_init()

register_page_bootmem_info
 register_page_bootmem_info_node
  get_page_bootmem
   .. setting fields here ..
   such as: page->freelist = (void *)type;

free_all_bootmem()
 free_low_memory_core_early()
  for_each_reserved_mem_region()
   reserve_bootmem_region()
init_reserved_page() <- Only if this is deferred reserved page
 __init_single_pfn()
  __init_single_page()
  memset(0) <-- Lose the set fields here!

memblock does not know about deferred pages, and can be requested to 
allocate physical pages anywhere. So, the reserved pages in memblock can 
be in both the non-deferred and the deferred parts of memory.


Without deferred pages enabled, by the time register_page_bootmem_info() 
is called every page has gone through __init_single_page(); but with 
deferred pages enabled, there is a scenario where fields can be set before 
pages go through __init_single_page(). This patch fixes it.
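
The natural shape of the fix, assuming nothing else in mem_init()
depends on the current order, is to register the bootmem info only
after the reserved struct pages have been initialized; sketched for
x86-64:

	void __init mem_init(void)
	{
		pci_iommu_alloc();

		/* Run free_all_bootmem() first so every reserved (possibly
		 * deferred) page has been through __init_single_page();
		 * the fields set by get_page_bootmem() below are then no
		 * longer wiped by its memset().
		 */
		free_all_bootmem();
		register_page_bootmem_info();

		/* ... rest of mem_init() unchanged ... */
	}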


Re: [v6 01/15] x86/mm: reserve only existing low pages

2017-08-11 Thread Pasha Tatashin

Struct pages are initialized by going through __init_single_page(). Since
the existing physical memory in memblock is represented in memblock.memory
list, struct page for every page from this list goes through
__init_single_page().


By a page _from_ this list you mean struct pages backing the physical
memory of the memblock lists?


Correct: "for every page from this list...", for every page represented 
by this list the struct page is initialized through __init_single_page()



In this patchset we will stop zeroing struct page memory during allocation.
Therefore, this bug must be fixed in order to avoid random assert failures
caused by CONFIG_DEBUG_VM_PGFLAGS triggers.

The fix is to reserve memory from the first existing PFN.


Hmm, I assume this is a result of some assert triggering, right? Which
one? Why don't we need the same treatment for other than x86 arch?


Correct, the pgflags asserts were triggered when we were setting the 
reserved flag on the struct page for PFN 0, which was never initialized 
through __init_single_page(). The reason they were triggered is that 
we set all uninitialized memory to ones in one of the debug patches.
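
The shape of the fix, then, is to clamp the low reservation to the
first PFN that actually exists; for example (an illustrative sketch of
the idea, not necessarily the exact hunk):

	/* Reserve low memory starting from the first page memblock knows
	 * about, so a non-existent PFN 0 is never marked reserved and
	 * never has flags set on an uninitialized struct page.
	 */
	phys_addr_t low_start = memblock_start_of_DRAM();

	memblock_reserve(low_start, ISA_END_ADDRESS - low_start);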



Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 


I guess that the review happened inhouse. I do not want to question its
value but it is rather strange to not hear the specific review comments
which might be useful in general and moreover even not include those
people on the CC list so they are aware of the follow up discussion.


I will bring this up with my colleagues to see how to handle this better 
in the future. I will also CC the reviewers when I send out the updated 
patch series.


Re: [v6 00/15] complete deferred page initialization

2017-08-11 Thread Michal Hocko
On Fri 11-08-17 11:13:07, Pasha Tatashin wrote:
> On 08/11/2017 03:58 AM, Michal Hocko wrote:
> >[I am sorry I didn't get to your previous versions]
> 
> Thank you for reviewing this work. I will address your comments, and
> send-out a new patches.
> 
> >>
> >>In this work we do the following:
> >>- Never read access struct page until it was initialized
> >
> >How is this enforced? What about pfn walkers? E.g. page_ext
> >initialization code (page owner in particular)
> 
> This is hard to enforce 100%. But, because we have a patch in this series
> that sets all memory that was allocated by memblock_virt_alloc_try_nid_raw()
> to ones with debug options enabled, and because Linux has a good set of
> asserts in place that check struct pages to be sane, especially the ones
> that are enabled with this config: CONFIG_DEBUG_VM_PGFLAGS. I was able to
> find many places in linux which accessed struct pages before
> __init_single_page() is performed, and fix them. Most of these places happen
> only when deferred struct page initialization code is enabled.

Yes, I am very well aware of how hard is this to guarantee. I was merely
pointing out that the changelog should be more verbose about your
testing and assumptions so that we can revalidate them.
-- 
Michal Hocko
SUSE Labs


Re: [v6 00/15] complete deferred page initialization

2017-08-11 Thread Pasha Tatashin

On 08/11/2017 03:58 AM, Michal Hocko wrote:

[I am sorry I didn't get to your previous versions]


Thank you for reviewing this work. I will address your comments, and 
send-out a new patches.




In this work we do the following:
- Never read access struct page until it was initialized


How is this enforced? What about pfn walkers? E.g. page_ext
initialization code (page owner in particular)


This is hard to enforce 100%. But we have a patch in this series that, 
with debug options enabled, sets all memory allocated by 
memblock_virt_alloc_try_nid_raw() to ones, and Linux has a good set of 
asserts in place that check struct pages for sanity, especially the ones 
enabled by CONFIG_DEBUG_VM_PGFLAGS. With those, I was able to find many 
places in Linux which accessed struct pages before __init_single_page() 
is performed, and fix them. Most of these places happen only when 
deferred struct page initialization code is enabled.





- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


Please give us a more highlevel description of how your reimplementation
works and how is the patchset organized. I will go through those patches
but it is always good to give an overview in the cover letter to make
the review easier.


Ok, will add more explanation to the cover letter.


Single threaded struct page init: 7.6s/T improvement
Deferred struct page init: 10.2s/T improvement


What are before and after numbers and how have you measured them.


When I send out this series the next time, I will include before vs. 
after numbers for the machine I tested on, including links to dmesg output.


I used my early boot timestamps for x86 and sparc to measure the data. 
Early boot timestamps for sparc are already part of mainline; the x86 
patches are out for review: https://lkml.org/lkml/2017/8/10/946 (should 
have changed the subject line there :) ).


[PATCH v2 4/8] powerpc/xive: introduce xive_esb_write()

2017-08-11 Thread Cédric Le Goater
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8a58662ed793..ac5f18a66742 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -203,6 +203,15 @@ static u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
return (u8)val;
 }
 
+static void xive_esb_write(struct xive_irq_data *xd, u32 offset, u64 data)
+{
+   /* Handle HW errata */
+   if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+   offset |= offset << 4;
+
+   out_be64(xd->eoi_mmio + offset, data);
+}
+
 #ifdef CONFIG_XMON
 static void xive_dump_eq(const char *name, struct xive_q *q)
 {
@@ -297,7 +306,7 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 {
/* If the XIVE supports the new "store EOI facility, use it */
if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
-   out_be64(xd->eoi_mmio + XIVE_ESB_STORE_EOI, 0);
+   xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW) {
/*
 * The FW told us to call it. This happens for some
-- 
2.13.4



Re: [PATCH] ASoC: Freescale: Delete an error message for a failed memory allocation in three functions

2017-08-11 Thread Joe Perches
On Fri, 2017-08-11 at 15:32 +0200, SF Markus Elfring wrote:
> From 885ccd6c63291dcd4854a0cbaab5188cdc3db805 Mon Sep 17 00:00:00 2001
> From: Markus Elfring 
> Date: Fri, 11 Aug 2017 15:10:43 +0200
> Subject: [PATCH] ASoC: Freescale: Delete an error message for a failed memory allocation in three functions
> 
> Omit an extra message for a memory allocation failure in these functions.
> 
> This issue was detected by using the Coccinelle software.
> 
> Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf

This doesn't have anything to do with refactoring strings.

Just note that allocations without GFP_NOWARN do a dump_stack()

> diff --git a/sound/soc/fsl/fsl_asrc_dma.c b/sound/soc/fsl/fsl_asrc_dma.c
[]
> @@ -282,7 +282,5 @@ static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream)
> - if (!pair) {
> - dev_err(dev, "failed to allocate pair\n");
> + if (!pair)
>   return -ENOMEM;
> - }

Use normal diff output that shows 3 lines of context
above and below the patched lines.


[PATCH v2 8/8] powerpc/xive: improve debugging macros

2017-08-11 Thread Cédric Le Goater
Having the CPU identifier in the debug logs is helpful when tracking
issues. Also add some more logging and fix a compile issue in
xive_do_source_eoi().

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8fd58773c241..1c087ed7427f 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -40,7 +40,8 @@
 #undef DEBUG_ALL
 
 #ifdef DEBUG_ALL
-#define DBG_VERBOSE(fmt...)	pr_devel(fmt)
+#define DBG_VERBOSE(fmt, ...)	pr_devel("cpu %d - " fmt, \
+					 smp_processor_id(), ## __VA_ARGS__)
 #else
 #define DBG_VERBOSE(fmt...)	do { } while(0)
 #endif
@@ -344,7 +345,7 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
else {
eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
-   DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
+   DBG_VERBOSE("eoi_val=%x\n", eoi_val);
 
/* Re-trigger if needed */
if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
@@ -1004,6 +1005,9 @@ static void xive_ipi_eoi(struct irq_data *d)
 {
 	struct xive_cpu *xc = __this_cpu_read(xive_cpu);
 
 	/* Handle possible race with unplug and drop stale IPIs */
 	if (!xc)
 		return;
+
+	/* Log after the NULL check so a stale IPI cannot dereference xc */
+	DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
+		    d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
-- 
2.13.4



[PATCH v2 7/8] powerpc/xive: add XIVE Exploitation Mode to CAS

2017-08-11 Thread Cédric Le Goater
On POWER9, the Client Architecture Support (CAS) negotiation process
determines whether the guest operates in XIVE Legacy compatibility or
in XIVE exploitation mode. Now that we have initial guest support for
the XIVE interrupt controller, let's inform the hypervisor what we can
do.

The platform advertises the XIVE Exploitation Mode support using the
property "ibm,arch-vec-5-platform-support-vec-5", byte 23 bits 0-1 :

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only
 - 0b10 XIVE legacy or exploitation mode

The OS asks for XIVE Exploitation Mode support using the property
"ibm,architecture-vec-5", byte 23 bits 0-1:

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only

Signed-off-by: Cédric Le Goater 
---

 Changes since v1:

 - fixed XIVE mode parsing 
 - integrated the prom.h definitions
 - introduced extra bits definition : OV5_XIVE_LEGACY and OV5_XIVE_EITHER
 
 arch/powerpc/include/asm/prom.h |  5 -
 arch/powerpc/kernel/prom_init.c | 34 +-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 35c00d7a0cf8..825bd5998701 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -159,7 +159,10 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_842 0x1140  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
-#define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
+#define OV5_XIVE_SUPPORT   0x17C0  /* XIVE Exploitation Support Mask */
+#define OV5_XIVE_LEGACY		0x1700	/* XIVE legacy mode Only */
+#define OV5_XIVE_EXPLOIT	0x1740	/* XIVE exploitation mode Only */
+#define OV5_XIVE_EITHER		0x1780	/* XIVE legacy or exploitation mode */
 /* MMU Base Architecture */
 #define OV5_MMU_SUPPORT		0x18C0	/* MMU Mode Support Mask */
 #define OV5_MMU_HASH   0x1800  /* Hash MMU Only */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 613f79f03877..02190e90c7ae 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -177,6 +177,7 @@ struct platform_support {
bool hash_mmu;
bool radix_mmu;
bool radix_gtse;
+   bool xive;
 };
 
 /* Platforms codes are now obsolete in the kernel. Now only used within this
@@ -1041,6 +1042,27 @@ static void __init prom_parse_mmu_model(u8 val,
}
 }
 
+static void __init prom_parse_xive_model(u8 val,
+struct platform_support *support)
+{
+   switch (val) {
+   case OV5_FEAT(OV5_XIVE_EITHER): /* Either Available */
+   prom_debug("XIVE - either mode supported\n");
+   support->xive = true;
+   break;
+   case OV5_FEAT(OV5_XIVE_EXPLOIT): /* Only Exploitation mode */
+   prom_debug("XIVE - exploitation mode supported\n");
+   support->xive = true;
+   break;
+   case OV5_FEAT(OV5_XIVE_LEGACY): /* Only Legacy mode */
+   prom_debug("XIVE - legacy mode supported\n");
+   break;
+   default:
+   prom_debug("Unknown xive support option: 0x%x\n", val);
+   break;
+   }
+}
+
 static void __init prom_parse_platform_support(u8 index, u8 val,
   struct platform_support *support)
 {
@@ -1054,6 +1076,10 @@ static void __init prom_parse_platform_support(u8 index, u8 val,
support->radix_gtse = true;
}
break;
+   case OV5_INDX(OV5_XIVE_SUPPORT): /* Interrupt mode */
+   prom_parse_xive_model(val & OV5_FEAT(OV5_XIVE_SUPPORT),
+ support);
+   break;
}
 }
 
@@ -1062,7 +1088,8 @@ static void __init prom_check_platform_support(void)
struct platform_support supported = {
.hash_mmu = false,
.radix_mmu = false,
-   .radix_gtse = false
+   .radix_gtse = false,
+   .xive = false
};
int prop_len = prom_getproplen(prom.chosen,
   "ibm,arch-vec-5-platform-support");
@@ -1095,6 +1122,11 @@ static void __init prom_check_platform_support(void)
/* We're probably on a legacy hypervisor */
prom_debug("Assuming legacy hash support\n");
}
+
+   if (supported.xive) {
+   prom_debug("Asking for XIVE\n");
+   ibm_architecture_vec.vec5.intarch = OV5_FEAT(OV5_XIVE_EXPLOIT);
+   }
 }
 
 static void __init prom_send_capabilities(void)
-- 
2.13.4



[PATCH v2 0/8] guest exploitation of the XIVE interrupt controller

2017-08-11 Thread Cédric Le Goater
Hello,

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the legacy model, as found on POWER8, or in
XIVE exploitation mode, the newer POWER9 interrupt model. This
patchset is a first proposal to add XIVE support in the sPAPR machine.

Tested with a QEMU XIVE model for sPAPR machine and with the Power
hypervisor.

Code is here:

  https://github.com/legoater/linux/commits/xive
  https://github.com/legoater/qemu/commits/xive   

Thanks,

C.

Changes since v1 :

 - introduced a common subroutine xive_queue_page_alloc()
 - introduced a xive_teardown_cpu() routine
 - removed P9 doorbell support when xive is enabled.
 - fixed xive_esb_read() naming
 - fixed XIVE mode parsing in CAS (just got the final specs)

Changes since RFC :

 - renamed backend to 'spapr'
 - fixed hotplug support
 - fixed kexec support
 - fixed src_chip value (XIVE_INVALID_CHIP_ID)
 - added doorbell support
 - added some debug logs
 - added  H_INT_ESB hcall
 - took into account '/ibm,plat-res-int-priorities'
 - fixed WARNING in xive_find_target_in_mask()

Cédric Le Goater (8):
  powerpc/xive: introduce a common routine xive_queue_page_alloc()
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: rename xive_poke_esb() in xive_esb_read()
  powerpc/xive: introduce xive_esb_write()
  powerpc/xive: add the HW IRQ number under xive_irq_data
  powerpc/xive: introduce H_INT_ESB hcall
  powerpc/xive: add XIVE Exploitation Mode to CAS
  powerpc/xive: improve debugging macros

 arch/powerpc/include/asm/hvcall.h|  13 +-
 arch/powerpc/include/asm/prom.h  |   5 +-
 arch/powerpc/include/asm/xive.h  |   5 +
 arch/powerpc/kernel/prom_init.c  |  34 +-
 arch/powerpc/platforms/pseries/Kconfig   |   1 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  11 +-
 arch/powerpc/platforms/pseries/kexec.c   |   6 +-
 arch/powerpc/platforms/pseries/setup.c   |   8 +-
 arch/powerpc/platforms/pseries/smp.c |  27 +-
 arch/powerpc/sysdev/xive/Kconfig |   5 +
 arch/powerpc/sysdev/xive/Makefile|   1 +
 arch/powerpc/sysdev/xive/common.c|  76 ++-
 arch/powerpc/sysdev/xive/native.c|  18 +-
 arch/powerpc/sysdev/xive/spapr.c | 661 +++
 arch/powerpc/sysdev/xive/xive-internal.h |   7 +
 15 files changed, 843 insertions(+), 35 deletions(-)
 create mode 100644 arch/powerpc/sysdev/xive/spapr.c

-- 
2.13.4



[PATCH v2 6/8] powerpc/xive: introduce H_INT_ESB hcall

2017-08-11 Thread Cédric Le Goater
The H_INT_ESB hcall() is used to issue a load or store to the ESB page
instead of using the MMIO pages. This can be used as a workaround on
some HW issues. The OS knows that this hcall should be used on an
interrupt source when the ESB hcall flag is set to 1 in the hcall
H_INT_GET_SOURCE_INFO.

To maintain the frontier between the xive frontend and backend, we
introduce a new xive operation 'esb_rw' to be used in the routines
doing memory accesses on the ESBs.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h  |  1 +
 arch/powerpc/sysdev/xive/common.c| 10 ++--
 arch/powerpc/sysdev/xive/spapr.c | 44 +++-
 arch/powerpc/sysdev/xive/xive-internal.h |  1 +
 4 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 64ec9bbcf03e..371fbebf1ec9 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -56,6 +56,7 @@ struct xive_irq_data {
 #define XIVE_IRQ_FLAG_SHIFT_BUG0x04
 #define XIVE_IRQ_FLAG_MASK_FW  0x08
 #define XIVE_IRQ_FLAG_EOI_FW   0x10
+#define XIVE_IRQ_FLAG_H_INT_ESB0x20
 
 #define XIVE_INVALID_CHIP_ID   -1
 
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index ac5f18a66742..8fd58773c241 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -198,7 +198,10 @@ static u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
-   val = in_be64(xd->eoi_mmio + offset);
+   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
+   val = xive_ops->esb_rw(xd->hw_irq, offset, 0, 0);
+   else
+   val = in_be64(xd->eoi_mmio + offset);
 
return (u8)val;
 }
@@ -209,7 +212,10 @@ static void xive_esb_write(struct xive_irq_data *xd, u32 offset, u64 data)
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
-   out_be64(xd->eoi_mmio + offset, data);
+   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
+   xive_ops->esb_rw(xd->hw_irq, offset, data, 1);
+   else
+   out_be64(xd->eoi_mmio + offset, data);
 }
 
 #ifdef CONFIG_XMON
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 7efcee928569..ecaa9a1cb63b 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -224,7 +224,46 @@ static long plpar_int_sync(unsigned long flags)
return 0;
 }
 
-#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60)) /* TODO */
+#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63))
+
+static long plpar_int_esb(unsigned long flags,
+ unsigned long lisn,
+ unsigned long offset,
+ unsigned long in_data,
+ unsigned long *out_data)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+   long rc;
+
+   pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
+   flags,  lisn, offset, in_data);
+
+   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
+   if (rc) {
+   pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
+  lisn, offset, rc);
+   return  rc;
+   }
+
+   *out_data = retbuf[0];
+
+   return 0;
+}
+
+static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
+{
+   unsigned long read_data;
+   long rc;
+
+   rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0,
+			   lisn, offset, data, &read_data);
+   if (rc)
+   return -1;
+
+   return write ? 0 : read_data;
+}
+
+#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60))
 #define XIVE_SRC_LSI   (1ull << (63 - 61))
 #define XIVE_SRC_TRIGGER   (1ull << (63 - 62))
 #define XIVE_SRC_STORE_EOI (1ull << (63 - 63))
@@ -244,6 +283,8 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
if (rc)
return  -EINVAL;
 
+   if (flags & XIVE_SRC_H_INT_ESB)
+   data->flags  |= XIVE_IRQ_FLAG_H_INT_ESB;
if (flags & XIVE_SRC_STORE_EOI)
data->flags  |= XIVE_IRQ_FLAG_STORE_EOI;
if (flags & XIVE_SRC_LSI)
@@ -486,6 +527,7 @@ static const struct xive_ops xive_spapr_ops = {
.setup_cpu  = xive_spapr_setup_cpu,
.teardown_cpu   = xive_spapr_teardown_cpu,
.sync_source= xive_spapr_sync_source,
+   .esb_rw = xive_spapr_esb_rw,
 #ifdef CONFIG_SMP
.get_ipi= xive_spapr_get_ipi,
.put_ipi= xive_spapr_put_ipi,
diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index dd1e2022cce4..f34abed0c05f 100644
--- 

[PATCH v2 1/8] powerpc/xive: introduce a common routine xive_queue_page_alloc()

2017-08-11 Thread Cédric Le Goater
This routine will be used in the spapr backend. Also introduce a short
xive_alloc_order() helper.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c| 16 
 arch/powerpc/sysdev/xive/native.c| 16 +---
 arch/powerpc/sysdev/xive/xive-internal.h |  6 ++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 6e0c9dee724f..26999ceae20e 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1424,6 +1424,22 @@ bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
return true;
 }
 
+__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift)
+{
+   unsigned int alloc_order;
+   struct page *pages;
+   __be32 *qpage;
+
+   alloc_order = xive_alloc_order(queue_shift);
+   pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
+   if (!pages)
+   return ERR_PTR(-ENOMEM);
+   qpage = (__be32 *)page_address(pages);
+   memset(qpage, 0, 1 << queue_shift);
+
+   return qpage;
+}
+
 static int __init xive_off(char *arg)
 {
xive_cmdline_disabled = true;
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 0f95476b01f6..ef92a83090e1 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -202,17 +202,12 @@ EXPORT_SYMBOL_GPL(xive_native_disable_queue);
 static int xive_native_setup_queue(unsigned int cpu, struct xive_cpu *xc, u8 prio)
 {
	struct xive_q *q = &xc->queue[prio];
-   unsigned int alloc_order;
-   struct page *pages;
__be32 *qpage;
 
-   alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
-   (xive_queue_shift - PAGE_SHIFT) : 0;
-   pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
-   if (!pages)
-   return -ENOMEM;
-   qpage = (__be32 *)page_address(pages);
-   memset(qpage, 0, 1 << xive_queue_shift);
+   qpage = xive_queue_page_alloc(cpu, xive_queue_shift);
+   if (IS_ERR(qpage))
+   return PTR_ERR(qpage);
+
return xive_native_configure_queue(get_hard_smp_processor_id(cpu),
					   q, prio, qpage, xive_queue_shift, false);
 }
@@ -227,8 +222,7 @@ static void xive_native_cleanup_queue(unsigned int cpu, struct xive_cpu *xc, u8
 * from an IPI and iounmap isn't safe
 */
__xive_native_disable_queue(get_hard_smp_processor_id(cpu), q, prio);
-   alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
-   (xive_queue_shift - PAGE_SHIFT) : 0;
+   alloc_order = xive_alloc_order(xive_queue_shift);
free_pages((unsigned long)q->qpage, alloc_order);
q->qpage = NULL;
 }
diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index d07ef2d29caf..dd1e2022cce4 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -56,6 +56,12 @@ struct xive_ops {
 
 bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
u8 max_prio);
+__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift);
+
+static inline u32 xive_alloc_order(u32 queue_shift)
+{
+   return (queue_shift > PAGE_SHIFT) ? (queue_shift - PAGE_SHIFT) : 0;
+}
 
 extern bool xive_cmdline_disabled;
 
-- 
2.13.4



[PATCH v2 5/8] powerpc/xive: add the HW IRQ number under xive_irq_data

2017-08-11 Thread Cédric Le Goater
It will be required later by the H_INT_ESB hcall.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h   | 1 +
 arch/powerpc/sysdev/xive/native.c | 2 ++
 arch/powerpc/sysdev/xive/spapr.c  | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 473f133a8555..64ec9bbcf03e 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -45,6 +45,7 @@ struct xive_irq_data {
void __iomem *trig_mmio;
u32 esb_shift;
int src_chip;
+   u32 hw_irq;
 
/* Setup/used by frontend */
int target;
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index ef92a83090e1..f8bcff15b0f9 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -82,6 +82,8 @@ int xive_native_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
return -ENOMEM;
}
 
+   data->hw_irq = hw_irq;
+
if (!data->trig_page)
return 0;
if (data->trig_page == data->eoi_page) {
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 89e5a57693db..7efcee928569 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -264,6 +264,8 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
return -ENOMEM;
}
 
+   data->hw_irq = hw_irq;
+
/* Full function page supports trigger */
if (flags & XIVE_SRC_TRIGGER) {
data->trig_mmio = data->eoi_mmio;
-- 
2.13.4



[PATCH v2 3/8] powerpc/xive: rename xive_poke_esb() in xive_esb_read()

2017-08-11 Thread Cédric Le Goater
xive_poke_esb() is performing a load/read so it is better named as
xive_esb_read() as we will need to introduce a xive_esb_write()
routine. Also use the XIVE_ESB_LOAD_EOI offset when EOI'ing LSI
interrupts.

Signed-off-by: Cédric Le Goater 
---

 Changes since v1:

 - fixed naming.
 
 arch/powerpc/sysdev/xive/common.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8774af7a4105..8a58662ed793 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -190,7 +190,7 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
  * This is used to perform the magic loads from an ESB
  * described in xive.h
  */
-static u8 xive_poke_esb(struct xive_irq_data *xd, u32 offset)
+static u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
 {
u64 val;
 
@@ -227,7 +227,7 @@ void xmon_xive_do_dump(int cpu)
	xive_dump_eq("IRQ", &xc->queue[xive_irq_priority]);
 #ifdef CONFIG_SMP
{
-		u64 val = xive_poke_esb(&xc->ipi_data, XIVE_ESB_GET);
+		u64 val = xive_esb_read(&xc->ipi_data, XIVE_ESB_GET);
xmon_printf("  IPI state: %x:%c%c\n", xc->hw_ipi,
val & XIVE_ESB_VAL_P ? 'P' : 'p',
		    val & XIVE_ESB_VAL_Q ? 'Q' : 'q');
@@ -326,9 +326,9 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 * properly.
 */
if (xd->flags & XIVE_IRQ_FLAG_LSI)
-   in_be64(xd->eoi_mmio);
+   xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
else {
-   eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
 
/* Re-trigger if needed */
@@ -383,12 +383,12 @@ static void xive_do_source_set_mask(struct xive_irq_data *xd,
 * ESB accordingly on unmask.
 */
if (mask) {
-   val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
+   val = xive_esb_read(xd, XIVE_ESB_SET_PQ_01);
xd->saved_p = !!(val & XIVE_ESB_VAL_P);
} else if (xd->saved_p)
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
else
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
 }
 
 /*
@@ -768,7 +768,7 @@ static int xive_irq_retrigger(struct irq_data *d)
 * To perform a retrigger, we first set the PQ bits to
 * 11, then perform an EOI.
 */
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
 
/*
 * Note: We pass "0" to the hw_irq argument in order to
@@ -803,7 +803,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
irqd_set_forwarded_to_vcpu(d);
 
/* Set it to PQ=10 state to prevent further sends */
-   pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+   pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
 
/* No target ? nothing to do */
if (xd->target == XIVE_INVALID_TARGET) {
@@ -832,7 +832,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 * for sure the queue slot is no longer in use.
 */
if (pq & 2) {
-   pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
+   pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
xd->saved_p = true;
 
/*
-- 
2.13.4



[PATCH v2 2/8] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-11 Thread Cédric Le Goater
This is the framework for using XIVE in a PowerVM guest. The support
is very similar to the native one in a much simpler form.

Instead of OPAL calls, a set of Hypervisors call are used to configure
the interrupt sources and the event/notification queues of the guest:

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines to which "target" and "priority" is assigned to a source

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification config associated with the
   queue, only unconditional notification for the moment.  Reset is
   performed with a queue size of 0 and queueing is disabled in that
   case.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the partition's interrupt exploitation structures to
   their initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issue a synchronisation on a source to make sure all
   notifications have reached their queue.

As for XICS, the XIVE interface for the guest is described in the
device tree under the "interrupt-controller" node. A couple of new
properties are specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   management areas (TIMA) for the user level and for the OS level. Only
   the OS level is taken into account.

 - "ibm,xive-eq-sizes"

   the size of the event queues.

 - "ibm,xive-lisn-ranges"

   the interrupt number ranges assigned to the guest. These are
   allocated using a simple bitmap.

and also :

 - "/ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use.
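
(For illustration, the backend consumes the properties above with the
usual OF helpers; a rough sketch with illustrative names, error
handling trimmed:)

	static bool __init xive_spapr_parse_dt(struct device_node *np)
	{
		const __be32 *reg, *ranges;
		int len;

		/* TIMA base address and size; only the OS-level page is used */
		reg = of_get_property(np, "reg", &len);

		/* Interrupt number ranges handed to the guest; each pair
		 * feeds the IRQ bitmap allocator.
		 */
		ranges = of_get_property(np, "ibm,xive-lisn-ranges", &len);

		return reg && ranges;
	}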

Tested with a QEMU XIVE model for pseries and with the Power
hypervisor

Signed-off-by: Cédric Le Goater 
---

 Changes since v1 :

 - added a xive_teardown_cpu() routine
 - removed P9 doorbell support when xive is enabled.
 - merged in patch for "ibm,plat-res-int-priorities" support
 - added some comments on the usage of raw I/O accessors.
 
 Changes since RFC :

 - renamed backend to spapr
 - fixed hotplug support
 - fixed kexec support
 - fixed src_chip value (XIVE_INVALID_CHIP_ID)
 - added doorbell support 
 - added some hcall debug logs

 arch/powerpc/include/asm/hvcall.h|  13 +-
 arch/powerpc/include/asm/xive.h  |   3 +
 arch/powerpc/platforms/pseries/Kconfig   |   1 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  11 +-
 arch/powerpc/platforms/pseries/kexec.c   |   6 +-
 arch/powerpc/platforms/pseries/setup.c   |   8 +-
 arch/powerpc/platforms/pseries/smp.c |  27 +-
 arch/powerpc/sysdev/xive/Kconfig |   5 +
 arch/powerpc/sysdev/xive/Makefile|   1 +
 arch/powerpc/sysdev/xive/common.c|  13 +
 arch/powerpc/sysdev/xive/spapr.c | 617 +++
 11 files changed, 697 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/sysdev/xive/spapr.c

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 57d38b504ff7..3d34dc0869f6 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -280,7 +280,18 @@
 #define H_RESIZE_HPT_COMMIT0x370
 #define H_REGISTER_PROC_TBL0x37C
 #define H_SIGNAL_SYS_RESET 0x380
-#define MAX_HCALL_OPCODE   H_SIGNAL_SYS_RESET
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB   0x3C8
+#define H_INT_SYNC  0x3CC
+#define H_INT_RESET 0x3D0
+#define MAX_HCALL_OPCODE   H_INT_RESET
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index c23ff4389ca2..473f133a8555 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -110,11 +110,13 @@ extern bool __xive_enabled;
 
 static inline bool xive_enabled(void) { return __xive_enabled; }
 
+extern bool xive_spapr_init(void);
 extern bool xive_native_init(void);
 extern void xive_smp_probe(void);
 extern int  xive_smp_prepare_cpu(unsigned int cpu);
 extern void xive_smp_setup_cpu(void);
 extern void xive_smp_disable_cpu(void);
+extern void xive_teardown_cpu(void);
 extern void 

[PATCH] ASoC: Freescale: Delete an error message for a failed memory allocation in three functions

2017-08-11 Thread SF Markus Elfring
From 885ccd6c63291dcd4854a0cbaab5188cdc3db805 Mon Sep 17 00:00:00 2001
From: Markus Elfring 
Date: Fri, 11 Aug 2017 15:10:43 +0200
Subject: [PATCH] ASoC: Freescale: Delete an error message for a failed memory allocation in three functions

Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring 
---
 sound/soc/fsl/fsl_asrc_dma.c | 4 +---
 sound/soc/fsl/fsl_dma.c  | 1 -
 sound/soc/fsl/fsl_ssi.c  | 4 +---
 3 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc_dma.c b/sound/soc/fsl/fsl_asrc_dma.c
index 282d841840b1..2baf19608bd0 100644
--- a/sound/soc/fsl/fsl_asrc_dma.c
+++ b/sound/soc/fsl/fsl_asrc_dma.c
@@ -282,7 +282,5 @@ static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream)
-   if (!pair) {
-   dev_err(dev, "failed to allocate pair\n");
+   if (!pair)
return -ENOMEM;
-   }
 
pair->asrc_priv = asrc_priv;
 
diff --git a/sound/soc/fsl/fsl_dma.c b/sound/soc/fsl/fsl_dma.c
index ccadefceeff2..0ce172f86d6c 100644
--- a/sound/soc/fsl/fsl_dma.c
+++ b/sound/soc/fsl/fsl_dma.c
@@ -907,5 +907,4 @@ static int fsl_soc_dma_probe(struct platform_device *pdev)
if (!dma) {
-		dev_err(&pdev->dev, "could not allocate dma object\n");
of_node_put(ssi_np);
return -ENOMEM;
}
diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
index 173cb8496641..64598d1183f8 100644
--- a/sound/soc/fsl/fsl_ssi.c
+++ b/sound/soc/fsl/fsl_ssi.c
@@ -1435,7 +1435,5 @@ static int fsl_ssi_probe(struct platform_device *pdev)
-   if (!ssi_private) {
-		dev_err(&pdev->dev, "could not allocate DAI object\n");
+   if (!ssi_private)
return -ENOMEM;
-   }
 
ssi_private->soc = of_id->data;
ssi_private->dev = >dev;
-- 
2.14.0



Re: [v6 15/15] mm: debug for raw allocator

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:49, Pavel Tatashin wrote:
> When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
> returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
> places expect zeroed memory.

Please fold this into the patch which introduces
memblock_virt_alloc_try_nid_raw. I am not sure CONFIG_DEBUG_VM is the
best config because that tends to be enabled quite often. Maybe
CONFIG_MEMBLOCK_DEBUG? Or even make it kernel command line parameter?

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  mm/memblock.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 3fbf3bcb52d9..29fcb1dd8a81 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1363,12 +1363,19 @@ void * __init memblock_virt_alloc_try_nid_raw(
>   phys_addr_t min_addr, phys_addr_t max_addr,
>   int nid)
>  {
> + void *ptr;
> +
>   memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
>__func__, (u64)size, (u64)align, nid, (u64)min_addr,
>(u64)max_addr, (void *)_RET_IP_);
>  
> - return memblock_virt_alloc_internal(size, align,
> - min_addr, max_addr, nid);
> + ptr = memblock_virt_alloc_internal(size, align,
> +min_addr, max_addr, nid);
> +#ifdef CONFIG_DEBUG_VM
> + if (ptr && size > 0)
> + memset(ptr, 0xff, size);
> +#endif
> + return ptr;
>  }
>  
>  /**
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 14/15] mm: optimize early system hash allocations

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:48, Pavel Tatashin wrote:
> Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify
> that memory that was allocated for system hash needs to be zeroed,
> otherwise the memory does not need to be zeroed, and client will initialize
> it.
> 
> If memory does not need to be zero'd, call the new
> memblock_virt_alloc_raw() interface, and thus improve the boot performance.
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

OK, but as mentioned in the previous patch add memblock_virt_alloc_raw
in this patch.

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4d32c1fa4c6c..000806298dfb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7354,18 +7354,17 @@ void *__init alloc_large_system_hash(const char 
> *tablename,
>  
>   log2qty = ilog2(numentries);
>  
> - /*
> -  * memblock allocator returns zeroed memory already, so HASH_ZERO is
> -  * currently not used when HASH_EARLY is specified.
> -  */
>   gfp_flags = (flags & HASH_ZERO) ? GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC;
>   do {
>   size = bucketsize << log2qty;
> - if (flags & HASH_EARLY)
> - table = memblock_virt_alloc_nopanic(size, 0);
> - else if (hashdist)
> + if (flags & HASH_EARLY) {
> + if (flags & HASH_ZERO)
> + table = memblock_virt_alloc_nopanic(size, 0);
> + else
> + table = memblock_virt_alloc_raw(size, 0);
> + } else if (hashdist) {
>   table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
> - else {
> + } else {
>   /*
>* If bucketsize is not a power-of-two, we may free
>* some pages at the end of hash table which
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 13/15] mm: stop zeroing memory during allocation in vmemmap

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:47, Pavel Tatashin wrote:
> Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
> we will get the performance improvement by zeroing the memory in parallel
> when struct pages are zeroed.

First of all this should be probably merged with the previous patch. The
I think vmemmap_alloc_block would be better to split up into
__vmemmap_alloc_block which doesn't zero and vmemmap_alloc_block which
does zero which would reduce the memset callsites and it would make it
slightly more robust interface.
 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  mm/sparse-vmemmap.c | 6 +++---
>  mm/sparse.c | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index d40c721ab19f..3b646b5ce1b6 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
>   unsigned long align,
>   unsigned long goal)
>  {
> - return memblock_virt_alloc_try_nid(size, align, goal,
> + return memblock_virt_alloc_try_nid_raw(size, align, goal,
>   BOOTMEM_ALLOC_ACCESSIBLE, node);
>  }
>  
> @@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
>  
>   if (node_state(node, N_HIGH_MEMORY))
>   page = alloc_pages_node(
> - node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> + node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
>   get_order(size));
>   else
>   page = alloc_pages(
> - GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> + GFP_KERNEL | __GFP_RETRY_MAYFAIL,
>   get_order(size));
>   if (page)
>   return page_address(page);
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7b4be3fd5cac..0e315766ad11 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -441,9 +441,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>   }
>  
>   size = PAGE_ALIGN(size);
> - map = memblock_virt_alloc_try_nid(size * map_count,
> -   PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> -   BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
> + map = memblock_virt_alloc_try_nid_raw(size * map_count,
> +   PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> +   BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
>   if (map) {
>   for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
>   if (!present_section_nr(pnum))
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 09/15] sparc64: optimized struct page zeroing

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:43, Pavel Tatashin wrote:
> Add an optimized mm_zero_struct_page(), so struct page's are zeroed without
> calling memset(). We do eight to ten regular stores based on the size of
> struct page. Compiler optimizes out the conditions of switch() statement.

Again, this doesn't explain why we need this. You have mentioned those
reasons in some previous emails but be explicit here please.

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  arch/sparc/include/asm/pgtable_64.h | 30 ++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 6fbd931f0570..cee5cc7ccc51 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr) (mem_map_zero)
>  
> +/* This macro must be updated when the size of struct page grows above 80
> + * or reduces below 64.
> + * The idea is that the compiler optimizes out the switch() statement, and only
> + * leaves clrx instructions
> + */
> +#define  mm_zero_struct_page(pp) do {
> \
> + unsigned long *_pp = (void *)(pp);  \
> + \
> +  /* Check that struct page is either 64, 72, or 80 bytes */ \
> + BUILD_BUG_ON(sizeof(struct page) & 7);  \
> + BUILD_BUG_ON(sizeof(struct page) < 64); \
> + BUILD_BUG_ON(sizeof(struct page) > 80); \
> + \
> + switch (sizeof(struct page)) {  \
> + case 80:\
> + _pp[9] = 0; /* fallthrough */   \
> + case 72:\
> + _pp[8] = 0; /* fallthrough */   \
> + default:\
> + _pp[7] = 0; \
> + _pp[6] = 0; \
> + _pp[5] = 0; \
> + _pp[4] = 0; \
> + _pp[3] = 0; \
> + _pp[2] = 0; \
> + _pp[1] = 0; \
> + _pp[0] = 0; \
> + }   \
> +} while (0)
> +
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>  * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 08/15] mm: zero struct pages during initialization

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:42, Pavel Tatashin wrote:
> Add struct page zeroing as a part of initialization of other fields in
> __init_single_page().

I believe this deserves much more detailed explanation why this is safe.
What actually prevents any pfn walker from seeing an uninitialized
struct page? Please make your assumptions explicit in the commit log so
that we can check them independently.

Also, this is done with some purpose, which is performance, right? You
have mentioned that in the cover letter, but if somebody is going to read
through git logs this wouldn't be obvious from the specific commit.
So add that information here as well. Especially numbers will be
interesting.

As a sidenote, this will need some more followups for memory hotplug
after my recent changes which are not merged yet but I will take care of
that.

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

After the relevant information is added feel free add
Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 9 +
>  mm/page_alloc.c| 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5e8569..183ac5e733db 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -93,6 +93,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_forbids_zeropage(X)   (0)
>  #endif
>  
> +/*
> + * On some architectures it is expensive to call memset() for small sizes.
> + * Those architectures should provide their own implementation of "struct 
> page"
> + * zeroing by defining this macro in <asm/pgtable.h>.
> + */
> +#ifndef mm_zero_struct_page
> +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> +#endif
> +
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
>   * per mm struct. Users can overwrite this number by sysctl but there is a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 983de0a8047b..4d32c1fa4c6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1168,6 +1168,7 @@ static void free_one_page(struct zone *zone,
>  static void __meminit __init_single_page(struct page *page, unsigned long 
> pfn,
>   unsigned long zone, int nid)
>  {
> + mm_zero_struct_page(page);
>   set_page_links(page, zone, nid, pfn);
>   init_page_count(page);
>   page_mapcount_reset(page);
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 07/15] mm: defining memblock_virt_alloc_try_nid_raw

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:41, Pavel Tatashin wrote:
> A new variant of memblock_virt_alloc_* allocations:
> memblock_virt_alloc_try_nid_raw()
> - Does not zero the allocated memory
> - Does not panic if request cannot be satisfied

OK, this looks good but I would not introduce memblock_virt_alloc_raw
here because we do not have any users. Please move that to "mm: optimize
early system hash allocations" which actually uses the API. It would be
easier to review it that way.

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

other than that
Acked-by: Michal Hocko 

> ---
>  include/linux/bootmem.h | 27 +
>  mm/memblock.c   | 53 
> ++---
>  2 files changed, 73 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index e223d91b6439..ea30b3987282 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
>  #define BOOTMEM_ALLOC_ANYWHERE   (~(phys_addr_t)0)
>  
>  /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
> +void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
> +   phys_addr_t min_addr,
> +   phys_addr_t max_addr, int nid);
>  void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
>   phys_addr_t align, phys_addr_t min_addr,
>   phys_addr_t max_addr, int nid);
> @@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
>   NUMA_NO_NODE);
>  }
>  
> +static inline void * __init memblock_virt_alloc_raw(
> + phys_addr_t size,  phys_addr_t align)
> +{
> + return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
> + BOOTMEM_ALLOC_ACCESSIBLE,
> + NUMA_NO_NODE);
> +}
> +
>  static inline void * __init memblock_virt_alloc_nopanic(
>   phys_addr_t size, phys_addr_t align)
>  {
> @@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
>   return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
>  }
>  
> +static inline void * __init memblock_virt_alloc_raw(
> + phys_addr_t size,  phys_addr_t align)
> +{
> + if (!align)
> + align = SMP_CACHE_BYTES;
> + return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
> +}
> +
>  static inline void * __init memblock_virt_alloc_nopanic(
>   phys_addr_t size, phys_addr_t align)
>  {
> @@ -309,6 +328,14 @@ static inline void * __init 
> memblock_virt_alloc_try_nid(phys_addr_t size,
> min_addr);
>  }
>  
> +static inline void * __init memblock_virt_alloc_try_nid_raw(
> + phys_addr_t size, phys_addr_t align,
> + phys_addr_t min_addr, phys_addr_t max_addr, int nid)
> +{
> + return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
> + min_addr, max_addr);
> +}
> +
>  static inline void * __init memblock_virt_alloc_try_nid_nopanic(
>   phys_addr_t size, phys_addr_t align,
>   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 08f449acfdd1..3fbf3bcb52d9 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
>   return NULL;
>  done:
>   ptr = phys_to_virt(alloc);
> - memset(ptr, 0, size);
>  
>   /*
>* The min_count is set to 0 so that bootmem allocated blocks
> @@ -1340,6 +1339,38 @@ static void * __init memblock_virt_alloc_internal(
>   return ptr;
>  }
>  
> +/**
> + * memblock_virt_alloc_try_nid_raw - allocate boot memory block without 
> zeroing
> + * memory and without panicking
> + * @size: size of memory block to be allocated in bytes
> + * @align: alignment of the region and block's size
> + * @min_addr: the lower bound of the memory region from where the allocation
> + * is preferred (phys address)
> + * @max_addr: the upper bound of the memory region from where the allocation
> + * is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
> + * allocate only from memory limited by memblock.current_limit value
> + * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
> + *
> + * Public function, provides additional debug information (including caller
> + * info), if enabled. Does not zero allocated memory, does not panic if 
> request
> + * cannot be satisfied.

[GIT PULL] Please pull powerpc/linux.git powerpc-4.13-6 tag

2017-08-11 Thread Michael Ellerman
Hi Linus,

Please pull some more powerpc fixes for 4.13:

The following changes since commit 3db40c312c2c1eb2187c5731102fa8ff380e6e40:

  powerpc/64: Fix __check_irq_replay missing decrementer interrupt (2017-08-04 
12:55:49 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.13-6

for you to fetch changes up to 96ea91e7b6ee2c406598d859e7348b4829404eea:

  powerpc/watchdog: add locking around init/exit functions (2017-08-09 23:45:33 
+1000)


powerpc fixes for 4.13 #6

All fixes for code that went in this cycle.

 - A revert of an optimisation to the syscall exit path, which could lead to an
   oops on either older machines or machines with > 1T of memory.
 - Disable some deep idle states if the firmware configuration for them fails.
 - Re-enable HARD/SOFT lockup detectors in defconfigs after a Kconfig change.
 - Six fairly small patches fixing bugs in our new watchdog code.

Thanks to:
  Gautham R. Shenoy, Nicholas Piggin.


Gautham R. Shenoy (1):
  powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails

Michael Ellerman (2):
  Revert "powerpc/64: Avoid restore_math call if possible in syscall exit"
  powerpc/configs: Re-enable HARD/SOFT lockup detectors

Nicholas Piggin (6):
  powerpc: NMI IPI improve lock primitive
  powerpc/watchdog: Improve watchdog lock primitive
  powerpc/watchdog: Moderate touch_nmi_watchdog overhead
  powerpc/watchdog: Fix final-check recovered case
  powerpc/watchdog: Fix marking of stuck CPUs
  powerpc/watchdog: add locking around init/exit functions

 arch/powerpc/configs/powernv_defconfig |  3 +-
 arch/powerpc/configs/ppc64_defconfig   |  3 +-
 arch/powerpc/configs/pseries_defconfig |  3 +-
 arch/powerpc/kernel/entry_64.S | 60 ++
 arch/powerpc/kernel/process.c  |  4 ---
 arch/powerpc/kernel/smp.c  |  6 ++--
 arch/powerpc/kernel/watchdog.c | 49 +++
 arch/powerpc/platforms/powernv/idle.c  | 41 +--
 drivers/cpuidle/cpuidle-powernv.c  | 10 ++
 9 files changed, 111 insertions(+), 68 deletions(-)


signature.asc
Description: PGP signature


Re: [V11,1/3] powernv: powercap: Add support for powercap framework

2017-08-11 Thread Michael Ellerman
On Thu, 2017-08-10 at 03:31:18 UTC, Shilpasri G Bhat wrote:
> Adds a generic powercap framework to change the system powercap
> inband through OPAL-OCC command/response interface.
> 
> Signed-off-by: Shilpasri G Bhat 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cb8b340de21e1c57e1c6d4f26ccc4a

cheers


Re: [01/12] powerpc/8xx: Simplify CONFIG_8xx checks in Makefile

2017-08-11 Thread Michael Ellerman
On Tue, 2017-08-08 at 11:58:40 UTC, Christophe Leroy wrote:
> The entire 8xx directory is omitted if CONFIG_8xx is not enabled, so
> within the 8xx/Makefile CONFIG_8xx is always y. So convert
> obj-$(CONFIG_8xx) to the more obvious obj-y.
> 
> Signed-off-by: Christophe Leroy 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/0e23e7b32bfdaaa8892d8383114f84

cheers


Re: powerpc/xive: Fix section mismatch warnings

2017-08-11 Thread Michael Ellerman
On Tue, 2017-08-08 at 11:44:14 UTC, Michael Ellerman wrote:
> Both xive_core_init() and xive_native_init() are called from and call
> __init routines, so they should also be __init.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/df4c7983189491302a6000b2dcb14d

cheers


Re: powerpc/mm: Fix section mismatch warning in early_check_vec5()

2017-08-11 Thread Michael Ellerman
On Tue, 2017-08-08 at 11:44:08 UTC, Michael Ellerman wrote:
> early_check_vec5() is called from and calls __init routines, so should
> also be __init.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/7559952e1f6f95091b00352c5ba863

cheers


Re: [1/9] powerpc/47x: Guard 47x cputable entries with CONFIG_PPC_47x

2017-08-11 Thread Michael Ellerman
On Tue, 2017-08-08 at 06:39:17 UTC, Michael Ellerman wrote:
> Currently we build the 47x cputable entries even when CONFIG_PPC_47x is
> disabled. That means a kernel built without CONFIG_PPC_47x will claim to
> support a 47x CPU and start booting, only to break somewhere later
> because it doesn't have 47x support compiled in.
> 
> So guard the 47x cputable entries with CONFIG_PPC_47x. Note that this is
> inside the #ifdef CONFIG_44x section, because 47x depends on 44x.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/13fef7f9da13ab6cc22d456315e887

cheers


Re: powerpc: fix invalid use of register expressions

2017-08-11 Thread Michael Ellerman
On Sat, 2017-08-05 at 17:55:11 UTC, Andreas Schwab wrote:
> binutils >= 2.26 now warns about misuse of register expressions in
> assembler operands that are actually literals, for example:
> 
> arch/powerpc/kernel/entry_64.S:535: Warning: invalid register expression
> 
> Signed-off-by: Andreas Schwab 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8a583c0a8d316d8ea52ea78491174a

cheers


Re: powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-11 Thread Michael Ellerman
On Thu, 2017-08-03 at 04:15:51 UTC, Suraj Jitindar Singh wrote:
> The host process table base is stored in the partition table by calling
> the function native_register_process_table(). Currently this just sets
> the entry in memory and is missing a proceeding cache invalidation
> instruction. Any update to the partition table should be followed by a
> cache invalidation instruction specifying invalidation of the caching of
> any partition table entries (RIC = 2, PRS = 0).
> 
> We already have a function to update the partition table with the
> required cache invalidation instructions - mmu_partition_table_set_entry().
> Update the native_register_process_table() function to call
> mmu_partition_table_set_entry(), this ensures all appropriate
> invalidation will be performed.
> 
> Signed-off-by: Suraj Jitindar Singh 
> Reviewed-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7cd2a8695ef9c31e8f51773f0de9e6

cheers


Re: powerpc: xive: ensure active irqd when setting affinity

2017-08-11 Thread Michael Ellerman
On Thu, 2017-08-03 at 01:38:22 UTC, Sukadev Bhattiprolu wrote:
> From fd0abf5c61b6041fdb75296e8580b86dc91d08d6 Mon Sep 17 00:00:00 2001
> From: Benjamin Herrenschmidt 
> Date: Tue, 1 Aug 2017 20:54:41 -0500
> Subject: [PATCH] powerpc: xive: ensure active irqd when setting affinity
> 
> Ensure irqd is active before attempting to set affinity. This should
> make the set affinity code more robust. For instance, this prevents
> these messages seen on a 4.12 based kernel when taking cpus offline:
> 
>[  123.053037264,3] XIVE[ IC 00  ] ISN 2 lead to invalid IVE !
>[   77.885859] xive: Error -6 reconfiguring irq 17
>[   77.885862] IRQ17: set affinity failed(-6).
> 
> The underlying problem with taking cpus offline was fixed in 4.13-rc1 by:
> 
>commit 91f26cb4cd3c ("genirq/cpuhotplug: Do not migrated shutdown irqs")
> 
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cffb717ceb8e2ca0316e89d908db54

cheers


Re: powerpc/pseries: Check memory device state before onlining/offlining

2017-08-11 Thread Michael Ellerman
On Wed, 2017-08-02 at 18:03:22 UTC, Nathan Fontenot wrote:
> When DLPAR adding or removing memory we need to check the device
> offline status before trying to online/offline the memory. This is
> needed because calls device_online() and device_offline() will return
> non-zero for memory that is already online and offline respectively.
> 
> This update resolves two scenarios. First, for kernel built with
> auto-online memory enabled, memory will be onlined as part of calls
> to add_memory(). After adding the memory the pseries dlpar code tries
> to online it and fails since the memory is already online. The dlpar
> code then tries to remove the memory which produces the oops message
> below because the memory is not offline.
> 
> The second scenario occurs when removing memory that is already offline,
> i.e. marking memory offline (via sysfs) and the trying to remove that
> memory. This doesn't work because offlining the already offline memory
> does not succeed and the dlpar code then fails the dlpar remove operation.
> 
> The fix for both scenarios is to check the device.offline status before
> making the calls to device_online() or device_offline().
> 
> kernel BUG at mm/memory_hotplug.c:2189!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2048
> NUMA
> pSeries
> CPU: 0 PID: 5 Comm: kworker/u129:0 Not tainted 4.12.0-rc3 #272
> Workqueue: pseries hotplug workque .pseries_hp_work_fn
> task: c003f9c89200 task.stack: c003f9d1
> NIP: c02ca428 LR: c02ca3cc CTR: c0ba16a0
> REGS: c003f9d13630 TRAP: 0700   Not tainted  (4.12.0-rc3)
> MSR: 8282b032 
>   CR: 22002024  XER: 000a
> CFAR: c02ca3d0 SOFTE: 1
> GPR00: c02ca3cc c003f9d138b0 c1bb0200 0001
> GPR04: c003fb143c80 c003fef21630 0003 0002
> GPR08: 0003 0003 0003 31b1
> GPR12: 28002042 cfd8 c0118ae0 c003fb170180
> GPR16:  0004 0010 c00379c8
> GPR20: c0037b68 c003f728ff84 0002 0010
> GPR24: 0002 c003f728ff80 0002 0001
> GPR28: c003fb143c38 0002 1000 2000
> NIP [c02ca428] .remove_memory+0xb8/0xc0
> LR [c02ca3cc] .remove_memory+0x5c/0xc0
> Call Trace:
> [c003f9d138b0] [c02ca3cc] .remove_memory+0x5c/0xc0 (unreliable)
> [c003f9d13940] [c00938a4] .dlpar_add_lmb+0x384/0x400
> [c003f9d13a30] [c009456c] .dlpar_memory+0x5dc/0xca0
> [c003f9d13af0] [c008ce84] .handle_dlpar_errorlog+0x74/0xe0
> [c003f9d13b70] [c008cf1c] .pseries_hp_work_fn+0x2c/0x90
> [c003f9d13bf0] [c0110a5c] .process_one_work+0x17c/0x460
> [c003f9d13c90] [c0110dc8] .worker_thread+0x88/0x500
> [c003f9d13d70] [c0118c3c] .kthread+0x15c/0x1a0
> [c003f9d13e30] [c000ba18] .ret_from_kernel_thread+0x58/0xc0
> Instruction dump:
> 7fe3fb78 4bd7c845 6000 7fa3eb78 4bfdd3c9 38210090 e8010010 eba1ffe8
> ebc1fff0 ebe1fff8 7c0803a6 4bfdc2ac <0fe0>  7c0802a6 fb01ffc0
> 
> Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged 
> memory'")
> Signed-off-by: Nathan Fontenot 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/1a367063ca0c1c6f6f54b5abd7b483

cheers


Re: [v2,1/4] powerpc/64s: fix mce accounting for powernv

2017-08-11 Thread Michael Ellerman
On Tue, 2017-08-01 at 12:00:51 UTC, Nicholas Piggin wrote:
> ---
>  arch/powerpc/kernel/traps.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index bfcfd9ef09f2..5adfea2dc822 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -755,7 +755,14 @@ void machine_check_exception(struct pt_regs *regs)
>   enum ctx_state prev_state = exception_enter();
>   int recover = 0;
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> + /* 64s accounts the mce in machine_check_early when in HVMODE */
> + if (!cpu_has_feature(CPU_FTR_HVMODE))
> + __this_cpu_inc(irq_stat.mce_exceptions);
> +#else
>   __this_cpu_inc(irq_stat.mce_exceptions);
> +#endif
> +
>  
>   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>  

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f886f0f6e0e20d53dc36421c2ee83f

cheers


Re: powerpc/perf: Add PM_LD_MISS_L1 and PM_BR_2PATH to power9 event list

2017-08-11 Thread Michael Ellerman
On Mon, 2017-07-31 at 09:33:21 UTC, Madhavan Srinivasan wrote:
> Add couple of more events (PM_LD_MISS_L1 and PM_BR_2PATH) to
> power9 event list and power9_event_alternatives array (these
> events can be counted in more than one PMC).
> 
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/91e0bd1e62519bdb50e35775ec37b2

cheers


Re: powerpc/perf: Factor out PPMU_ONLY_COUNT_RUN check code from power8

2017-08-11 Thread Michael Ellerman
On Mon, 2017-07-31 at 08:02:41 UTC, Madhavan Srinivasan wrote:
> There are some hardware events on Power systems which only
> count when the processor is not idle, and there are some
> fixed-function counters which count such events. For example,
> the "run cycles" event counts cycles when the processor is
> not idle. If the user asks to count cycles, we can use
> "run cycles" if this is a per-task event, since the processor
> is running when the task is running, by definition. We can't
> use "run cycles" if the user asks for "cycles" on a system-wide
> counter.
> 
> Currently in power8 this check is done using PPMU_ONLY_COUNT_RUN
> flag in power8_get_alternatives() function. Based on the
> flag, events are switched if needed. This function should
> also be enabled in power9, so factor out the code to
> isa207_get_alternatives().
> 
> Fixes: efe881afdd999 ('powerpc/perf: Factor out event_alternative function')
> Reported-by: Anton Blanchard 
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/70a7e720998d5beaf0c8abd945234e

cheers


Re: [v4,1/5] powerpc/lib/sstep: Add cmpb instruction emulation

2017-08-11 Thread Michael Ellerman
On Mon, 2017-07-31 at 00:58:22 UTC, Matt Brown wrote:
> This patch adds emulation of the cmpb instruction, enabling xmon to
> emulate this instruction.
> Tested for correctness against the cmpb asm instruction on ppc64le.
> 
> Signed-off-by: Matt Brown 
> Reviewed-by: Cyril Bur 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/02c0f62a60b67d6c9bfe9429cbe3aa

cheers


Re: [v2,2/2] 44x/fsp2: enable eMMC arasan for fsp2 platform

2017-08-11 Thread Michael Ellerman
On Tue, 2017-07-25 at 11:40:04 UTC, Ivan Mikhaylov wrote:
> Add mmc0 changes for enabling arasan emmc and change
> defconfig appropriately.
> 
> Signed-off-by: Ivan Mikhaylov 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/754f030908c3615781e9e3559d8ba1

cheers


Re: powerpc/perf: Update default sdar_mode value for power9

2017-08-11 Thread Michael Ellerman
On Tue, 2017-07-25 at 05:35:51 UTC, Madhavan Srinivasan wrote:
> Commit 20dd4c624d251 ('powerpc/perf: Fix SDAR_MODE value for continous
> sampling on Power9') set the default sdar_mode value in MMCRA[SDAR_MODE]
> to be used as 0b01 (Update on TLB miss). And this value is set if sdar_mode
> from event is zero, or we are in continous sampling mode in power9 dd1.
> 
> But it is preferred to have the sdar_mode value for power9 as
> 0b10 (Update on dcache miss) for better sampling updates instead
> of 0b01 (Update on TLB miss).
> 
> Signed-off-by: Madhavan Srinivasan 
> Acked-by: Anton Blanchard 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7aa345d84245a75760fc35a385fc55

cheers


Re: powerpc/pseries: energy driver only print message when LPAR guest

2017-08-11 Thread Michael Ellerman
On Fri, 2017-07-21 at 01:16:44 UTC, Nicholas Piggin wrote:
> On Thu, 20 Jul 2017 23:03:21 +1000
> Michael Ellerman  wrote:
> 
> > Nicholas Piggin  writes:
> > 
> > > This driver currently reports the H_BEST_ENERGY is unsupported even
> > > when booting in a non-LPAR environment (e.g., powernv). Prevent it.  
> > 
> > Just delete the printk(). Users don't know what that means, and
> > developers have other better ways to detect that the hcall is missing if
> > anyone cares.
> > 
> > cheers
> 
> powerpc/pseries: energy driver do not print failure message
> 
> This driver currently reports the H_BEST_ENERGY is unsupported (even
> when booting in a non-LPAR environment). This is not something the
> administrator can do much with, and not significant for debugging.
> 
> Remove it.
> 
> Signed-off-by: Nicholas Piggin 
> Reviewed-by: Vaidyanathan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a70a0b9f4404d8edb72ca0e0272731

cheers


Re: [1/2] powerpc/perf: Cleanup of PM_BR_CMPL vs. PM_BRU_CMPL in power9 event list

2017-08-11 Thread Michael Ellerman
On Mon, 2017-01-09 at 13:30:14 UTC, Madhavan Srinivasan wrote:
> Fixes:34922527a2bcb ('powerpc/perf: Add power9 event list macros for generic 
> and cache events')
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/93fc5ca9a0048c17b47582051940bf

cheers


Re: [PATCH 3/6] powerpc/mm: Ensure cpumask update is ordered

2017-08-11 Thread Nicholas Piggin
On Mon, 24 Jul 2017 21:20:07 +1000
Nicholas Piggin  wrote:

> On Mon, 24 Jul 2017 14:28:00 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > There is no guarantee that the various isync's involved with
> > the context switch will order the update of the CPU mask with
> > the first TLB entry for the new context being loaded by the HW.
> > 
> > Be safe here and add a memory barrier to order any subsequent
> > load/store which may bring entries into the TLB.
> > 
> > The corresponding barrier on the other side already exists as
> > pte updates use pte_xchg() which uses __cmpxchg_u64 which has
> > a sync after the atomic operation.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  arch/powerpc/include/asm/mmu_context.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> > b/arch/powerpc/include/asm/mmu_context.h
> > index ed9a36ee3107..ff1aeb2cd19f 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -110,6 +110,7 @@ static inline void switch_mm_irqs_off(struct mm_struct 
> > *prev,
> > /* Mark this context has been used on the new CPU */
> > if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
> > cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > +   smp_mb();
> > new_on_cpu = true;
> > }
> >
> 
> I think this is the right thing to do, but it should be commented.
> Is hwsync the right barrier? (i.e., it will order the page table walk)

After some offline discussion, I think we have an agreement that
this is the right barrier, as it orders with the subsequent load
of next->context.id that the mtpid depends on (or slbmte for HPT).

So we should have a comment here to that effect, including the pte_xchg
comments from your changelog. Some comment (at least referring back to
here) should be added at pte_xchg too, please.
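
Something along these lines, as a sketch of the comment being asked for
(the code is unchanged from the patch; only the comment is new):

        if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
                cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
                /*
                 * Order the above cpumask store vs the subsequent load of
                 * next->context.id that feeds the mtpid (or slbmte on hash).
                 * Pairs with the sync after the atomic in pte_xchg()'s
                 * __cmpxchg_u64() on the PTE update side.
                 */
                smp_mb();
                new_on_cpu = true;
        }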

Other than that your series seems good to me if you repost it you
can add

Reviewed-by: Nicholas Piggin 

This one out of the series is the bugfix so it should go to stable
as well, right?

Thanks,
Nick


Re: [RFC v7 24/25] powerpc: Deliver SEGV signal on pkey violation

2017-08-11 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> Ram Pai  writes:
>
>> The value of the AMR register at the time of exception
>> is made available in gp_regs[PT_AMR] of the siginfo.
>>
>> The value of the pkey, whose protection got violated,
>> is made available in si_pkey field of the siginfo structure.
>
> Should the IAMR also be made available?
>
> Also, should the AMR and IAMR be accesible to userspace (e.g., to GDB)
> via ptrace and the core file?

Yes if they're part of the thread's context they should be accessible
via ptrace and in core files.

>> --- a/arch/powerpc/kernel/signal_32.c
>> +++ b/arch/powerpc/kernel/signal_32.c
>> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
>> mcontext __user *frame,
>> (unsigned long) &frame->tramp[2]);
>>  }
>>
>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>> +if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
>> +return 1;
>> +#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>> +
>>  return 0;
>>  }
>
> frame->mc_gregs[PT_AMR] has 32 bits, but paca_amr has 64 bits. Does this
> work as intended?

I don't understand why we are putting it in there at all?

Is there some special handling of the actual register on signals? I
haven't seen it. In which case the process can get the value of AMR by
reading the register. ??

cheers


Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Mel Gorman
On Fri, Aug 11, 2017 at 11:32:49AM +0200, Michal Hocko wrote:
> > Signed-off-by: Pavel Tatashin 
> > Reviewed-by: Steven Sistare 
> > Reviewed-by: Daniel Jordan 
> > Reviewed-by: Bob Picco 
> 
> Considering that some HW might behave strangely and this would be rather
> hard to debug I would be tempted to mark this for stable. It should also
> be merged separately from the rest of the series.
> 
> I have just one nit below
> Acked-by: Michal Hocko 
> 

Agreed.

-- 
Mel Gorman
SUSE Labs


[PATCH] rtc: rtctest: Improve support detection

2017-08-11 Thread Lukáš Doktor
The rtc-generic and opal-rtc drivers fail to run this test as they do not
support all the features. Let's handle these error returns and skip to the
following test.

Theoretically test_DATE should also be adjusted, but as it's enabled
on demand I think it makes sense to fail in such a case.
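
A possible follow-up cleanup, sketched below (the helper name is
hypothetical; it assumes rtctest.c's existing includes), would consolidate
the repeated skip-on-unsupported pattern:

/* Returns nonzero when the ioctl is unsupported and the caller should
 * skip the current test; exits on any other error. */
static int ioctl_or_skip(int fd, unsigned long request, void *arg,
                         const char *what)
{
        int retval = ioctl(fd, request, arg);

        if (retval != -1)
                return 0;
        if (errno == EINVAL || errno == EIO) {
                fprintf(stderr, "\n...%s not supported.\n", what);
                return 1;
        }
        perror(what);
        exit(errno);
}

Each call site would then read, e.g.,
"if (ioctl_or_skip(fd, RTC_UIE_ON, NULL, "Update IRQs")) goto test_READ;".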

Signed-off-by: Lukáš Doktor 
---
 tools/testing/selftests/timers/rtctest.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/timers/rtctest.c 
b/tools/testing/selftests/timers/rtctest.c
index f61170f..6344842 100644
--- a/tools/testing/selftests/timers/rtctest.c
+++ b/tools/testing/selftests/timers/rtctest.c
@@ -125,7 +125,7 @@ int main(int argc, char **argv)
/* Turn on update interrupts (one per second) */
retval = ioctl(fd, RTC_UIE_ON, 0);
if (retval == -1) {
-   if (errno == EINVAL) {
+   if (errno == EINVAL || errno == EIO) {
fprintf(stderr,
"\n...Update IRQs not supported.\n");
goto test_READ;
@@ -221,6 +221,11 @@ int main(int argc, char **argv)
/* Read the current alarm settings */
retval = ioctl(fd, RTC_ALM_READ, &rtc_tm);
if (retval == -1) {
+   if (errno == EINVAL) {
+   fprintf(stderr,
+   "\n...EINVAL reading current alarm 
setting.\n");
+   goto test_PIE;
+   }
perror("RTC_ALM_READ ioctl");
exit(errno);
}
@@ -231,7 +236,7 @@ int main(int argc, char **argv)
/* Enable alarm interrupts */
retval = ioctl(fd, RTC_AIE_ON, 0);
if (retval == -1) {
-   if (errno == EINVAL) {
+   if (errno == EINVAL || errno == EIO) {
fprintf(stderr,
"\n...Alarm IRQs not supported.\n");
goto test_PIE;
-- 
2.9.4



[PATCH 0/1] rtc: rtctest: Support opal-rtc and rtc-generic

2017-08-11 Thread Lukáš Doktor
On ppc64le machines opal-rtc (or rtc-generic in guests) is used. These only
support a minimal set of functionality and fail this test in a way that is
not yet handled. This extends the checks and skips to the next test when a
feature is not supported.

Lukáš Doktor (1):
  rtc: rtctest: Improve support detection

 tools/testing/selftests/timers/rtctest.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

-- 
2.9.4



Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
> In deferred_init_memmap() where all deferred struct pages are initialized
> we have a check like this:
> 
> if (page->flags) {
> VM_BUG_ON(page_zone(page) != zone);
> goto free_range;
> }
> 
> This way we are checking if the current deferred page has already been
> initialized. It works, because memory for struct pages has been zeroed, and
> the only way flags can be non-zero is if the page went through
> __init_single_page() before.  But, once we change the current behavior and
> no longer zero the memory in the memblock allocator, we cannot trust
> anything inside struct pages until they are initialized. This patch fixes this.
> 
> This patch defines a new accessor memblock_get_reserved_pfn_range()
> which returns successive ranges of reserved PFNs.  deferred_init_memmap()
> calls it to determine if a PFN and its struct page has already been
> initialized.

Why don't we simply check the pfn against pgdat->first_deferred_pfn?
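
That is, something as simple as this sketch (kernel context and
CONFIG_DEFERRED_STRUCT_PAGE_INIT assumed; the helper name is hypothetical):

static inline bool early_page_initialized(pg_data_t *pgdat, unsigned long pfn)
{
        /* everything below first_deferred_pfn was initialized at boot */
        return pfn < pgdat->first_deferred_pfn;
}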

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  include/linux/memblock.h |  3 +++
>  mm/memblock.c| 54 
> ++--
>  mm/page_alloc.c  | 11 +-
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7e7bf3..b6a2a610f5e1 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -320,6 +320,9 @@ int memblock_is_map_memory(phys_addr_t addr);
>  int memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
>  bool memblock_is_reserved(phys_addr_t addr);
>  bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
> +void memblock_get_reserved_pfn_range(unsigned long pfn,
> +  unsigned long *pfn_start,
> +  unsigned long *pfn_end);
>  
>  extern void __memblock_dump_all(void);
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index bf14aea6ab70..08f449acfdd1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1580,7 +1580,13 @@ void __init memblock_mem_limit_remove_map(phys_addr_t 
> limit)
>   memblock_cap_memory_range(0, max_addr);
>  }
>  
> -static int __init_memblock memblock_search(struct memblock_type *type, 
> phys_addr_t addr)
> +/**
> + * Return index in regions array if addr is within the region. Otherwise
> + * return -1. If -1 is returned and *next_idx is not %NULL, sets it to the
> + * next region index or -1 if there is none.
> + */
> +static int __init_memblock memblock_search(struct memblock_type *type,
> +phys_addr_t addr, int *next_idx)
>  {
>   unsigned int left = 0, right = type->cnt;
>  
> @@ -1595,22 +1601,26 @@ static int __init_memblock memblock_search(struct 
> memblock_type *type, phys_addr
>   else
>   return mid;
>   } while (left < right);
> +
> + if (next_idx)
> + *next_idx = (right == type->cnt) ? -1 : right;
> +
>   return -1;
>  }
>  
>  bool __init memblock_is_reserved(phys_addr_t addr)
>  {
> - return memblock_search(&memblock.reserved, addr) != -1;
> + return memblock_search(&memblock.reserved, addr, NULL) != -1;
>  }
>  
>  bool __init_memblock memblock_is_memory(phys_addr_t addr)
>  {
> - return memblock_search(&memblock.memory, addr) != -1;
> + return memblock_search(&memblock.memory, addr, NULL) != -1;
>  }
>  
>  int __init_memblock memblock_is_map_memory(phys_addr_t addr)
>  {
> - int i = memblock_search(&memblock.memory, addr);
> + int i = memblock_search(&memblock.memory, addr, NULL);
>  
>   if (i == -1)
>   return false;
> @@ -1622,7 +1632,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned 
> long pfn,
>unsigned long *start_pfn, unsigned long *end_pfn)
>  {
>   struct memblock_type *type = &memblock.memory;
> - int mid = memblock_search(type, PFN_PHYS(pfn));
> + int mid = memblock_search(type, PFN_PHYS(pfn), NULL);
>  
>   if (mid == -1)
>   return -1;
> @@ -1646,7 +1656,7 @@ int __init_memblock memblock_search_pfn_nid(unsigned 
> long pfn,
>   */
>  int __init_memblock memblock_is_region_memory(phys_addr_t base, phys_addr_t 
> size)
>  {
> - int idx = memblock_search(&memblock.memory, base);
> + int idx = memblock_search(&memblock.memory, base, NULL);
>   phys_addr_t end = base + memblock_cap_size(base, &size);
>  
>   if (idx == -1)
> @@ -1655,6 +1665,38 @@ int __init_memblock 
> memblock_is_region_memory(phys_addr_t base, phys_addr_t size
>memblock.memory.regions[idx].size) >= end;
>  }
>  
> +/**
> + * memblock_get_reserved_pfn_range - search for the next reserved region
> + *
> + * @pfn: start searching from this pfn.
> + *
> + * RETURNS:
> + * [start_pfn, end_pfn), where start_pfn >= pfn. If none is found
> + * start_pfn, and end_pfn are both set to 

Re: [v6 04/15] mm: discard memblock data later

2017-08-11 Thread Michal Hocko
[CC Mel]

On Mon 07-08-17 16:38:38, Pavel Tatashin wrote:
> There is existing use after free bug when deferred struct pages are
> enabled:
> 
> The memblock_add() allocates memory for the memory array if more than
> 128 entries are needed.  See comment in e820__memblock_setup():
> 
>   * The bootstrap memblock region count maximum is 128 entries
>   * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
>   * than that - so allow memblock resizing.
> 
> This memblock memory is freed here:
> free_low_memory_core_early()
> 
> We access the freed memblock.memory later in boot when deferred pages are
> initialized in this path:
> 
> deferred_init_memmap()
> for_each_mem_pfn_range()
>   __next_mem_pfn_range()
> type = &memblock.memory

Yes you seem to be right.
>
> One possible explanation for why this use-after-free hasn't been hit
> before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded
> at least on systems where deferred struct pages were enabled.

Yeah this sounds like the case.
 
> Another reason why we want this problem fixed in this patch series is,
> in the next patch, we will need to access memblock.reserved from
> deferred_init_memmap().
> 

I guess this goes all the way down to 
Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in 
parallel with kswapd")
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

Considering that some HW might behave strangely and this would be rather
hard to debug I would be tempted to mark this for stable. It should also
be merged separately from the rest of the series.

I have just one nit below
Acked-by: Michal Hocko 

[...]
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 2cb25fe4452c..bf14aea6ab70 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -285,31 +285,27 @@ static void __init_memblock 
> memblock_remove_region(struct memblock_type *type, u
>  }
>  
>  #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK

pull this ifdef inside memblock_discard and you do not have an another
one in page_alloc_init_late

[...]
> +/**
> + * Discard memory and reserved arrays if they were allocated
> + */
> +void __init memblock_discard(void)
>  {

here

> - if (memblock.memory.regions == memblock_memory_init_regions)
> - return 0;
> + phys_addr_t addr, size;
>  
> - *addr = __pa(memblock.memory.regions);
> + if (memblock.reserved.regions != memblock_reserved_init_regions) {
> + addr = __pa(memblock.reserved.regions);
> + size = PAGE_ALIGN(sizeof(struct memblock_region) *
> +   memblock.reserved.max);
> + __memblock_free_late(addr, size);
> + }
>  
> - return PAGE_ALIGN(sizeof(struct memblock_region) *
> -   memblock.memory.max);
> + if (memblock.memory.regions != memblock_memory_init_regions) {
> + addr = __pa(memblock.memory.regions);
> + size = PAGE_ALIGN(sizeof(struct memblock_region) *
> +   memblock.memory.max);
> + __memblock_free_late(addr, size);
> + }
>  }
> -
>  #endif
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fc32aa81f359..63d16c185736 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1584,6 +1584,10 @@ void __init page_alloc_init_late(void)
>   /* Reinit limits that are based on free pages after the kernel is up */
>   files_maxfiles_init();
>  #endif
> +#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
> + /* Discard memblock private memory */
> + memblock_discard();
> +#endif
>  
>   for_each_populated_zone(zone)
>   set_zone_contiguous(zone);
> -- 
> 2.14.0
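
The nit above would look roughly like this (a sketch only, also using the
"!=" test for the memory array so that only a dynamically allocated array
is ever freed):

void __init memblock_discard(void)
{
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
        phys_addr_t addr, size;

        if (memblock.reserved.regions != memblock_reserved_init_regions) {
                addr = __pa(memblock.reserved.regions);
                size = PAGE_ALIGN(sizeof(struct memblock_region) *
                                  memblock.reserved.max);
                __memblock_free_late(addr, size);
        }

        if (memblock.memory.regions != memblock_memory_init_regions) {
                addr = __pa(memblock.memory.regions);
                size = PAGE_ALIGN(sizeof(struct memblock_region) *
                                  memblock.memory.max);
                __memblock_free_late(addr, size);
        }
#endif
}

With the #ifdef inside the function, page_alloc_init_late() can then call it
unconditionally.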

-- 
Michal Hocko
SUSE Labs


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-11 Thread Michal Hocko
[CC Mel - the full series is here
http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatas...@oracle.com]

On Mon 07-08-17 16:38:36, Pavel Tatashin wrote:
> Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed prior to first
> initializing struct pages by going through __init_single_page().
> 
> With deferred struct page feature enabled there is a case where we set some
> fields prior to initializing:
> 
> mem_init() {
> register_page_bootmem_info();
> free_all_bootmem();
> ...
> }
> 
> When register_page_bootmem_info() is called only non-deferred struct pages
> are initialized. But, this function goes through some reserved pages which
> might be part of the deferred, and thus are not yet initialized.
> 
>   mem_init
>register_page_bootmem_info
> register_page_bootmem_info_node
>  get_page_bootmem
>   .. setting fields here ..
>   such as: page->freelist = (void *)type;
> 
> We end up with a similar issue as in the previous patch, where currently we
> do not observe the problem as memory is zeroed. But, if flag asserts are
> changed we can start hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.

I have to confess that this part of the early struct page initialization
is not my strongest point and I always have to re-read the code from
scratch, but I really do not understand what you are trying to achieve
here.

AFAIU register_page_bootmem_info_node is only about struct pages backing
pgdat, usemap and memmap. Those should be in reserved memblocks and we
do not initialize those at later times, they are not relevant to the
deferred initialization as your changelog suggests so the ordering with
get_page_bootmem shouldn't matter. Or am I missing something here?
 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  arch/x86/mm/init_64.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 136422d7d539..1e863baec847 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1165,12 +1165,17 @@ void __init mem_init(void)
>  
>   /* clear_bss() already clear the empty_zero_page */
>  
> - register_page_bootmem_info();
> -
>   /* this will put all memory onto the freelists */
>   free_all_bootmem();
>   after_bootmem = 1;
>  
> + /* Must be done after boot memory is put on freelist, because here we
> +  * might set fields in deferred struct pages that have not yet been
> +  * initialized, and free_all_bootmem() initializes all the reserved
> +  * deferred pages for us.
> +  */
> + register_page_bootmem_info();
> +
>   /* Register memory areas for /proc/kcore */
>   kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR,
>PAGE_SIZE, KCORE_OTHER);
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-11 Thread Aneesh Kumar K.V
Bharata B Rao  writes:

> For a PowerKVM guest, it is possible to specify a DIMM device in
> addition to the system RAM at boot time. When such a cold plugged DIMM
> device is removed from a radix guest, we hit the following warning in the
> guest kernel resulting in the eventual failure of memory unplug:
>
> remove_pud_table: unaligned range
> WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 
> remove_pagetable+0x468/0xca0
> Call Trace:
> remove_pagetable+0x464/0xca0 (unreliable)
> radix__remove_section_mapping+0x24/0x40
> remove_section_mapping+0x28/0x60
> arch_remove_memory+0xcc/0x120
> remove_memory+0x1ac/0x270
> dlpar_remove_lmb+0x1ac/0x210
> dlpar_memory+0xbc4/0xeb0
> pseries_hp_work_fn+0x1a4/0x230
> process_one_work+0x1cc/0x660
> worker_thread+0xac/0x6d0
> kthread+0x16c/0x1b0
> ret_from_kernel_thread+0x5c/0x74
>
> The DIMM memory that is cold plugged gets merged to the same memblock
> region as RAM and hence gets mapped at 1G alignment. However since the
> removal is done for one LMB (lmb size 256MB) at a time, the address
> of the LMB (which is 256MB aligned) would get flagged as unaligned
> in remove_pud_table() resulting in the above failure.
>
> This problem is not seen for hot plugged memory because for the
> hot plugged memory, the mappings are created separately for each
> LMB and hence they all get aligned at 256MB.
>
> To fix this problem for the cold plugged memory, let us mark the
> cold plugged memblock region explicitly as HOTPLUGGED so that the
> region doesn't get merged with RAM. All the memory that is discovered
> via ibm,dynamic-memory-configuration is marked so (1). Next, identify
> such regions in radix_init_pgtable() and create separate mappings
> within that region for each LMB so that they don't get aligned
> like the RAM region at 1G (2).
>
> (1) For PowerKVM guests, all boot time memory is represented via
> memory@ nodes and hot plugged/pluggable memory is represented via
> ibm,dynamic-memory-reconfiguration property. We are marking all
> hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
> With this only cold plugged memory gets marked for PowerKVM but
> need to check how this will affect PowerVM guests.

Can you verify this on PowerVM too? That is, we should in most cases not find
anything under ibm,dynamic-memory-reconfiguration?
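
For reference, the per-LMB mapping idea in (2) would look roughly like the
sketch below (lmb_size is a hypothetical variable that would come from the
device tree; create_physical_mapping() is the existing radix helper):

static void __init radix_map_memblocks(unsigned long lmb_size)
{
        struct memblock_region *reg;
        unsigned long addr;

        for_each_memblock(memory, reg) {
                if (!memblock_is_hotpluggable(reg)) {
                        /* boot RAM: one mapping, largest page size wins */
                        create_physical_mapping(reg->base,
                                                reg->base + reg->size);
                        continue;
                }
                /* cold plugged DIMM: map one 256MB LMB at a time so each
                 * LMB keeps its own mapping and can be unmapped alone */
                for (addr = reg->base; addr < reg->base + reg->size;
                     addr += lmb_size)
                        create_physical_mapping(addr, addr + lmb_size);
        }
}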


-aneesh



RE: [PATCH net-next] fsl/fman: implement several errata workarounds

2017-08-11 Thread Madalin-cristian Bucur
> -Original Message-
> From: Florinel Iordache [mailto:florinel.iorda...@nxp.com]
> Subject: [PATCH net-next] fsl/fman: implement several errata workarounds
> 
> Implemented workarounds for the following dTSEC Erratum:
> A002, A004, A0012, A0014, A004839 on several operations
> that involve MAC CFG register changes: adjust link,
> rx pause frames, modify MAC address.
> 
> Signed-off-by: Florinel Iordache 

Acked-by: Madalin Bucur 


Re: [FIX PATCH v0] powerpc: Fix memory unplug failure on radix guest

2017-08-11 Thread Aneesh Kumar K.V
Reza Arbab  writes:

> On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>>diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>>index f830562..24ecf53 100644
>>--- a/arch/powerpc/kernel/prom.c
>>+++ b/arch/powerpc/kernel/prom.c
>>@@ -524,6 +524,7 @@ static int __init 
>>early_init_dt_scan_drconf_memory(unsigned long node)
>>  size = 0x8000ul - base;
>>  }
>>  memblock_add(base, size);
>>+ memblock_mark_hotplug(base, size);
>>  } while (--rngs);
>>  }
>>  memblock_dump_all();
>
> Doing this has the effect of putting all the affected memory into 
> ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
> kernel allocations can occur there. Is that okay?
>

So the thinking here is that any memory identified via ibm,dynamic-memory can
be hot removed later. Hence the need to add them at LMB granularity, because
our hotplug framework removes them one LMB at a time. If we want to support
hot unplug, then we will have to make sure kernel allocations don't happen in
that region, right?

With the above, I would consider not marking it hotplug to have been a bug
before?

-aneesh



[PATCH] powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE

2017-08-11 Thread Santosh Sivaraj
The current vDSO64 implementation does not have support for coarse clocks
(CLOCK_MONOTONIC_COARSE, CLOCK_REALTIME_COARSE), for which it falls back
to the system call, increasing the response time; a vDSO implementation
reduces the cycle time. Below is a benchmark of the difference in execution
time with and without vDSO support.

(Non-coarse clocks are also included just for completeness)

Without vDSO support:

clock-gettime-realtime: syscall: 172 nsec/call
clock-gettime-realtime:libc: 26 nsec/call
clock-gettime-realtime:vdso: 21 nsec/call
clock-gettime-monotonic: syscall: 170 nsec/call
clock-gettime-monotonic:libc: 30 nsec/call
clock-gettime-monotonic:vdso: 24 nsec/call
clock-gettime-realtime-coarse: syscall: 153 nsec/call
clock-gettime-realtime-coarse:libc: 15 nsec/call
clock-gettime-realtime-coarse:vdso: 9 nsec/call
clock-gettime-monotonic-coarse: syscall: 167 nsec/call
clock-gettime-monotonic-coarse:libc: 15 nsec/call
clock-gettime-monotonic-coarse:vdso: 11 nsec/call
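
The coarse fast path below is essentially a lock-free sequence-count read of
the vDSO data page. A minimal userspace model of the logic (structure and
field names hypothetical; the real assembly enforces ordering with a data
dependency via the xor/add trick rather than explicit barriers):

#include <stdio.h>

struct vdso_data {      /* stand-in for the real vDSO data page */
        unsigned long tb_update_count;
        long stamp_sec, stamp_nsec;     /* STAMP_XTIME */
        long wtom_sec, wtom_nsec;       /* wall-to-monotonic offset */
};

static void coarse_read(const struct vdso_data *d, int monotonic,
                        long *sec, long *nsec)
{
        unsigned long seq;

        do {
                do {
                        seq = d->tb_update_count;
                } while (seq & 1);              /* writer mid-update: wait */

                *sec = d->stamp_sec;
                *nsec = d->stamp_nsec;
                if (monotonic) {
                        *sec += d->wtom_sec;
                        *nsec += d->wtom_nsec;
                }
        } while (d->tb_update_count != seq);    /* raced with writer: retry */
}

int main(void)
{
        struct vdso_data d = { 2, 100, 500, 3, 7 };
        long s, ns;

        coarse_read(&d, 1, &s, &ns);
        printf("%ld.%09ld\n", s, ns);   /* prints 103.000000507 */
        return 0;
}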

CC: Benjamin Herrenschmidt 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/kernel/asm-offsets.c |  2 +
 arch/powerpc/kernel/vdso64/gettimeofday.S | 73 ---
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 6e95c2c..c6acaa5 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -396,6 +396,8 @@ int main(void)
/* Other bits used by the vdso */
DEFINE(CLOCK_REALTIME, CLOCK_REALTIME);
DEFINE(CLOCK_MONOTONIC, CLOCK_MONOTONIC);
+   DEFINE(CLOCK_REALTIME_COARSE, CLOCK_REALTIME_COARSE);
+   DEFINE(CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_COARSE);
DEFINE(NSEC_PER_SEC, NSEC_PER_SEC);
DEFINE(CLOCK_REALTIME_RES, MONOTONIC_RES_NSEC);
 
diff --git a/arch/powerpc/kernel/vdso64/gettimeofday.S 
b/arch/powerpc/kernel/vdso64/gettimeofday.S
index 3820213..5229d1e 100644
--- a/arch/powerpc/kernel/vdso64/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso64/gettimeofday.S
@@ -60,19 +60,26 @@ V_FUNCTION_END(__kernel_gettimeofday)
  */
 V_FUNCTION_BEGIN(__kernel_clock_gettime)
   .cfi_startproc
+   mr  r11,r4  /* r11 saves tp */
+   mflrr12 /* r12 saves lr */
+   lis r7,NSEC_PER_SEC@h   /* want nanoseconds */
+   ori r7,r7,NSEC_PER_SEC@l
+
/* Check for supported clock IDs */
cmpwi   cr0,r3,CLOCK_REALTIME
cmpwi   cr1,r3,CLOCK_MONOTONIC
crorcr0*4+eq,cr0*4+eq,cr1*4+eq
-   bne cr0,99f
+   beq cr0,50f
 
-   mflrr12 /* r12 saves lr */
+   cmpwi   cr0,r3,CLOCK_REALTIME_COARSE
+   cmpwi   cr1,r3,CLOCK_MONOTONIC_COARSE
+   crorcr0*4+eq,cr0*4+eq,cr1*4+eq
+   beq cr0,65f
+
+   b   99f /* Fallback to syscall */
   .cfi_register lr,r12
-   mr  r11,r4  /* r11 saves tp */
-   bl  V_LOCAL_FUNC(__get_datapage)/* get data page */
-   lis r7,NSEC_PER_SEC@h   /* want nanoseconds */
-   ori r7,r7,NSEC_PER_SEC@l
-50:bl  V_LOCAL_FUNC(__do_get_tspec)/* get time from tb & kernel */
+50:bl  V_LOCAL_FUNC(__get_datapage)/* get data page */
+   bl  V_LOCAL_FUNC(__do_get_tspec)/* get time from tb & kernel */
bne cr1,80f /* if not monotonic, all done */
 
/*
@@ -110,6 +117,58 @@ V_FUNCTION_BEGIN(__kernel_clock_gettime)
 1: bge cr1,80f
addir4,r4,-1
add r5,r5,r7
+   b   80f
+
+   /*
+* For coarse clocks we get data directly from the vdso data page, so
+* we don't need to call __do_get_tspec, but we still need to do the
+* counter trick.
+*/
+65:bl  V_LOCAL_FUNC(__get_datapage)/* get data page */
+70:ld  r8,CFG_TB_UPDATE_COUNT(r3)
+   andi.   r0,r8,1 /* pending update ? loop */
+   bne-70b
+   xor r0,r8,r8/* create dependency */
+   add r3,r3,r0
+
+   /*
+* CLOCK_REALTIME_COARSE, below values are needed for MONOTONIC_COARSE
+* too
+*/
+   ld  r4,STAMP_XTIME+TSPC64_TV_SEC(r3)
+   ld  r5,STAMP_XTIME+TSPC64_TV_NSEC(r3)
+   bne cr1,78f
+
+   /* CLOCK_MONOTONIC_COARSE */
+   lwa r6,WTOM_CLOCK_SEC(r3)
+   lwa r9,WTOM_CLOCK_NSEC(r3)
+
+   /* check if counter has updated */
+78:or  r0,r6,r9
+   xor r0,r0,r0
+   add r3,r3,r0
+   ld  r0,CFG_TB_UPDATE_COUNT(r3)
+   cmpld   cr0,r0,r8   /* check if updated */
+   bne-70b
+
+   /* Counter has not updated, so continue calculating proper values for
+* sec and nsec if monotonic coarse, or just return with the proper
+* values for realtime.
+*/
+  

[PATCH kernel] PCI: Disable IOV before pcibios_sriov_disable()

2017-08-11 Thread Alexey Kardashevskiy
From: Gavin Shan 

The PowerNV platform is the only user of pcibios_sriov_disable().
The IOV BAR could be shifted by pci_iov_update_resource(). The
warning message in the function is printed if the IOV capability
is in the enabled (PCI_SRIOV_CTRL_VFE && PCI_SRIOV_CTRL_MSE) state.

This is the backtrace of what is happening:
   pci_disable_sriov
   sriov_disable
   pnv_pci_sriov_disable
   pnv_pci_vf_resource_shift
   pci_update_resource
   pci_iov_update_resource

This fixes the issue by disabling IOV capability before calling
pcibios_sriov_disable(). With it, the disabling path matches
the enabling path: pcibios_sriov_enable() is called before the
IOV capability is enabled.

Cc: shan.ga...@gmail.com
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Reported-by: Carol L Soto 
Signed-off-by: Gavin Shan 
Tested-by: Carol L Soto 
Signed-off-by: Alexey Kardashevskiy 
---

This is a repost. Since Gavin left the team, I am trying to push it out.
The previous conversation is here: https://patchwork.ozlabs.org/patch/732653/

Two questions were raised then. I'll try to comment on this below.

>1) "res" is already in the resource tree, so we shouldn't be changing
>   its start address, because that may make the tree inconsistent,
>   e.g., the resource may no longer be completely contained in its
>   parent, it may conflict with a sibling, etc.

We should not, yes. But...

At boot time the IOV BAR gets as much MMIO space as it can possibly use.
(Embarrassingly, I cannot trace where this is coming from; 8GB is selected
via the pci_assign_unassigned_root_bus_resources() path somehow.)
For example, it is 256*32MB=8GB, where 256 is the maximum number of PEs and
32MB is the PF/VF BAR size. Whatever shifting we do afterwards, the
boundaries of that 8GB area do not change, and we test this in
pnv_pci_vf_resource_shift():

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/pci-ioda.c#n987

> 2) If we update "res->start", shouldn't we update "res->end"
>   correspondingly?

We have to update the PF's IOV BAR address as we allocate PEs dynamically
and we do not know in advance where our VF numbers start in that
8GB window. So we change the IOV BAR start. Changing the end may make it
look like there is a free area to use, but in reality it won't be
usable, just like the area we "release" by shifting the start address.

We could probably move that M64 MMIO window by the same delta in the
opposite direction so the IOV BAR start address would remain the same
but its VF#0 would be mapped to, let's say, PF#5. I am just afraid there
is an alignment requirement for the M64 window start address, and this
would be even more tricky to manage.

We could also create reserved areas for the amount of space "released" by
moving the start address; not sure how, though.
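
To make the constraint concrete, here is a worked example of the shift (all
numbers assumed for illustration):

#include <stdio.h>

int main(void)
{
        unsigned long long win_base = 0x3e0000000000ULL;  /* assumed base  */
        unsigned long long vf_bar = 32ULL << 20;          /* 32MB per PE   */
        unsigned long long window = 256 * vf_bar;         /* 8GB, fixed    */
        int first_pe = 5;                                 /* assumed PE#   */

        printf("window: %#llx-%#llx (boundaries never change)\n",
               win_base, win_base + window - 1);
        /* only the start moves; the skipped head and the tail past the
         * original end are both unusable */
        printf("IOV BAR start after shift: %#llx\n",
               win_base + first_pe * vf_bar);
        return 0;
}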

So how do we proceed with this particular patch now? Thanks.
---
 drivers/pci/iov.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 120485d6f352..ac41c8be9200 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -331,7 +331,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
while (i--)
pci_iov_remove_virtfn(dev, i, 0);
 
-   pcibios_sriov_disable(dev);
 err_pcibios:
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -339,6 +338,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
ssleep(1);
pci_cfg_access_unlock(dev);
 
+   pcibios_sriov_disable(dev);
+
if (iov->link != dev->devfn)
sysfs_remove_link(>dev.kobj, "dep_link");
 
@@ -357,14 +358,14 @@ static void sriov_disable(struct pci_dev *dev)
for (i = 0; i < iov->num_VFs; i++)
pci_iov_remove_virtfn(dev, i, 0);
 
-   pcibios_sriov_disable(dev);
-
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
ssleep(1);
pci_cfg_access_unlock(dev);
 
+   pcibios_sriov_disable(dev);
+
if (iov->link != dev->devfn)
sysfs_remove_link(>dev.kobj, "dep_link");
 
-- 
2.11.0



Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-11 Thread Michal Hocko
On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:
> Struct pages are initialized by going through __init_single_page(). Since
> the existing physical memory in memblock is represented in memblock.memory
> list, struct page for every page from this list goes through
> __init_single_page().

By a page _from_ this list you mean struct pages backing the physical
memory of the memblock lists?
 
> The second memblock list: memblock.reserved, manages the allocated memory.
> The memory that won't be available to kernel allocator. So, every page from
> this list goes through reserve_bootmem_region(), where certain struct page
> fields are set, the assumption being that the struct pages have been
> initialized beforehand.
> 
> In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
> memblock.memory might start at a later PFN. For example, in QEMU,
> e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
> so PFN 0 is not on memblock.memory (and hence isn't initialized via
> __init_single_page) but is on memblock.reserved (and hence we set fields in
> the uninitialized struct page).
> 
> Currently, the struct page memory is always zeroed during allocation,
> which prevents this problem from being detected. But, if some asserts
> provided by CONFIG_DEBUG_VM_PGFLAGS are tighten, this problem may become
> visible in existing kernels.
> 
> In this patchset we will stop zeroing struct page memory during allocation.
> Therefore, this bug must be fixed in order to avoid random assert failures
> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> 
> The fix is to reserve memory from the first existing PFN.

Hmm, I assume this is a result of some assert triggering, right? Which
one? Why don't we need the same treatment for arches other than x86?

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

I guess that the review happened inhouse. I do not want to question its
value but it is rather strange to not hear the specific review comments
which might be useful in general and moreover even not include those
people on the CC list so they are aware of the follow up discussion.

> ---
>  arch/x86/kernel/setup.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3486d0498800..489cdc141bcb 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
>  
>  static void __init trim_low_memory_range(void)
>  {
> - memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> + unsigned long min_pfn = find_min_pfn_with_active_regions();
> + phys_addr_t base = min_pfn << PAGE_SHIFT;
> +
> + memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
>  }
>   
>  /*
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [v6 00/15] complete deferred page initialization

2017-08-11 Thread Michal Hocko
[I am sorry I didn't get to your previous versions]

On Mon 07-08-17 16:38:34, Pavel Tatashin wrote:
[...]
> SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
> which defers initializing struct pages until all cpus have been started so
> it can be done in parallel.
> 
> However, this feature is sub-optimal, because the deferred page
> initialization code expects that the struct pages have already been zeroed,
> and the zeroing is done early in boot with a single thread only.  Also, we
> access that memory and set flags before struct pages are initialized. All
> of this is fixed in this patchset.
> 
> In this work we do the following:
> - Never read access struct page until it was initialized

How is this enforced? What about pfn walkers? E.g. page_ext
initialization code (page owner in particular)

> - Never set any fields in struct pages before they are initialized
> - Zero struct page at the beginning of struct page initialization

Please give us a more high-level description of how your reimplementation
works and how the patchset is organized. I will go through those patches
but it is always good to give an overview in the cover letter to make
the review easier.

> Performance improvements on an x86 machine with 8 nodes:
> Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
> 
> Single threaded struct page init: 7.6s/T improvement
> Deferred struct page init: 10.2s/T improvement

What are the before and after numbers, and how have you measured them?
> 
> Pavel Tatashin (15):
>   x86/mm: reserve only existing low pages
>   x86/mm: setting fields in deferred pages
>   sparc64/mm: setting fields in deferred pages
>   mm: discard memblock data later
>   mm: don't access uninitialized struct pages
>   sparc64: simplify vmemmap_populate
>   mm: defining memblock_virt_alloc_try_nid_raw
>   mm: zero struct pages during initialization
>   sparc64: optimized struct page zeroing
>   x86/kasan: explicitly zero kasan shadow memory
>   arm64/kasan: explicitly zero kasan shadow memory
>   mm: explicitly zero pagetable memory
>   mm: stop zeroing memory during allocation in vmemmap
>   mm: optimize early system hash allocations
>   mm: debug for raw allocator
> 
>  arch/arm64/mm/kasan_init.c  |  42 ++
>  arch/sparc/include/asm/pgtable_64.h |  30 +++
>  arch/sparc/mm/init_64.c |  31 +++-
>  arch/x86/kernel/setup.c |   5 +-
>  arch/x86/mm/init_64.c   |   9 ++-
>  arch/x86/mm/kasan_init_64.c |  67 
>  include/linux/bootmem.h |  27 +++
>  include/linux/memblock.h|   9 ++-
>  include/linux/mm.h  |   9 +++
>  mm/memblock.c   | 152 
>  mm/nobootmem.c  |  16 
>  mm/page_alloc.c |  31 +---
>  mm/sparse-vmemmap.c |  10 ++-
>  mm/sparse.c |   6 +-
>  14 files changed, 356 insertions(+), 88 deletions(-)
> 
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs


Re: [RFC v7 09/25] powerpc: store and restore the pkey state across context switches

2017-08-11 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> Ram Pai  writes:
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -42,6 +42,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -1096,6 +1097,13 @@ static inline void save_sprs(struct thread_struct *t)
>>  t->tar = mfspr(SPRN_TAR);
>>  }
>>  #endif
>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>> +if (arch_pkeys_enabled()) {
>> +t->amr = mfspr(SPRN_AMR);
>> +t->iamr = mfspr(SPRN_IAMR);
>> +t->uamor = mfspr(SPRN_UAMOR);
>> +}
>> +#endif
>>  }
>
> Is it worth having a flag in thread_struct saying whether it has ever
> called pkey_alloc and only do the mfsprs if it did?

Yes, in fact there's a programming note in the UAMOR section of the arch
that says exactly that.

On the write side you have to be a bit more careful. You have to make
sure you set the UAMOR to 0 when you're switching from a process that
has used keys to one that isn't.
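
A sketch of how that write side could look in restore_sprs(), assuming a
hypothetical thread_struct flag (called used_pkeys here) that records
whether the task ever allocated a key:

	#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
		if (arch_pkeys_enabled()) {
			if (new_thread->used_pkeys) {
				mtspr(SPRN_AMR, new_thread->amr);
				mtspr(SPRN_IAMR, new_thread->iamr);
				mtspr(SPRN_UAMOR, new_thread->uamor);
			} else if (old_thread->used_pkeys) {
				/* Coming from a key user: clear UAMOR so
				 * the incoming task cannot change any AMR
				 * bits from userspace. */
				mtspr(SPRN_UAMOR, 0);
			}
		}
	#endif

That keeps the common no-keys context switch free of the extra mtsprs.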

cheers


[PATCH 1/2] powerpc/powernv/npu: Move tlb flush before launching ATSD

2017-08-11 Thread Alistair Popple
The nest MMU TLB flush needs to happen before the GPU translation shootdown
is launched, to avoid the GPU refilling its TLB with stale nest MMU
translations before the nest MMU flush completes.

Signed-off-by: Alistair Popple 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/platforms/powernv/npu-dma.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index b5d960d..3d4f879 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -546,6 +546,12 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
unsigned long pid = npu_context->mm->context.id;
 
/*
+* Unfortunately the nest mmu does not support flushing specific
+* addresses so we have to flush the whole mm.
+*/
+   flush_tlb_mm(npu_context->mm);
+
+   /*
 * Loop over all the NPUs this process is active on and launch
 * an invalidate.
 */
@@ -576,12 +582,6 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
}
}
 
-   /*
-* Unfortunately the nest mmu does not support flushing specific
-* addresses so we have to flush the whole mm.
-*/
-   flush_tlb_mm(npu_context->mm);
-
mmio_invalidate_wait(mmio_atsd_reg, flush);
if (flush)
/* Wait for the flush to complete */
-- 
2.1.4
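
To make the required ordering explicit, the invalidation path after this
patch boils down to the following condensed sketch (the NPU loop is
abbreviated; the names come from the diff above):

	static void mmio_invalidate(struct npu_context *npu_context, int va,
				    unsigned long address, bool flush)
	{
		struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];

		/* 1. Flush the nest MMU first; it cannot flush by address,
		 *    so the whole mm goes. */
		flush_tlb_mm(npu_context->mm);

		/* 2. Only now launch the GPU TLB shootdown. Any refill the
		 *    GPU triggers from here on is serviced through a nest
		 *    MMU that no longer holds stale translations. */
		/* ... loop over NPUs issuing mmio_invalidate_va() or
		 *     mmio_invalidate_pid() into mmio_atsd_reg ... */

		/* 3. Wait for the ATSDs to complete. */
		mmio_invalidate_wait(mmio_atsd_reg, flush);
	}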



[PATCH 2/2] powerpc/powernv/npu: Don't explicitly flush nmmu tlb

2017-08-11 Thread Alistair Popple
The nest MMU required an explicit flush as a tlbi would not flush it in the
same way as the core. However, an alternate firmware fix exists which should
eliminate the need for this flush, so instead add a device-tree property
(ibm,nmmu-flush) on the NVLink2 PHB to enable the explicit flush only where
it is still required.

Signed-off-by: Alistair Popple 
---

Michael,

This patch depends on http://patchwork.ozlabs.org/patch/796775/ - [v3,1/3]
powerpc/mm: Add marker for contexts requiring global TLB invalidations.

- Alistair

 arch/powerpc/platforms/powernv/npu-dma.c | 27 +++++++++++++++++++++------
 arch/powerpc/platforms/powernv/pci.h |  3 +++
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 3d4f879..ac07800 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -544,12 +544,7 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
struct pci_dev *npdev;
struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
unsigned long pid = npu_context->mm->context.id;
-
-   /*
-* Unfortunately the nest mmu does not support flushing specific
-* addresses so we have to flush the whole mm.
-*/
-   flush_tlb_mm(npu_context->mm);
+   bool nmmu_flushed = false;
 
/*
 * Loop over all the NPUs this process is active on and launch
@@ -566,6 +561,17 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
npu = &nphb->npu;
mmio_atsd_reg[i].npu = npu;
 
+   if (nphb->npu.nmmu_flush && !nmmu_flushed) {
+   /*
+* Unfortunately the nest mmu does not support
+* flushing specific addresses so we have to
+* flush the whole mm once before shooting down
+* the GPU translation.
+*/
+   flush_tlb_mm(npu_context->mm);
+   nmmu_flushed = true;
+   }
+
if (va)
mmio_atsd_reg[i].reg =
mmio_invalidate_va(npu, address, pid,
@@ -732,6 +738,13 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
return ERR_PTR(-ENODEV);
npu_context->npdev[npu->index][nvlink_index] = npdev;
 
+   if (!nphb->npu.nmmu_flush)
+   /*
+* If we're not explicitly flushing ourselves we need to mark
+* the thread for global flushes
+*/
+   mm_context_set_global_tlbi(&mm->context);
+
return npu_context;
 }
 EXPORT_SYMBOL(pnv_npu2_init_context);
@@ -829,6 +842,8 @@ int pnv_npu2_init(struct pnv_phb *phb)
static int npu_index;
uint64_t rc = 0;
 
+   phb->npu.nmmu_flush =
+   of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
for_each_child_of_node(phb->hose->dn, dn) {
gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
if (gpdev) {
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f16bc40..e8e3e20 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -184,6 +184,9 @@ struct pnv_phb {
 
/* Bitmask for MMIO register usage */
unsigned long mmio_atsd_usage;
+
+   /* Do we need to explicitly flush the nest mmu? */
+   bool nmmu_flush;
} npu;
 
 #ifdef CONFIG_CXL_BASE
-- 
2.1.4
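
Condensed, the decision logic this patch introduces looks as follows (a
sketch of the two paths, not the literal code; the helper name is made up
for illustration):

	/*
	 * "ibm,nmmu-flush" present on the PHB node: firmware does not
	 * handle the nest MMU, so the kernel must flush it explicitly,
	 * once, before launching the ATSDs.
	 *
	 * Property absent: the firmware fix is active, so skip the
	 * explicit flush and instead mark the context for global TLB
	 * invalidations via mm_context_set_global_tlbi().
	 */
	static void nmmu_flush_if_needed(struct pnv_phb *nphb,
					 struct npu_context *npu_context,
					 bool *nmmu_flushed)
	{
		if (nphb->npu.nmmu_flush && !*nmmu_flushed) {
			flush_tlb_mm(npu_context->mm);
			*nmmu_flushed = true;
		}
	}

Either way the GPU ATSD shootdown itself still happens; only who flushes
the nest MMU changes.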