Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-02-04 Thread Christophe Leroy




On 27/11/2019 at 13:01, Gautham R. Shenoy wrote:

From: "Gautham R. Shenoy" 

On Pseries LPARs, to calculate utilization, we need to know the
[S]PURR ticks when the CPUs were busy or idle.

The total PURR and SPURR ticks are already exposed via the per-cpu
sysfs files /sys/devices/system/cpu/cpuX/purr and
/sys/devices/system/cpu/cpuX/spurr.

This patch adds support for exposing the idle PURR and SPURR ticks via
/sys/devices/system/cpu/cpuX/idle_purr and
/sys/devices/system/cpu/cpuX/idle_spurr.


This might be a naive question, but I see in arch/powerpc/kernel/time.c that 
PURR/SPURR are already taken into account by the kernel to calculate 
utilisation when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected.


As far as I understand, you want to expose this to userland to 
redo the calculation there. What is wrong with the values reported by 
the kernel?


Christophe



Signed-off-by: Gautham R. Shenoy 
---
  arch/powerpc/kernel/sysfs.c | 32 
  1 file changed, 32 insertions(+)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 80a676d..42ade55 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -1044,6 +1044,36 @@ static ssize_t show_physical_id(struct device *dev,
  }
  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
  
+static ssize_t idle_purr_show(struct device *dev,

+ struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   unsigned int cpuid = cpu->dev.id;
+   struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
+   u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
+
+   return sprintf(buf, "%llx\n", idle_purr_cycles);
+}
+static DEVICE_ATTR_RO(idle_purr);
+
+DECLARE_PER_CPU(u64, idle_spurr_cycles);
+static ssize_t idle_spurr_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct cpu *cpu = container_of(dev, struct cpu, dev);
+   unsigned int cpuid = cpu->dev.id;
+   u64 *idle_spurr_cycles_ptr = per_cpu_ptr(&idle_spurr_cycles, cpuid);
+
+   return sprintf(buf, "%llx\n", *idle_spurr_cycles_ptr);
+}
+static DEVICE_ATTR_RO(idle_spurr);
+
+static void create_idle_purr_spurr_sysfs_entry(struct device *cpudev)
+{
+   device_create_file(cpudev, &dev_attr_idle_purr);
+   device_create_file(cpudev, &dev_attr_idle_spurr);
+}
+
  static int __init topology_init(void)
  {
int cpu, r;
@@ -1067,6 +1097,8 @@ static int __init topology_init(void)
register_cpu(c, cpu);
  
			device_create_file(&c->dev, &dev_attr_physical_id);

+   if (firmware_has_feature(FW_FEATURE_SPLPAR))
+   create_idle_purr_spurr_sysfs_entry(&c->dev);
}
}
r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/topology:online",



Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-02-04 Thread Naveen N. Rao

Gautham R Shenoy wrote:



With respect to lparstat, the read interval is user-specified and just gets
passed onto sleep().


Ok. So I guess currently you will be sending smp_call_function every
time you read a PURR and SPURR. That number will now increase by 2
times when we read idle_purr and idle_spurr.


Yes, not really efficient. I just wanted to point out that we can't have 
stale data being returned if we choose to add another sysfs file.


We should be able to use any other interface too, if you have a 
different interface in mind.
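
To make that concrete, here is a minimal sketch of the IPI-based variant
discussed above (illustrative only, not the posted patch; the helper name
read_idle_purr_on_cpu() is made up, the rest uses the existing lppaca/paca
accessors):

static void read_idle_purr_on_cpu(void *val)
{
        struct lppaca *lp = local_paca->lppaca_ptr;

        *(u64 *)val = be64_to_cpu(lp->wait_state_cycles);
}

static ssize_t idle_purr_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
{
        struct cpu *cpu = container_of(dev, struct cpu, dev);
        u64 idle_purr_cycles = 0;

        /* The IPI wakes the target CPU out of (extended) cede, so the
         * snapshot below cannot be stale. */
        smp_call_function_single(cpu->dev.id, read_idle_purr_on_cpu,
                                 &idle_purr_cycles, 1);

        return sprintf(buf, "%llx\n", idle_purr_cycles);
}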



- Naveen



Re: [PATCH] powerpc/vdso32: mark __kernel_datapage_offset as STV_PROTECTED

2020-02-04 Thread Christophe Leroy




On 05/02/2020 at 01:50, Fangrui Song wrote:

A PC-relative relocation (R_PPC_REL16_LO in this case) referencing a
preemptible symbol in a -shared link is not allowed.  GNU ld's powerpc
port is permissive and allows it [1], but lld will report an error after
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=ec0895f08f99515194e9fcfe1338becf6f759d38


Note that there is a series whose first two patches aim at dropping 
__kernel_datapage_offset. See 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=156045 
and especially patches https://patchwork.ozlabs.org/patch/1231467/ and 
https://patchwork.ozlabs.org/patch/1231461/


Those patches can be applied independently of the rest.

Christophe



Make the symbol protected so that it is non-preemptible but still
exported.

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=25500

Link: https://github.com/ClangBuiltLinux/linux/issues/851
Signed-off-by: Fangrui Song 
---
  arch/powerpc/kernel/vdso32/datapage.S | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vdso32/datapage.S 
b/arch/powerpc/kernel/vdso32/datapage.S
index 217bb630f8f9..2831a8676365 100644
--- a/arch/powerpc/kernel/vdso32/datapage.S
+++ b/arch/powerpc/kernel/vdso32/datapage.S
@@ -13,7 +13,8 @@
  #include 
  
  	.text

-   .global __kernel_datapage_offset;
+   .global __kernel_datapage_offset
+   .protected  __kernel_datapage_offset
  __kernel_datapage_offset:
.long   0
  



Re: [PATCH v2 2/3] selftests/powerpc: Add tm-signal-pagefault test

2020-02-04 Thread Michael Ellerman
Gustavo Luiz Duarte  writes:
> This test triggers a TM Bad Thing by raising a signal in transactional state
> and forcing a pagefault to happen in kernelspace when the kernel signal
> handling code first touches the user signal stack.
>
> This is inspired by the test tm-signal-context-force-tm but uses userfaultfd 
> to
> make the test deterministic. While this test always triggers the bug in one
> run, I had to execute tm-signal-context-force-tm several times (the test runs
> 5000 times each execution) to trigger the same bug.

Using userfaultfd is a very nice touch. But it's not always enabled,
which leads to eg:

  root@mpe-ubuntu-le:~# /home/michael/tm-signal-pagefault 
  test: tm_signal_pagefault
  tags: git_version:v5.5-9354-gc1e346e7fc44
  userfaultfd() failed: Function not implemented
  failure: tm_signal_pagefault

It would be nice if that resulted in a skip, not a failure.

It looks like it shouldn't be too hard to skip if the userfaultfd call
returns ENOSYS.
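
Something along these lines should do it (a sketch only, assuming the
SKIP_IF() helper from the powerpc selftest harness's utils.h; the function
name and the elided test body are illustrative):

#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include "utils.h"              /* SKIP_IF(), test_harness() */

static int tm_signal_pagefault_test(void)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        /* Kernel built without CONFIG_USERFAULTFD: skip instead of fail. */
        SKIP_IF(uffd == -1 && errno == ENOSYS);

        /* ... rest of the test as in the patch ... */
        close(uffd);
        return 0;
}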

cheers


[PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device

2020-02-04 Thread Aneesh Kumar K.V
Currently, the kernel shows the below values
"persistence_domain":"cpu_cache"
"persistence_domain":"memory_controller"
"persistence_domain":"unknown"

"cpu_cache" indicates no extra instructions is needed to ensure the persistence
of data in the pmem media on power failure.

"memory_controller" indicates platform provided instructions need to be issued
as per documented sequence to make sure data get flushed so that it is
guaranteed to be on pmem media in case of system power loss.

Based on the above, use memory_controller for non-volatile regions on ppc64.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 7 ++-
 drivers/nvdimm/of_pmem.c  | 4 +++-
 include/linux/libnvdimm.h | 1 -
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 7525635a8536..ffcd0d7a867c 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -359,8 +359,13 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
-   else
+   else {
+   /*
+* We need to flush things correctly to guarantee persistence
+*/
+   set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
p->region = nvdimm_pmem_region_create(p->bus, &ndr_desc);
+   }
if (!p->region) {
dev_err(dev, "Error registering region %pR from %pOF\n",
ndr_desc.res, p->dn);
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index 8224d1431ea9..6826a274a1f1 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -62,8 +62,10 @@ static int of_pmem_region_probe(struct platform_device *pdev)
 
if (is_volatile)
region = nvdimm_volatile_region_create(bus, &ndr_desc);
-   else
+   else {
+   set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
region = nvdimm_pmem_region_create(bus, &ndr_desc);
+   }
 
if (!region)
dev_warn(&pdev->dev, "Unable to register region %pR from %pOF\n",
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 0f366706b0aa..771d888a5ed7 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -54,7 +54,6 @@ enum {
/*
 * Platform provides mechanisms to automatically flush outstanding
 * write data from memory controler to pmem on system power loss.
-* (ADR)
 */
ND_REGION_PERSIST_MEMCTRL = 2,
 
-- 
2.24.1



Re: [PATCH v2 3/6] powerpc/fsl_booke/64: implement KASLR for fsl_booke64

2020-02-04 Thread kbuild test robot
Hi Jason,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.5 next-20200204]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Jason-Yan/implement-KASLR-for-powerpc-fsl_booke-64/20200205-105837
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 7.5.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/setup_64.c: In function 'early_setup':
>> arch/powerpc/kernel/setup_64.c:303:2: error: implicit declaration of 
>> function 'kaslr_early_init'; did you mean 'udbg_early_init'? 
>> [-Werror=implicit-function-declaration]
 kaslr_early_init(__va(dt_ptr), 0);
 ^~~~
 udbg_early_init
   cc1: all warnings being treated as errors

vim +303 arch/powerpc/kernel/setup_64.c

   262  
   263  /*
   264   * Early initialization entry point. This is called by head.S
   265   * with MMU translation disabled. We rely on the "feature" of
   266   * the CPU that ignores the top 2 bits of the address in real
   267   * mode so we can access kernel globals normally provided we
   268   * only toy with things in the RMO region. From here, we do
   269   * some early parsing of the device-tree to setup out MEMBLOCK
   270   * data structures, and allocate & initialize the hash table
   271   * and segment tables so we can start running with translation
   272   * enabled.
   273   *
   274   * It is this function which will call the probe() callback of
   275   * the various platform types and copy the matching one to the
   276   * global ppc_md structure. Your platform can eventually do
   277   * some very early initializations from the probe() routine, but
   278   * this is not recommended, be very careful as, for example, the
   279   * device-tree is not accessible via normal means at this point.
   280   */
   281  
   282  void __init early_setup(unsigned long dt_ptr)
   283  {
   284  static __initdata struct paca_struct boot_paca;
   285  
   286  /*  printk is _NOT_ safe to use here ! --- */
   287  
   288  /* Try new device tree based feature discovery ... */
   289  if (!dt_cpu_ftrs_init(__va(dt_ptr)))
   290  /* Otherwise use the old style CPU table */
   291  identify_cpu(0, mfspr(SPRN_PVR));
   292  
   293  /* Assume we're on cpu 0 for now. Don't write to the paca yet! 
*/
   294  initialise_paca(&boot_paca, 0);
   295  setup_paca(&boot_paca);
   296  fixup_boot_paca();
   297  
   298  /*  printk is now safe to use --- */
   299  
   300  /* Enable early debugging if any specified (see udbg.h) */
   301  udbg_early_init();
   302  
 > 303  kaslr_early_init(__va(dt_ptr), 0);
   304  
   305  udbg_printf(" -> %s(), dt_ptr: 0x%lx\n", __func__, dt_ptr);
   306  
   307  /*
   308   * Do early initialization using the flattened device
   309   * tree, such as retrieving the physical memory map or
   310   * calculating/retrieving the hash table size.
   311   */
   312  early_init_devtree(__va(dt_ptr));
   313  
   314  /* Now we know the logical id of our boot cpu, setup the paca. 
*/
   315  if (boot_cpuid != 0) {
   316  /* Poison paca_ptrs[0] again if it's not the boot cpu */
   317  memset(&paca_ptrs[0], 0x88, sizeof(paca_ptrs[0]));
   318  }
   319  setup_paca(paca_ptrs[boot_cpuid]);
   320  fixup_boot_paca();
   321  
   322  /*
   323   * Configure exception handlers. This include setting up 
trampolines
   324   * if needed, setting exception endian mode, etc...
   325   */
   326  configure_exceptions();
   327  
   328  /*
   329   * Configure Kernel Userspace Protection. This needs to happen 
before
   330   * feature fixups for platforms that implement this using 
features.
   331   */
   332  setup_kup();
   333  
   334  /* Apply all the dynamic patching */
   335  apply_feature_fixups();
   336  setup_feature_keys();
   337  
   338  early_ioremap_setup();
   339  
   340  

Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] after treclaim

2020-02-04 Thread Michael Neuling
Other than the minor things below that I think you need, the patch is good with me.

Acked-by: Michael Neuling 

> Subject: Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] 
> after treclaim

The subject should mention "signals".

On Mon, 2020-02-03 at 13:09 -0300, Gustavo Luiz Duarte wrote:
> After a treclaim, we expect to be in non-transactional state. If we don't
> immediately clear the current thread's MSR[TS] and we get preempted, then
> tm_recheckpoint_new_task() will recheckpoint and we get rescheduled in
> suspended transaction state.

It's not "immediately", it's before re-enabling preemption. 

There is a similar comment in the code that needs to be fixed too.

> When handling a signal caught in transactional state, handle_rt_signal64()
> calls get_tm_stackpointer() that treclaims the transaction using
> tm_reclaim_current() but without clearing the thread's MSR[TS]. This can cause
> the TM Bad Thing exception below if later we pagefault and get preempted 
> trying
> to access the user's sigframe, using __put_user(). Afterwards, when we are
> rescheduled back into do_page_fault() (but now in suspended state since the
> thread's MSR[TS] was not cleared), upon executing 'rfid' after completion of
> the page fault handling, the exception is raised because a transition from
> suspended to non-transactional state is invalid.
> 
>   Unexpected TM Bad Thing exception at c000de44 (msr 
> 0x800302a03031) tm_scratch=80010280b033
>   Oops: Unrecoverable exception, sig: 6 [#1]
>   LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>   Modules linked in: nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink xts 
> vmx_crypto sg virtio_balloon
>   r_mod cdrom virtio_net net_failover virtio_blk virtio_scsi failover 
> dm_mirror dm_region_hash dm_log dm_mod
>   CPU: 25 PID: 15547 Comm: a.out Not tainted 5.4.0-rc2 #32
>   NIP:  c000de44 LR: c0034728 CTR: 
>   REGS: c0003fe7bd70 TRAP: 0700   Not tainted  (5.4.0-rc2)
>   MSR:  800302a03031   CR: 44000884 
>  XER: 
>   CFAR: c000dda4 IRQMASK: 0
>   PACATMSCRATCH: 80010280b033
>   GPR00: c0034728 c00f65a17c80 c1662800 
> 7fffacf3fd78
>   GPR04: 1000 1000  
> c00f611f8af0
>   GPR08:  78006001  
> 000c
>   GPR12: c00f611f84b0 c0003ffcb200  
> 
>   GPR16:    
> 
>   GPR20:    
> c00f611f8140
>   GPR24:  7fffacf3fd68 c00f65a17d90 
> c00f611f7800
>   GPR28: c00f65a17e90 c00f65a17e90 c1685e18 
> 7fffacf3f000
>   NIP [c000de44] fast_exception_return+0xf4/0x1b0
>   LR [c0034728] handle_rt_signal64+0x78/0xc50
>   Call Trace:
>   [c00f65a17c80] [c0034710] handle_rt_signal64+0x60/0xc50 
> (unreliable)
>   [c00f65a17d30] [c0023640] do_notify_resume+0x330/0x460
>   [c00f65a17e20] [c000dcc4] ret_from_except_lite+0x70/0x74
>   Instruction dump:
>   7c4ff120 e8410170 7c5a03a6 3840 f8410060 e8010070 e8410080 e8610088
>   6000 6000 e8810090 e8210078 <4c24> 4800 e8610178 
> 88ed0989
>   ---[ end trace 93094aa44b442f87 ]---
> 
> The simplified sequence of events that triggers the above exception is:
> 
>   ... # userspace in NON-TRANSACTIONAL state
>   tbegin  # userspace in TRANSACTIONAL state
>   signal delivery # kernelspace in SUSPENDED state
>   handle_rt_signal64()
> get_tm_stackpointer()
>   treclaim# kernelspace in NON-TRANSACTIONAL state
> __put_user()
>   page fault happens. We will never get back here because of the TM Bad 
> Thing exception.
> 
>   page fault handling kicks in and we voluntarily preempt ourselves
>   do_page_fault()
> __schedule()
>   __switch_to(other_task)
> 
>   our task is rescheduled and we recheckpoint because the thread's MSR[TS] 
> was not cleared
>   __switch_to(our_task)
> switch_to_tm()
>   tm_recheckpoint_new_task()
> trechkpt  # kernelspace in SUSPENDED state
> 
>   The page fault handling resumes, but now we are in suspended transaction 
> state
>   do_page_fault() completes
>   rfid <- trying to get back where the page fault happened (we were 
> non-transactional back then)
>   TM Bad Thing# illegal transition from suspended to 
> non-transactional
> 
> This patch fixes that issue by clearing the current thread's MSR[TS] just 
> after
> treclaim in get_tm_stackpointer() so that we stay in 

Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-02-04 Thread Gautham R Shenoy
Hi Naveen,

On Tue, Feb 04, 2020 at 01:22:19PM +0530, Naveen N. Rao wrote:
> Gautham R Shenoy wrote:
> >Hi Naveen,
> >
> >On Thu, Dec 05, 2019 at 10:23:58PM +0530, Naveen N. Rao wrote:
> >>>diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> >>>index 80a676d..42ade55 100644
> >>>--- a/arch/powerpc/kernel/sysfs.c
> >>>+++ b/arch/powerpc/kernel/sysfs.c
> >>>@@ -1044,6 +1044,36 @@ static ssize_t show_physical_id(struct device *dev,
> >>> }
> >>> static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> >>>
> >>>+static ssize_t idle_purr_show(struct device *dev,
> >>>+struct device_attribute *attr, char *buf)
> >>>+{
> >>>+  struct cpu *cpu = container_of(dev, struct cpu, dev);
> >>>+  unsigned int cpuid = cpu->dev.id;
> >>>+  struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
> >>>+  u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
> >>>+
> >>>+  return sprintf(buf, "%llx\n", idle_purr_cycles);
> >>>+}
> >>>+static DEVICE_ATTR_RO(idle_purr);
> >>>+
> >>>+DECLARE_PER_CPU(u64, idle_spurr_cycles);
> >>>+static ssize_t idle_spurr_show(struct device *dev,
> >>>+ struct device_attribute *attr, char *buf)
> >>>+{
> >>>+  struct cpu *cpu = container_of(dev, struct cpu, dev);
> >>>+  unsigned int cpuid = cpu->dev.id;
> >>>+  u64 *idle_spurr_cycles_ptr = per_cpu_ptr(&idle_spurr_cycles, cpuid);
> >>
> >>Is it possible for a user to read stale values if a particular cpu is in an
> >>extended cede? Is it possible to use smp_call_function_single() to force the
> >>cpu out of idle?
> >
> >Yes, if the CPU whose idle_spurr cycle is being read is still in idle,
> >then we will miss reporting the delta spurr cycles for this last
> >idle-duration. Yes, we can use an smp_call_function_single(), though
> >that will introduce IPI noise. How often will idle_[s]purr be read ?
> 
> Since it is possible for a cpu to go into extended cede for multiple seconds
> during which time it is possible to mis-report utilization, I think it is
> better to ensure that the sysfs interface for idle_[s]purr report the proper
> values through use of IPI.
>

Fair enough.


> With respect to lparstat, the read interval is user-specified and just gets
> passed onto sleep().

Ok. So I guess currently you will be sending smp_call_function every
time you read a PURR and SPURR. That number will now increase by 2
times when we read idle_purr and idle_spurr.


> 
> - Naveen
> 

--
Thanks and Regards
gautham.


[PATCH v2 1/6] powerpc/fsl_booke/kaslr: refactor kaslr_legal_offset() and kaslr_early_init()

2020-02-04 Thread Jason Yan
Some code refactoring in kaslr_legal_offset() and kaslr_early_init(). No
functional change. This is preparation for KASLR on fsl_booke64.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/mm/nohash/kaslr_booke.c | 40 ++--
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index 4a75f2d9bf0e..07b036e98353 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -25,6 +25,7 @@ struct regions {
unsigned long pa_start;
unsigned long pa_end;
unsigned long kernel_size;
+   unsigned long linear_sz;
unsigned long dtb_start;
unsigned long dtb_end;
unsigned long initrd_start;
@@ -260,11 +261,23 @@ static __init void get_cell_sizes(const void *fdt, int 
node, int *addr_cells,
*size_cells = fdt32_to_cpu(*prop);
 }
 
-static unsigned long __init kaslr_legal_offset(void *dt_ptr, unsigned long 
index,
-  unsigned long offset)
+static unsigned long __init kaslr_legal_offset(void *dt_ptr, unsigned long 
random)
 {
unsigned long koffset = 0;
unsigned long start;
+   unsigned long index;
+   unsigned long offset;
+
+   /*
+* Decide which 64M we want to start
+* Only use the low 8 bits of the random seed
+*/
+   index = random & 0xFF;
+   index %= regions.linear_sz / SZ_64M;
+
+   /* Decide offset inside 64M */
+   offset = random % (SZ_64M - regions.kernel_size);
+   offset = round_down(offset, SZ_16K);
 
while ((long)index >= 0) {
offset = memstart_addr + index * SZ_64M + offset;
@@ -289,10 +302,9 @@ static inline __init bool kaslr_disabled(void)
 static unsigned long __init kaslr_choose_location(void *dt_ptr, phys_addr_t 
size,
  unsigned long kernel_sz)
 {
-   unsigned long offset, random;
+   unsigned long random;
unsigned long ram, linear_sz;
u64 seed;
-   unsigned long index;
 
kaslr_get_cmdline(dt_ptr);
if (kaslr_disabled())
@@ -333,22 +345,12 @@ static unsigned long __init kaslr_choose_location(void 
*dt_ptr, phys_addr_t size
regions.dtb_start = __pa(dt_ptr);
regions.dtb_end = __pa(dt_ptr) + fdt_totalsize(dt_ptr);
regions.kernel_size = kernel_sz;
+   regions.linear_sz = linear_sz;
 
get_initrd_range(dt_ptr);
get_crash_kernel(dt_ptr, ram);
 
-   /*
-* Decide which 64M we want to start
-* Only use the low 8 bits of the random seed
-*/
-   index = random & 0xFF;
-   index %= linear_sz / SZ_64M;
-
-   /* Decide offset inside 64M */
-   offset = random % (SZ_64M - kernel_sz);
-   offset = round_down(offset, SZ_16K);
-
-   return kaslr_legal_offset(dt_ptr, index, offset);
+   return kaslr_legal_offset(dt_ptr, random);
 }
 
 /*
@@ -358,8 +360,6 @@ static unsigned long __init kaslr_choose_location(void 
*dt_ptr, phys_addr_t size
  */
 notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)
 {
-   unsigned long tlb_virt;
-   phys_addr_t tlb_phys;
unsigned long offset;
unsigned long kernel_sz;
 
@@ -375,8 +375,8 @@ notrace void __init kaslr_early_init(void *dt_ptr, 
phys_addr_t size)
is_second_reloc = 1;
 
if (offset >= SZ_64M) {
-   tlb_virt = round_down(kernstart_virt_addr, SZ_64M);
-   tlb_phys = round_down(kernstart_addr, SZ_64M);
+   unsigned long tlb_virt = round_down(kernstart_virt_addr, 
SZ_64M);
+   phys_addr_t tlb_phys = round_down(kernstart_addr, SZ_64M);
 
/* Create kernel map to relocate in */
create_kaslr_tlb_entry(1, tlb_virt, tlb_phys);
-- 
2.17.2



[PATCH v2 6/6] powerpc/fsl_booke/kaslr: rename kaslr-booke32.rst to kaslr-booke.rst and add 64bit part

2020-02-04 Thread Jason Yan
Now we support both 32-bit and 64-bit KASLR for fsl booke. Add documentation
for the 64-bit part and rename kaslr-booke32.rst to kaslr-booke.rst.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 .../{kaslr-booke32.rst => kaslr-booke.rst}| 35 ---
 1 file changed, 31 insertions(+), 4 deletions(-)
 rename Documentation/powerpc/{kaslr-booke32.rst => kaslr-booke.rst} (59%)

diff --git a/Documentation/powerpc/kaslr-booke32.rst 
b/Documentation/powerpc/kaslr-booke.rst
similarity index 59%
rename from Documentation/powerpc/kaslr-booke32.rst
rename to Documentation/powerpc/kaslr-booke.rst
index 8b259fdfdf03..42121fed8249 100644
--- a/Documentation/powerpc/kaslr-booke32.rst
+++ b/Documentation/powerpc/kaslr-booke.rst
@@ -1,15 +1,18 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-===========================
-KASLR for Freescale BookE32
-===========================
+=========================
+KASLR for Freescale BookE
+=========================
 
 The word KASLR stands for Kernel Address Space Layout Randomization.
 
 This document tries to explain the implementation of the KASLR for
-Freescale BookE32. KASLR is a security feature that deters exploit
+Freescale BookE. KASLR is a security feature that deters exploit
 attempts relying on knowledge of the location of kernel internals.
 
+KASLR for Freescale BookE32
+---------------------------
+
 Since CONFIG_RELOCATABLE has already supported, what we need to do is
 map or copy kernel to a proper place and relocate. Freescale Book-E
 parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
@@ -38,5 +41,29 @@ bit of the entropy to decide the index of the 64M zone. Then 
we chose a
 
   kernstart_virt_addr
 
+
+KASLR for Freescale BookE64
+---------------------------
+
+The implementation for Freescale BookE64 is similar to BookE32. One
+difference is that Freescale BookE64 sets up a TLB mapping of 1G during
+booting. Another difference is that ppc64 needs the kernel to be
+64K-aligned. So we can randomize the kernel in this 1G mapping and make
+it 64K-aligned. This can save some code to create another TLB map at early
+boot. The disadvantage is that we only have about 1G/64K = 16384 slots to
+put the kernel in::
+
+KERNELBASE
+
+  64K |--> kernel <--|
+   |  |  |
++--+--+--++--+--+--+--+--+--+--+--+--++--+--+
+|  |  |  ||  |  |  |  |  |  |  |  |  ||  |  |
++--+--+--++--+--+--+--+--+--+--+--+--++--+--+
+| |1G
+|->   offset<-|
+
+  kernstart_virt_addr
+
 To enable KASLR, set CONFIG_RANDOMIZE_BASE = y. If KASLR is enable and you
 want to disable it at runtime, add "nokaslr" to the kernel cmdline.
-- 
2.17.2



[PATCH v2 5/6] powerpc/fsl_booke/64: clear the original kernel if randomized

2020-02-04 Thread Jason Yan
The original kernel still exists in memory; clear it now.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/mm/nohash/kaslr_booke.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index c6f5c1db1394..ed1277059368 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -378,8 +378,10 @@ notrace void __init kaslr_early_init(void *dt_ptr, 
phys_addr_t size)
unsigned int *__kaslr_offset = (unsigned int *)(KERNELBASE + 0x58);
unsigned int *__run_at_load = (unsigned int *)(KERNELBASE + 0x5c);
 
-   if (*__run_at_load == 1)
+   if (*__run_at_load == 1) {
+   kaslr_late_init();
return;
+   }
 
/* Setup flat device-tree pointer */
initial_boot_params = dt_ptr;
-- 
2.17.2



[PATCH v2 2/6] powerpc/fsl_booke/64: introduce reloc_kernel_entry() helper

2020-02-04 Thread Jason Yan
Like the 32-bit code, introduce a reloc_kernel_entry() helper to prepare
for the 64-bit KASLR version. Also move the C declaration of this function
out of CONFIG_PPC32 and use long instead of int for the parameter 'addr'.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/exceptions-64e.S | 13 +
 arch/powerpc/mm/mmu_decl.h   |  3 ++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index e4076e3c072d..1b9b174bee86 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1679,3 +1679,16 @@ _GLOBAL(setup_ehv_ivors)
 _GLOBAL(setup_lrat_ivor)
SET_IVOR(42, 0x340) /* LRAT Error */
blr
+
+/*
+ * Return to the start of the relocated kernel and run again
+ * r3 - virtual address of fdt
+ * r4 - entry of the kernel
+ */
+_GLOBAL(reloc_kernel_entry)
+   mfmsr   r7
+   rlwinm  r7, r7, 0, ~(MSR_IS | MSR_DS)
+
+   mtspr   SPRN_SRR0,r4
+   mtspr   SPRN_SRR1,r7
+   rfi
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 8e99649c24fc..3e1c85c7d10b 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -140,9 +140,10 @@ extern void adjust_total_lowmem(void);
 extern int switch_to_as1(void);
 extern void restore_to_as0(int esel, int offset, void *dt_ptr, int bootcpu);
 void create_kaslr_tlb_entry(int entry, unsigned long virt, phys_addr_t phys);
-void reloc_kernel_entry(void *fdt, int addr);
 extern int is_second_reloc;
 #endif
+
+void reloc_kernel_entry(void *fdt, long addr);
 extern void loadcam_entry(unsigned int index);
 extern void loadcam_multi(int first_idx, int num, int tmp_idx);
 
-- 
2.17.2



[PATCH v2 3/6] powerpc/fsl_booke/64: implement KASLR for fsl_booke64

2020-02-04 Thread Jason Yan
The implementation for Freescale BookE64 is similar to BookE32. One
difference is that Freescale BookE64 sets up a TLB mapping of 1G during
boot. Another difference is that ppc64 needs the kernel to be
64K-aligned. So we can randomize the kernel in this 1G mapping and make
it 64K-aligned. This saves some code that would otherwise be needed to
create another TLB map at early boot. The disadvantage is that we only
have about 1G/64K = 16384 slots to put the kernel in.

To support secondary CPU boot-up, a variable __kaslr_offset was added in
the first_256B section. This helps the secondary CPU get the KASLR offset
before the 1:1 mapping has been set up.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/kernel/exceptions-64e.S |  8 +++
 arch/powerpc/kernel/head_64.S|  7 ++
 arch/powerpc/kernel/setup_64.c   |  4 +++-
 arch/powerpc/mm/nohash/kaslr_booke.c | 33 +---
 5 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c150a9d49343..754aeb96bb1c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -568,7 +568,7 @@ config RELOCATABLE
 
 config RANDOMIZE_BASE
bool "Randomize the address of the kernel image"
-   depends on (FSL_BOOKE && FLATMEM && PPC32)
+   depends on (PPC_FSL_BOOK3E && FLATMEM)
depends on RELOCATABLE
help
  Randomizes the virtual address at which the kernel image is
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 1b9b174bee86..121daeaf573d 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1378,6 +1378,7 @@ skpinv:   addir6,r6,1 /* 
Increment */
 1: mflrr6
addir6,r6,(2f - 1b)
tovirt(r6,r6)
+   add r6,r6,r19
lis r7,MSR_KERNEL@h
ori r7,r7,MSR_KERNEL@l
mtspr   SPRN_SRR0,r6
@@ -1400,6 +1401,7 @@ skpinv:   addir6,r6,1 /* 
Increment */
 
/* We translate LR and return */
tovirt(r8,r8)
+   add r8,r8,r19
mtlrr8
blr
 
@@ -1528,6 +1530,7 @@ a2_tlbinit_code_end:
  */
 _GLOBAL(start_initialization_book3e)
mflrr28
+   li  r19, 0
 
/* First, we need to setup some initial TLBs to map the kernel
 * text, data and bss at PAGE_OFFSET. We don't have a real mode
@@ -1570,6 +1573,10 @@ _GLOBAL(book3e_secondary_core_init)
cmplwi  r4,0
bne 2f
 
+   LOAD_REG_ADDR_PIC(r19, __kaslr_offset)
+   lwz r19,0(r19)
+   rlwinm  r19,r19,0,0,5
+
/* Setup TLB for this core */
bl  initial_tlb_book3e
 
@@ -1602,6 +1609,7 @@ _GLOBAL(book3e_secondary_core_init)
lis r3,PAGE_OFFSET@highest
sldir3,r3,32
or  r28,r28,r3
+   add r28,r28,r19
 1: mtlrr28
blr
 
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index ad79fddb974d..b4ececc4323d 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -104,6 +104,13 @@ __secondary_hold_acknowledge:
.8byte  0x0
 
 #ifdef CONFIG_RELOCATABLE
+#ifdef CONFIG_PPC_BOOK3E
+   . = 0x58
+   .globl  __kaslr_offset
+__kaslr_offset:
+DEFINE_FIXED_SYMBOL(__kaslr_offset)
+   .long   0
+#endif
/* This flag is set to 1 by a loader if the kernel should run
 * at the loaded address instead of the linked address.  This
 * is used by kexec-tools to keep the the kdump kernel in the
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6104917a282d..a16b970a8d1a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -66,7 +66,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include "setup.h"
 
 int spinning_secondaries;
@@ -300,6 +300,8 @@ void __init early_setup(unsigned long dt_ptr)
/* Enable early debugging if any specified (see udbg.h) */
udbg_early_init();
 
+   kaslr_early_init(__va(dt_ptr), 0);
+
udbg_printf(" -> %s(), dt_ptr: 0x%lx\n", __func__, dt_ptr);
 
/*
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c 
b/arch/powerpc/mm/nohash/kaslr_booke.c
index 07b036e98353..c6f5c1db1394 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -231,7 +231,7 @@ static __init unsigned long get_usable_address(const void 
*fdt,
unsigned long pa;
unsigned long pa_end;
 
-   for (pa = offset; (long)pa > (long)start; pa -= SZ_16K) {
+   for (pa = offset; (long)pa > (long)start; pa -= SZ_64K) {
pa_end = pa + regions.kernel_size;
if (overlaps_region(fdt, pa, pa_end))
continue;
@@ -265,14 +265,14 @@ 

[PATCH v2 0/6] implement KASLR for powerpc/fsl_booke/64

2020-02-04 Thread Jason Yan
This is an attempt to implement KASLR for Freescale BookE64, based on
my earlier implementation for Freescale BookE32:
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131718

The implementation for Freescale BookE64 is similar to BookE32. One
difference is that Freescale BookE64 sets up a TLB mapping of 1G during
boot. Another difference is that ppc64 needs the kernel to be
64K-aligned. So we can randomize the kernel in this 1G mapping and make
it 64K-aligned. This saves some code that would otherwise be needed to
create another TLB map at early boot. The disadvantage is that we only
have about 1G/64K = 16384 slots to put the kernel in.

KERNELBASE

  64K |--> kernel <--|
   |  |  |
+--+--+--++--+--+--+--+--+--+--+--+--++--+--+
|  |  |  ||  |  |  |  |  |  |  |  |  ||  |  |
+--+--+--++--+--+--+--+--+--+--+--+--++--+--+
| |1G
|->   offset<-|

  kernstart_virt_addr
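
For reference, the slot arithmetic described above boils down to something
like the following (an illustrative sketch only, not the code in this
series; kaslr_offset_in_1g() is a made-up name):

#include <linux/kernel.h>
#include <linux/sizes.h>

/* Pick a 64K-aligned offset for the kernel inside the 1G boot mapping. */
static unsigned long kaslr_offset_in_1g(unsigned long random,
                                        unsigned long kernel_sz)
{
        unsigned long offset;

        /* Stay inside the 1G TLB mapping that is set up at boot. */
        offset = random % (SZ_1G - kernel_sz);

        /* ppc64 wants the kernel 64K-aligned, hence ~16384 slots. */
        return round_down(offset, SZ_64K);
}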

I'm not sure whether the number of slots is enough or whether the design
has any defects. If you have better ideas, I would be happy to hear them.

Thank you all.

v1->v2:
  Add __kaslr_offset for the secondary cpu boot up.

Jason Yan (6):
  powerpc/fsl_booke/kaslr: refactor kaslr_legal_offset() and
kaslr_early_init()
  powerpc/fsl_booke/64: introduce reloc_kernel_entry() helper
  powerpc/fsl_booke/64: implement KASLR for fsl_booke64
  powerpc/fsl_booke/64: do not clear the BSS for the second pass
  powerpc/fsl_booke/64: clear the original kernel if randomized
  powerpc/fsl_booke/kaslr: rename kaslr-booke32.rst to kaslr-booke.rst
and add 64bit part

 .../{kaslr-booke32.rst => kaslr-booke.rst}| 35 +++--
 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/kernel/exceptions-64e.S  | 21 ++
 arch/powerpc/kernel/head_64.S | 14 
 arch/powerpc/kernel/setup_64.c|  4 +-
 arch/powerpc/mm/mmu_decl.h|  3 +-
 arch/powerpc/mm/nohash/kaslr_booke.c  | 71 +--
 7 files changed, 122 insertions(+), 28 deletions(-)
 rename Documentation/powerpc/{kaslr-booke32.rst => kaslr-booke.rst} (59%)

-- 
2.17.2



[PATCH v2 4/6] powerpc/fsl_booke/64: do not clear the BSS for the second pass

2020-02-04 Thread Jason Yan
The BSS section has already been cleared out in the first pass. No need to
clear it again. This saves some time when booting with KASLR
enabled.

Signed-off-by: Jason Yan 
Cc: Scott Wood 
Cc: Diana Craciun 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: Kees Cook 
---
 arch/powerpc/kernel/head_64.S | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index b4ececc4323d..9ae7fd8bbf7c 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -914,6 +914,13 @@ start_here_multiplatform:
bl  relative_toc
tovirt(r2,r2)
 
+   /* Do not clear the BSS for the second pass if randomized */
+   LOAD_REG_ADDR(r3, kernstart_virt_addr)
+   lwz r3,0(r3)
+   LOAD_REG_IMMEDIATE(r4, KERNELBASE)
+   cmpwr3,r4
+   bne 4f
+
/* Clear out the BSS. It may have been done in prom_init,
 * already but that's irrelevant since prom_init will soon
 * be detached from the kernel completely. Besides, we need
-- 
2.17.2



Re: [PATCH 2/5] mm/memremap_pages: Introduce memremap_compat_align()

2020-02-04 Thread Michael Ellerman
Dan Williams  writes:
> The "sub-section memory hotplug" facility allows memremap_pages() users
> like libnvdimm to compensate for hardware platforms like x86 that have a
> section size larger than their hardware memory mapping granularity.  The
> compensation that sub-section support affords is being tolerant of
> physical memory resources shifting by units smaller (64MiB on x86) than
> the memory-hotplug section size (128 MiB). Where the platform
> physical-memory mapping granularity is limited by the number and
> capability of address-decode-registers in the memory controller.
>
> While the sub-section support allows memremap_pages() to operate on
> sub-section (2MiB) granularity, the Power architecture may still
> require 16MiB alignment on "!radix_enabled()" platforms.
>
> In order for libnvdimm to be able to detect and manage this per-arch
> limitation, introduce memremap_compat_align() as a common minimum
> alignment across all driver-facing memory-mapping interfaces, and let
> Power override it to 16MiB in the "!radix_enabled()" case.
>
> The assumption / requirement for 16MiB to be a viable
> memremap_compat_align() value is that Power does not have platforms
> where its equivalent of address-decode-registers never hardware remaps a
> persistent memory resource on smaller than 16MiB boundaries.
>
> Based on an initial patch by Aneesh.
>
> Link: 
> http://lore.kernel.org/r/capcyv4gbgnp95apyabcsocea50tqj9b5h__83vgngjq3oug...@mail.gmail.com
> Reported-by: Aneesh Kumar K.V 
> Reported-by: Jeff Moyer 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Signed-off-by: Dan Williams 
> ---
>  arch/powerpc/include/asm/io.h |   10 ++
>  drivers/nvdimm/pfn_devs.c |2 +-
>  include/linux/io.h|   23 +++
>  include/linux/mmzone.h|1 +
>  4 files changed, 35 insertions(+), 1 deletion(-)

The powerpc change here looks fine to me.

Acked-by: Michael Ellerman  (powerpc)

cheers
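
For completeness, one natural way for a caller on the libnvdimm side to
consume the new helper (a hedged sketch; the function name is illustrative,
only memremap_compat_align() and IS_ALIGNED() are from the patch/kernel):

#include <linux/io.h>
#include <linux/kernel.h>

/* Reject a start address the platform cannot remap at its minimum
 * granularity (SUBSECTION_SIZE on radix, 16MiB on hash). */
static bool pfn_start_align_ok(resource_size_t start)
{
        return IS_ALIGNED(start, memremap_compat_align());
}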

> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index a63ec938636d..0fa2dc483008 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -734,6 +734,16 @@ extern void __iomem * __ioremap_at(phys_addr_t pa, void 
> *ea,
>  unsigned long size, pgprot_t prot);
>  extern void __iounmap_at(void *ea, unsigned long size);
>  
> +#ifdef CONFIG_SPARSEMEM
> +static inline unsigned long memremap_compat_align(void)
> +{
> + if (radix_enabled())
> + return SUBSECTION_SIZE;
> + return (1UL << mmu_psize_defs[mmu_linear_psize].shift);
> +}
> +#define memremap_compat_align memremap_compat_align
> +#endif
> +
>  /*
>   * When CONFIG_PPC_INDIRECT_PIO is set, we use the generic iomap 
> implementation
>   * which needs some additional definitions here. They basically allow PIO


Re: [RFC] per-CPU usage in perf core-book3s

2020-02-04 Thread maddy




On 1/27/20 8:36 PM, Sebastian Andrzej Siewior wrote:

I've been looking at usage of per-CPU variable cpu_hw_events in
arch/powerpc/perf/core-book3s.c.

power_pmu_enable() and power_pmu_disable() (pmu::pmu_enable() and
pmu::pmu_disable()) are accessing the variable and the callbacks are
invoked always with disabled interrupts.

power_pmu_event_init() (pmu::event_init()) is invoked from preemptible
context and uses get_cpu_var() to obtain a stable pointer (by disabling
preemption).

pmu::pmu_enable() and pmu::pmu_disable() can be invoked via a hrtimer
(perf_mux_hrtimer_handler()) and it invokes pmu::pmu_enable() and
pmu::pmu_disable() as part of the callback.

Is there anything that prevents the timer callback to interrupt
pmu::event_init() while it is accessing per-CPU data?


Sorry for the delayed response.

Yes, currently we don't have anything that prevents the timer
callback from interrupting pmu::event_init. Nice catch. Thanks for
pointing this out.

Looking at the code, the per-CPU variable accesses are made to
check for constraints and for Branch Stack (BHRB). So we could
wrap this block of pmu::event_init with local_irq_save/restore.
Will send a patch to fix it.
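
Roughly along these lines (a sketch of the idea only, not the actual
patch; the constraint/BHRB details are elided):

static int power_pmu_event_init(struct perf_event *event)
{
        struct cpu_hw_events *cpuhw;
        unsigned long irq_flags;
        int err = 0;

        /* ... attr/flags validation unchanged ... */

        /*
         * Keep the hrtimer-driven pmu_disable()/pmu_enable() callbacks
         * from running while we look at the per-CPU constraint and
         * BHRB state.
         */
        local_irq_save(irq_flags);
        cpuhw = this_cpu_ptr(&cpu_hw_events);

        /* ... constraint checks and BHRB filter validation using cpuhw ... */

        local_irq_restore(irq_flags);
        return err;
}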


Maddy



Sebastian




Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread Baoquan He
On 10/06/19 at 10:56am, David Hildenbrand wrote:
> If we have holes, the holes will automatically get detected and removed
> once we remove the next bigger/smaller section. The extra checks can
> go.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 34 +++---
>  1 file changed, 7 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f294918f7211..8dafa1ba8d9f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
> long start_pfn,
>   if (pfn) {
>   zone->zone_start_pfn = pfn;
>   zone->spanned_pages = zone_end_pfn - pfn;
> + } else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
>   }
>   } else if (zone_end_pfn == end_pfn) {
>   /*
> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
> unsigned long start_pfn,
>  start_pfn);
>   if (pfn)
>   zone->spanned_pages = pfn - zone_start_pfn + 1;
> + else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;

Thinking in which case (zone_start_pfn != start_pfn) and it comes here.

> + }
>   }
> -
> - /*
> -  * The section is not biggest or smallest mem_section in the zone, it
> -  * only creates a hole in the zone. So in this case, we need not
> -  * change the zone. But perhaps, the zone has only hole data. Thus
> -  * it check the zone has only hole or not.
> -  */
> - pfn = zone_start_pfn;
> - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
> - if (unlikely(!pfn_to_online_page(pfn)))
> - continue;
> -
> - if (page_zone(pfn_to_page(pfn)) != zone)
> - continue;
> -
> - /* Skip range to be removed */
> - if (pfn >= start_pfn && pfn < end_pfn)
> - continue;
> -
> - /* If we find valid section, we have nothing to do */
> - zone_span_writeunlock(zone);
> - return;
> - }
> -
> - /* The zone has no valid section */
> - zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
>   zone_span_writeunlock(zone);
>  }
>  
> -- 
> 2.21.0
> 
> 



Re: [PATCH] powerpc/drmem: cache LMBs in xarray to accelerate lookup

2020-02-04 Thread Scott Cheloha
On Tue, Jan 28, 2020 at 05:56:55PM -0600, Nathan Lynch wrote:
> Scott Cheloha  writes:
> > LMB lookup is currently an O(n) linear search.  This scales poorly when
> > there are many LMBs.
> >
> > If we cache each LMB by both its base address and its DRC index
> > in an xarray we can cut lookups to O(log n), greatly accelerating
> > drmem initialization and memory hotplug.
> >
> > This patch introduces two xarrays of LMBs and fills them during
> > drmem initialization.  The patch also adds two interfaces for LMB
> > lookup.
> 
> Good but can you replace the array of LMBs altogether
> (drmem_info->lmbs)? xarray allows iteration over the members if needed.

I would like to try to "solve one problem at a time".

We can fix the linear search performance scaling problems without
removing the array of LMBs.  As I've shown in my diff, we can do it
with minimal change to the existing code.

If it turns out that the PAPR guarantees the ordering of the memory
DRCs then in a subsequent patch (series) we can replace the LMB array
(__drmem_info.lmbs) with an xarray indexed by DRC and use e.g.
xa_for_each() in the hotplug code.
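
For reference, the caching/lookup side of what the patch describes amounts
to something like this (names here are illustrative, not necessarily the
ones in the posted patch):

#include <linux/xarray.h>
#include <asm/drmem.h>

static DEFINE_XARRAY(drmem_lmbs_by_drc);

static int drmem_cache_lmb(struct drmem_lmb *lmb)
{
        /* Index the LMB by its DRC index so hotplug lookups avoid the
         * O(n) scan of drmem_info->lmbs. */
        return xa_err(xa_store(&drmem_lmbs_by_drc, lmb->drc_index,
                               lmb, GFP_KERNEL));
}

static struct drmem_lmb *drmem_find_lmb_by_drc_index(u32 drc_index)
{
        return xa_load(&drmem_lmbs_by_drc, drc_index);
}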


Re: [PATCH] libnvdimm/of_pmem: Fix leaking bus_desc.provider_name in some paths

2020-02-04 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> String 'bus_desc.provider_name' allocated inside
> of_pmem_region_probe() will leak in case call to nvdimm_bus_register()
> fails or when of_pmem_region_remove() is called.
>
> This minor patch ensures that 'bus_desc.provider_name' is freed in
> error path for of_pmem_region_probe() as well as in
> of_pmem_region_remove().
>

Reviewed-by: Aneesh Kumar K.V 

> Cc: sta...@vger.kernel.org
> Fixes: 49bddc73d15c2 ("libnvdimm/of_pmem: Provide a unique name for bus 
> provider")
> Signed-off-by: Vaibhav Jain 
> ---
>  drivers/nvdimm/of_pmem.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
> index 8224d1431ea9..9cb76f9837ad 100644
> --- a/drivers/nvdimm/of_pmem.c
> +++ b/drivers/nvdimm/of_pmem.c
> @@ -36,6 +36,7 @@ static int of_pmem_region_probe(struct platform_device 
> *pdev)
>  
>   priv->bus = bus = nvdimm_bus_register(&pdev->dev, &priv->bus_desc);
>   if (!bus) {
> + kfree(priv->bus_desc.provider_name);
>   kfree(priv);
>   return -ENODEV;
>   }
> @@ -81,6 +82,7 @@ static int of_pmem_region_remove(struct platform_device 
> *pdev)
>   struct of_pmem_private *priv = platform_get_drvdata(pdev);
>  
>   nvdimm_bus_unregister(priv->bus);
> + kfree(priv->bus_desc.provider_name);
>   kfree(priv);
>  
>   return 0;
> -- 
> 2.24.1
> ___
> Linux-nvdimm mailing list -- linux-nvd...@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread David Hildenbrand
On 04.02.20 15:25, Baoquan He wrote:
> On 10/06/19 at 10:56am, David Hildenbrand wrote:
>> If we have holes, the holes will automatically get detected and removed
>> once we remove the next bigger/smaller section. The extra checks can
>> go.
>>
>> Cc: Andrew Morton 
>> Cc: Oscar Salvador 
>> Cc: Michal Hocko 
>> Cc: David Hildenbrand 
>> Cc: Pavel Tatashin 
>> Cc: Dan Williams 
>> Cc: Wei Yang 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 34 +++---
>>  1 file changed, 7 insertions(+), 27 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index f294918f7211..8dafa1ba8d9f 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
>> long start_pfn,
>>  if (pfn) {
>>  zone->zone_start_pfn = pfn;
>>  zone->spanned_pages = zone_end_pfn - pfn;
>> +} else {
>> +zone->zone_start_pfn = 0;
>> +zone->spanned_pages = 0;
>>  }
>>  } else if (zone_end_pfn == end_pfn) {
>>  /*
>> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, 
>> unsigned long start_pfn,
>> start_pfn);
>>  if (pfn)
>>  zone->spanned_pages = pfn - zone_start_pfn + 1;
>> +else {
>> +zone->zone_start_pfn = 0;
>> +zone->spanned_pages = 0;
> 
> Thinking in which case (zone_start_pfn != start_pfn) and it comes here.

Could only happen in case the zone_start_pfn would have been "out of the
zone already". If you ask me: unlikely :)

This change at least maintains the same result as before (where the
all-holes check would have caught it).

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-04 Thread David Hildenbrand
On 04.02.20 14:13, Segher Boessenkool wrote:
> On Tue, Feb 04, 2020 at 01:41:06PM +0100, David Hildenbrand wrote:
>> On 04.02.20 10:46, Oscar Salvador wrote:
>>> I have to confess that it took me while to wrap around my head
>>> with the new min() change, but looks ok:
>>
>> It's a pattern commonly used in compilers and emulators to calculate the
>> number of bytes to the next block/alignment. (we're missing a macro
>> (like we have ALIGN_UP/IS_ALIGNED) for that - but it's hard to come up
>> with a good name (e.g., SIZE_TO_NEXT_ALIGN) .
> 
> You can just write the easy to understand
> 
>   ...  ALIGN_UP(x) - x  ...

you mean

ALIGN_UP(x, PAGES_PER_SECTION) - x

but ...

> 
> which is better *without* having a separate name.  Does that not
> generate good machine code for you?

1. There is no ALIGN_UP. "SECTION_ALIGN_UP(x) - x" would be possible
2. It would be wrong if x is already aligned.

e.g., let's use 4096 for simplicity as we all know that value by heart
(for both x and the block size).

a) -(4096 | -4096) -> 4096

b) #define ALIGN_UP(x, a) ((x + a - 1) & -(a))

ALIGN_UP(4096, 4096) - 4096 -> 0

Not as easy as it seems ...
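
A quick standalone illustration of the two expressions (plain C, just to
show the already-aligned corner case):

#include <stdio.h>

#define ALIGN_UP(x, a)  (((x) + (a) - 1) & -(a))

int main(void)
{
        unsigned long a = 4096;
        unsigned long vals[] = { 4096, 5000 };

        for (int i = 0; i < 2; i++) {
                unsigned long x = vals[i];

                /* -(x | -a) gives the distance to the next boundary but
                 * a full block when x is already aligned; ALIGN_UP - x
                 * gives 0 there. */
                printf("x=%lu: -(x | -a) = %lu, ALIGN_UP(x, a) - x = %lu\n",
                       x, -(x | -a), ALIGN_UP(x, a) - x);
        }
        return 0;
}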

-- 
Thanks,

David / dhildenb



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.6-1 tag

2020-02-04 Thread pr-tracker-bot
The pull request you sent on Tue, 04 Feb 2020 23:10:55 +1100:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-5.6-1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/71c3a888cbcaf453aecf8d2f8fb003271d28073f

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-04 Thread Segher Boessenkool
On Tue, Feb 04, 2020 at 01:41:06PM +0100, David Hildenbrand wrote:
> On 04.02.20 10:46, Oscar Salvador wrote:
> > I have to confess that it took me while to wrap around my head
> > with the new min() change, but looks ok:
> 
> It's a pattern commonly used in compilers and emulators to calculate the
> number of bytes to the next block/alignment. (we're missing a macro
> (like we have ALIGN_UP/IS_ALIGNED) for that - but it's hard to come up
> with a good name (e.g., SIZE_TO_NEXT_ALIGN) .

You can just write the easy to understand

  ...  ALIGN_UP(x) - x  ...

which is better *without* having a separate name.  Does that not
generate good machine code for you?


Segher


Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-04 Thread David Hildenbrand
On 04.02.20 10:46, Oscar Salvador wrote:
> On Sun, Oct 06, 2019 at 10:56:46AM +0200, David Hildenbrand wrote:
>> Let's drop the basically unused section stuff and simplify.
>>
>> Also, let's use a shorter variant to calculate the number of pages to
>> the next section boundary.
>>
>> Cc: Andrew Morton 
>> Cc: Oscar Salvador 
>> Cc: Michal Hocko 
>> Cc: Pavel Tatashin 
>> Cc: Dan Williams 
>> Cc: Wei Yang 
>> Signed-off-by: David Hildenbrand 
> 
> I have to confess that it took me while to wrap around my head
> with the new min() change, but looks ok:

It's a pattern commonly used in compilers and emulators to calculate the
number of bytes to the next block/alignment. (we're missing a macro
(like we have ALIGN_UP/IS_ALIGNED) for that - but it's hard to come up
with a good name (e.g., SIZE_TO_NEXT_ALIGN) .

-- 
Thanks,

David / dhildenb



[GIT PULL] Please pull powerpc/linux.git powerpc-5.6-1 tag

2020-02-04 Thread Michael Ellerman
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Hi Linus,

Please pull powerpc updates for 5.6.

A pretty small batch for us, and apologies for it being a bit late; I wanted to
sneak Christophe's user_access_begin() series in.

No conflicts or other issues I'm aware of.

cheers


The following changes since commit c79f46a282390e0f5b306007bf7b11a46d529538:

  Linux 5.5-rc5 (2020-01-05 14:23:27 -0800)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-5.6-1

for you to fetch changes up to 4c25df5640ae6e4491ee2c50d3f70c1559ef037d:

  Merge branch 'topic/user-access-begin' into next (2020-02-01 21:47:17 +1100)

- --
powerpc updates for 5.6

 - Implement user_access_begin() and friends for our platforms that support
   controlling kernel access to userspace.

 - Enable CONFIG_VMAP_STACK on 32-bit Book3S and 8xx.

 - Some tweaks to our pseries IOMMU code to allow SVMs ("secure" virtual
   machines) to use the IOMMU.

 - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 32-bit VDSO, and
   some other improvements.

 - A series to use the PCI hotplug framework to control opencapi cards so that
   they can be reset and re-read after flashing a new FPGA image.

As well as other minor fixes and improvements as usual.

Thanks to:
 Alastair D'Silva, Alexandre Ghiti, Alexey Kardashevskiy, Andrew Donnellan,
 Aneesh Kumar K.V, Anju T Sudhakar, Bai Yingjie, Chen Zhou, Christophe Leroy,
 Frederic Barrat, Greg Kurz, Jason A. Donenfeld, Joel Stanley, Jordan Niethe,
 Julia Lawall, Krzysztof Kozlowski, Laurent Dufour, Laurentiu Tudor, Linus
 Walleij, Michael Bringmann, Nathan Chancellor, Nicholas Piggin, Nick
 Desaulniers, Oliver O'Halloran, Peter Ujfalusi, Pingfan Liu, Ram Pai, Randy
 Dunlap, Russell Currey, Sam Bobroff, Sebastian Andrzej Siewior, Shawn
 Anastasio, Stephen Rothwell, Steve Best, Sukadev Bhattiprolu, Thiago Jung
 Bauermann, Tyrel Datwyler, Vaibhav Jain.

- --
Alexandre Ghiti (1):
  powerpc: Do not consider weak unresolved symbol relocations as bad

Alexey Kardashevskiy (3):
  powerpc/pseries: Allow not having ibm,hypertas-functions::hcall-multi-tce for DDW
  powerpc/pseries/iommu: Separate FW_FEATURE_MULTITCE to put/stuff features
  powerpc/pseries/svm: Allow IOMMU to work in SVM

Aneesh Kumar K.V (2):
  powerpc/papr_scm: Update debug message
  powerpc/papr_scm: Don't enable direct map for a region by default

Anju T Sudhakar (1):
  powerpc/imc: Add documentation for IMC and trace-mode

Bai Yingjie (2):
  powerpc32/booke: consistently return phys_addr_t in __pa()
  powerpc/mpc85xx: also write addr_h to spin table for 64bit boot entry

Chen Zhou (1):
  powerpc/maple: Fix comparing pointer to 0

Christophe Leroy (47):
  powerpc/ptdump: don't entirely rebuild kernel when selecting 
CONFIG_PPC_DEBUG_WX
  powerpc/ptdump: Fix W+X verification call in mark_rodata_ro()
  powerpc/ptdump: Fix W+X verification
  powerpc/ptdump: Only enable PPC_CHECK_WX with STRICT_KERNEL_RWX
  powerpc/8xx: Fix permanently mapped IMMR region.
  powerpc/hw_breakpoints: Rewrite 8xx breakpoints to allow any address 
range size.
  selftests/powerpc: Enable range tests on 8xx in ptrace-hwbreak.c selftest
  powerpc/devicetrees: Change 'gpios' to 'cs-gpios' on fsl, spi nodes
  powerpc/32: Add VDSO version of getcpu on non SMP
  powerpc/vdso32: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE
  powerpc/vdso32: inline __get_datapage()
  powerpc/vdso32: Don't read cache line size from the datapage on PPC32.
  powerpc/vdso32: use LOAD_REG_IMMEDIATE()
  powerpc/vdso32: implement clock_getres entirely
  powerpc/vdso32: miscellaneous optimisations
  powerpc: use probe_user_read() and probe_user_write()
  powerpc/32: replace MTMSRD() by mtmsr
  powerpc/32: Add EXCEPTION_PROLOG_0 in head_32.h
  powerpc/32: save DEAR/DAR before calling handle_page_fault
  powerpc/32: move MSR_PR test into EXCEPTION_PROLOG_0
  powerpc/32: add a macro to get and/or save DAR and DSISR on stack.
  powerpc/32: prepare for CONFIG_VMAP_STACK
  powerpc: align stack to 2 * THREAD_SIZE with VMAP_STACK
  powerpc/32: Add early stack overflow detection with VMAP stack.
  powerpc/32: Use vmapped stacks for interrupts
  powerpc/8xx: Use alternative scratch registers in DTLB miss handler
  powerpc/8xx: Drop exception entries for non-existing exceptions
  powerpc/8xx: Move DataStoreTLBMiss perf handler
  powerpc/8xx: Split breakpoint exception
  powerpc/8xx: Enable CONFIG_VMAP_STACK
  powerpc/32s: Reorganise DSI handler.
  powerpc/32s: Avoid crossing page boundary while changing SRR0/1.
  powerpc/32s: Enable CONFIG_VMAP_STACK
  powerpc/mm: Don't log user reads to 0x
  powerpc/32: Add support of 

Re: [PATCH v4 2/7] powerpc/32s: Fix bad_kuap_fault()

2020-02-04 Thread Michael Ellerman
On Fri, 2020-01-24 at 11:54:40 UTC, Christophe Leroy wrote:
> At the moment, bad_kuap_fault() reports a fault only if a bad access
> to userspace occurred while access to userspace was not granted.
> 
> But if a fault occurs for a write outside the allowed userspace
> segment(s) that have been unlocked, bad_kuap_fault() fails to
> detect it and the kernel loops forever in do_page_fault().
> 
> Fix it by checking that the accessed address is within the allowed
> range.
> 
> Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access 
> Protection")
> Cc: sta...@vger.kernel.org # v5.2+
> Signed-off-by: Christophe Leroy 
> Signed-off-by: Michael Ellerman 
> Link: 
> https://lore.kernel.org/r/1e07c7de4ffdd9cda35d1ffe8258af75579d3e91.1579715466.git.christophe.leroy@c-s.fr
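
For illustration, here is a minimal sketch of the kind of range check the
patch description talks about. The helper and parameter names are hypothetical
and do not reflect the actual book3s/32 kup.h layout:

#include <stdbool.h>

/*
 * Sketch only: on a write fault, report a bad KUAP fault unless userspace
 * access had been unlocked for a range covering the faulting address.
 */
static bool sketch_bad_kuap_fault(unsigned long addr, bool is_write,
                                  unsigned long unlocked_begin,
                                  unsigned long unlocked_end)
{
        if (!is_write)
                return false;

        /* A write outside the unlocked segment(s) must be treated as bad. */
        return addr < unlocked_begin || addr >= unlocked_end;
}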

Patches 2-7 applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6ec20aa2e510b6297906c45f009aa08b2d97269a

cheers


Re: [PATCH] powerpc: configs: Cleanup old Kconfig options

2020-02-04 Thread Michael Ellerman
On Thu, 2020-01-30 at 19:52:23 UTC, Krzysztof Kozlowski wrote:
> CONFIG_ENABLE_WARN_DEPRECATED is gone since
> commit 771c035372a0 ("deprecate the '__deprecated' attribute warnings
> entirely and for good").
> 
> CONFIG_IOSCHED_DEADLINE and CONFIG_IOSCHED_CFQ are gone since
> commit f382fb0bcef4 ("block: remove legacy IO schedulers").
> 
> The IOSCHED_DEADLINE was replaced by MQ_IOSCHED_DEADLINE and it will be
> now enabled by default (along with MQ_IOSCHED_KYBER).
> 
> Signed-off-by: Krzysztof Kozlowski 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/34b5a946a9543ce38d8ad1aacc4362533a813db7

cheers


Re: [PATCH] powerpc/32s: Fix kasan_early_hash_table() for CONFIG_VMAP_STACK

2020-02-04 Thread Michael Ellerman
On Wed, 2020-01-29 at 12:34:36 UTC, Christophe Leroy wrote:
> On book3s/32 CPUs that are handling MMU through a hash table,
> MMU_init_hw() function was adapted for VMAP_STACK in order to
> handle virtual addresses instead of physical addresses in the
> low level hash functions.
> 
> When using KASAN, the same adaptations are required for the
> early hash table set up by kasan_early_hash_table() function.
> 
> Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/41196224883a64e56e0ef237c19eb837058df071

cheers


Re: [PATCH RESEND] powerpc: indent to improve Kconfig readability

2020-02-04 Thread Michael Ellerman
On Wed, 2020-01-29 at 02:22:25 UTC, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Indent a Kconfig continuation line to improve readability.
> 
> Signed-off-by: Randy Dunlap 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/76be4414be4a0d17e29e2337167bf976533149cd

cheers


Re: [PATCH] powerpc/32s: Fix CPU wake-up from sleep mode

2020-02-04 Thread Michael Ellerman
On Mon, 2020-01-27 at 10:42:04 UTC, Christophe Leroy wrote:
> Commit f7354ccac844 ("powerpc/32: Remove CURRENT_THREAD_INFO and
> rename TI_CPU") broke the CPU wake-up from sleep mode (i.e. when
> _TLF_SLEEPING is set) by delaying the tovirt(r2, r2).
> 
> This is because r2 is not restored by fast_exception_return. It used
> to work (by chance ?) because CPU wake-up interrupt never comes from
> user, so r2 is expected to point to 'current' on return.
> 
> Commit e2fb9f544431 ("powerpc/32: Prepare for Kernel Userspace Access
> Protection") broke it even more by clobbering r0 which is not
> restored by fast_exception_return either.
> 
> Use r6 instead of r0. This is possible because r3-r6 are restored by
> fast_exception_return and only r3-r5 are used for exception arguments.
> 
> For r2 it could be converted back to virtual address, but stay on the
> safe side and restore it from the stack instead. It should be live
> in the cache at that moment, so loading from the stack should make
> no difference compared to converting it from phys to virt.
> 
> Fixes: f7354ccac844 ("powerpc/32: Remove CURRENT_THREAD_INFO and rename 
> TI_CPU")
> Fixes: e2fb9f544431 ("powerpc/32: Prepare for Kernel Userspace Access 
> Protection")
> Cc: stable@vger.kernel.org
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9933819099c4600b41a042f27a074470a43cf6b9

cheers


Re: [PATCH v2 01/10] powerpc/configs: Drop CONFIG_QLGE which moved to staging

2020-02-04 Thread Michael Ellerman
On Tue, 2020-01-21 at 04:29:51 UTC, Michael Ellerman wrote:
> The QLGE driver moved to staging in commit 955315b0dc8c ("qlge: Move
> drivers/net/ethernet/qlogic/qlge/ to drivers/staging/qlge/"), meaning
> our defconfigs that enable it have no effect as we don't enable
> CONFIG_STAGING.
> 
> It sounds like the device is obsolete, so drop the driver.
> 
> Signed-off-by: Michael Ellerman 

Patches 1-9 applied to powerpc next.

https://git.kernel.org/powerpc/c/76e4bd93369b87d97c2b1bcd6e754a89f422235b

cheers


Re: [PATCH v2] powerpc: Do not consider weak unresolved symbol relocations as bad

2020-02-04 Thread Michael Ellerman
On Sat, 2020-01-18 at 17:03:35 UTC, Alexandre Ghiti wrote:
> Commit 8580ac9404f6 ("bpf: Process in-kernel BTF") introduced two weak
> symbols that may be unresolved at link time which result in an absolute
> relocation to 0. relocs_check.sh emits the following warning:
> 
> "WARNING: 2 bad relocations
> c1a41478 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_start
> c1a41480 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_end"
> 
> whereas those relocations are legitimate even for a relocatable kernel
> compiled with -pie option.
> 
> relocs_check.sh already excluded some weak unresolved symbols explicitly:
> remove those hardcoded symbols and add some logic that parses the symbols
> using nm, retrieves all the weak unresolved symbols and excludes those from
> the list of the potential bad relocations.
> 
> Reported-by: Stephen Rothwell 
> Signed-off-by: Alexandre Ghiti 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/43e76cd368fbb67e767da5363ffeaa3989993c8c

cheers


Re: [DOC][PATCH v2] powerpc: Provide initial documentation for PAPR hcalls

2020-02-04 Thread Michael Ellerman
On Wed, 2019-08-28 at 08:27:29 UTC, Vaibhav Jain wrote:
> This doc patch provides an initial description of the hcall op-codes
> that are used by the Linux kernel running as a guest (LPAR) on top of
> PowerVM or any other sPAPR-compliant hypervisor (e.g. QEMU).
> 
> Apart from documenting the hcalls, the doc-patch also provides a
> rudimentary overview of the hcall ABI, how hcalls are issued from the
> Linux kernel, and how information/control flows between the guest and
> the hypervisor.
> 
> Signed-off-by: Vaibhav Jain 
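
To give a flavour of what the document covers, a minimal example of issuing an
hcall from the pseries guest side is sketched below, using the existing
plpar_hcall_norets() wrapper; H_CEDE is just one convenient opcode among the
many the document describes:

/* Kernel-side sketch, pseries only. */
#include <asm/hvcall.h>         /* hcall opcodes and plpar_hcall_norets() */

/*
 * Cede the virtual processor to the hypervisor.  The wrapper puts the opcode
 * in r3, executes "sc 1" and returns the hypervisor status (H_SUCCESS is 0).
 */
static long example_cede(void)
{
        return plpar_hcall_norets(H_CEDE);
}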

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/58b278f568f0509497e2df7310bfd719156a60d1

cheers


Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory

2020-02-04 Thread Oscar Salvador
On Tue, Feb 04, 2020 at 09:45:24AM +0100, David Hildenbrand wrote:
> I really hope we'll find more reviewers in general - I'm also not happy
> if my patches go upstream with little/no review. However, patches
> shouldn't be stuck for multiple merge windows in linux-next IMHO
> (excluding exceptions of course) - then they should either be sent
> upstream (and eventually fixed later) or dropped.

First of all, sorry for my lack of review; lately I have been a bit
disconnected from the list because of lack of time.

Luckily I managed to find some time, so I went through the patches that
lacked review (#6-#10).

I hope this helps move the series forward, although Michal's review would be
great as well.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:46AM +0200, David Hildenbrand wrote:
> Let's drop the basically unused section stuff and simplify.
> 
> Also, let's use a shorter variant to calculate the number of pages to
> the next section boundary.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

I have to confess that it took me a while to wrap my head around the new
min() change, but it looks ok:

Reviewed-by: Oscar Salvador 
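
For anyone else puzzling over the same line, a small self-contained demo of why
-(pfn | PAGE_SECTION_MASK) equals the number of pages up to the next section
boundary (PAGES_PER_SECTION is assumed to be 256 here purely for illustration):

#include <assert.h>
#include <stdio.h>

#define PAGES_PER_SECTION 256UL                        /* assumed for the demo */
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION - 1))   /* as in the kernel */

int main(void)
{
        unsigned long pfn;

        for (pfn = 0; pfn < 4 * PAGES_PER_SECTION; pfn++) {
                /* Old expression: pages from pfn up to the next boundary. */
                unsigned long old_cnt =
                        PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK);
                /*
                 * New expression: ORing in PAGE_SECTION_MASK sets every bit
                 * above the in-section offset, so the (modular) negation
                 * yields PAGES_PER_SECTION minus that offset.
                 */
                unsigned long new_cnt = -(pfn | PAGE_SECTION_MASK);

                assert(old_cnt == new_cnt);
        }
        printf("old and new expressions agree for all tested pfns\n");
        return 0;
}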

> ---
>  mm/memory_hotplug.c | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 843481bd507d..2275240cfa10 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -490,25 +490,20 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
>  void __remove_pages(unsigned long pfn, unsigned long nr_pages,
>   struct vmem_altmap *altmap)
>  {
> + const unsigned long end_pfn = pfn + nr_pages;
> + unsigned long cur_nr_pages;
>   unsigned long map_offset = 0;
> - unsigned long nr, start_sec, end_sec;
>  
>   map_offset = vmem_altmap_offset(altmap);
>  
>   if (check_pfn_span(pfn, nr_pages, "remove"))
>   return;
>  
> - start_sec = pfn_to_section_nr(pfn);
> - end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> - for (nr = start_sec; nr <= end_sec; nr++) {
> - unsigned long pfns;
> -
> + for (; pfn < end_pfn; pfn += cur_nr_pages) {
>   cond_resched();
> - pfns = min(nr_pages, PAGES_PER_SECTION
> - - (pfn & ~PAGE_SECTION_MASK));
> - __remove_section(pfn, pfns, map_offset, altmap);
> - pfn += pfns;
> - nr_pages -= pfns;
> + /* Select all remaining pages up to the next section boundary */
> + cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
> + __remove_section(pfn, cur_nr_pages, map_offset, altmap);
>   map_offset = 0;
>   }
>  }
> -- 
> 2.21.0
> 
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()

2020-02-04 Thread David Hildenbrand
On 04.02.20 10:26, Oscar Salvador wrote:
> On Sun, Oct 06, 2019 at 10:56:45AM +0200, David Hildenbrand wrote:
>> Get rid of the unnecessary local variables.
>>
>> Cc: Andrew Morton 
>> Cc: Oscar Salvador 
>> Cc: David Hildenbrand 
>> Cc: Michal Hocko 
>> Cc: Pavel Tatashin 
>> Cc: Dan Williams 
>> Cc: Wei Yang 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 15 ++-
>>  1 file changed, 6 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 8dafa1ba8d9f..843481bd507d 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>>  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>>   unsigned long end_pfn)
>>  {
>> -unsigned long zone_start_pfn = zone->zone_start_pfn;
>> -unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
>> -unsigned long zone_end_pfn = z;
>>  unsigned long pfn;
>>  int nid = zone_to_nid(zone);
> 
> We could also remove the nid, right?
> AFAICS, the nid is only used in find_{smallest/biggest}_section_pfn, so we
> could compute it there instead.


I remember sending a patch on this (which was acked, but not picked up
yet)...


oh, there it is :)

https://lore.kernel.org/linux-mm/20191127174158.28226-1-david@redhat.com/

Thanks!

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:45AM +0200, David Hildenbrand wrote:
> Get rid of the unnecessary local variables.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8dafa1ba8d9f..843481bd507d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>unsigned long end_pfn)
>  {
> - unsigned long zone_start_pfn = zone->zone_start_pfn;
> - unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
> - unsigned long zone_end_pfn = z;
>   unsigned long pfn;
>   int nid = zone_to_nid(zone);

We could also remove the nid, right?
AFAICS, the nid is only used in find_{smallest/biggest}_section_pfn, so we
could compute it there instead.

Anyway, nothing to nit-pick about:

Reviewed-by: Oscar Salvador 

>  
>   zone_span_writelock(zone);
> - if (zone_start_pfn == start_pfn) {
> + if (zone->zone_start_pfn == start_pfn) {
>   /*
>* If the section is smallest section in the zone, it need
>* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
> @@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>* for shrinking zone.
>*/
>   pfn = find_smallest_section_pfn(nid, zone, end_pfn,
> - zone_end_pfn);
> + zone_end_pfn(zone));
>   if (pfn) {
> + zone->spanned_pages = zone_end_pfn(zone) - pfn;
>   zone->zone_start_pfn = pfn;
> - zone->spanned_pages = zone_end_pfn - pfn;
>   } else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
>   }
> - } else if (zone_end_pfn == end_pfn) {
> + } else if (zone_end_pfn(zone) == end_pfn) {
>   /*
>* If the section is biggest section in the zone, it need
>* shrink zone->spanned_pages.
>* In this case, we find second biggest valid mem_section for
>* shrinking zone.
>*/
> - pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
> + pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
>  start_pfn);
>   if (pfn)
> - zone->spanned_pages = pfn - zone_start_pfn + 1;
> + zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
>   else {
>   zone->zone_start_pfn = 0;
>   zone->spanned_pages = 0;
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread David Hildenbrand
On 04.02.20 10:13, Oscar Salvador wrote:
> On Sun, Oct 06, 2019 at 10:56:44AM +0200, David Hildenbrand wrote:
>> If we have holes, the holes will automatically get detected and removed
>> once we remove the next bigger/smaller section. The extra checks can
>> go.
>>
>> Cc: Andrew Morton 
>> Cc: Oscar Salvador 
>> Cc: Michal Hocko 
>> Cc: David Hildenbrand 
>> Cc: Pavel Tatashin 
>> Cc: Dan Williams 
>> Cc: Wei Yang 
>> Signed-off-by: David Hildenbrand 
> 
> Heh, I have been here before.
> I have to confess that when I wrote my version of this I was not really 100%
> sure about removing it, because hotplug has been a sort of catch-all for all
> sorts of weird and corner-case configurations. But thinking more about it, I
> cannot think of any situation that would make this blow up.
> 
> Reviewed-by: Oscar Salvador 

Thanks for your review Oscar!

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:44AM +0200, David Hildenbrand wrote:
> If we have holes, the holes will automatically get detected and removed
> once we remove the next bigger/smaller section. The extra checks can
> go.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Heh, I have been here before.
I have to confess that when I wrote my version of this I was not really 100%
sure about removing it, because hotplug has been a sort of catch-all for all
sorts of weird and corner-case configurations. But thinking more about it, I
cannot think of any situation that would make this blow up.

Reviewed-by: Oscar Salvador 

> ---
>  mm/memory_hotplug.c | 34 +++---
>  1 file changed, 7 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f294918f7211..8dafa1ba8d9f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>   if (pfn) {
>   zone->zone_start_pfn = pfn;
>   zone->spanned_pages = zone_end_pfn - pfn;
> + } else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
>   }
>   } else if (zone_end_pfn == end_pfn) {
>   /*
> @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  start_pfn);
>   if (pfn)
>   zone->spanned_pages = pfn - zone_start_pfn + 1;
> + else {
> + zone->zone_start_pfn = 0;
> + zone->spanned_pages = 0;
> + }
>   }
> -
> - /*
> -  * The section is not biggest or smallest mem_section in the zone, it
> -  * only creates a hole in the zone. So in this case, we need not
> -  * change the zone. But perhaps, the zone has only hole data. Thus
> -  * it check the zone has only hole or not.
> -  */
> - pfn = zone_start_pfn;
> - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
> - if (unlikely(!pfn_to_online_page(pfn)))
> - continue;
> -
> - if (page_zone(pfn_to_page(pfn)) != zone)
> - continue;
> -
> - /* Skip range to be removed */
> - if (pfn >= start_pfn && pfn < end_pfn)
> - continue;
> -
> - /* If we find valid section, we have nothing to do */
> - zone_span_writeunlock(zone);
> - return;
> - }
> -
> - /* The zone has no valid section */
> - zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
>   zone_span_writeunlock(zone);
>  }
>  
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 0/3] pseries: Track and expose idle PURR and SPURR ticks

2020-02-04 Thread Kamalesh Babulal
On 12/6/19 2:44 PM, Naveen N. Rao wrote:
> Naveen N. Rao wrote:
>> Hi Nathan,
>>
>> Nathan Lynch wrote:
>>> Hi Kamalesh,
>>>
>>> Kamalesh Babulal  writes:
 On 12/5/19 3:54 AM, Nathan Lynch wrote:
> "Gautham R. Shenoy"  writes:
>>
>> Tools such as lparstat which are used to compute the utilization need
>> to know [S]PURR ticks when the cpu was busy or idle. The [S]PURR
>> counters are already exposed through sysfs.  We already account for
>> PURR ticks when we go to idle so that we can update the VPA area. This
>> patchset extends support to account for SPURR ticks when idle, and
>> expose both via per-cpu sysfs files.
>
> Does anything really want to use PURR instead of SPURR? Seems like we
> should expose only SPURR idle values if possible.
>

 lparstat is one of the consumers of the PURR idle metric
 (https://groups.google.com/forum/#!topic/powerpc-utils-devel/fYRo69xO9r4).
 Agreed on the argument that system utilization metrics based on SPURR
 accounting are more accurate than those based on PURR, which isn't
 proportional to CPU frequency.  PURR has traditionally been used to understand
 system utilization, whereas SPURR is used to understand how much capacity is
 left (or being exceeded) in the system under the current power-saving mode.
>>>
>>> I'll phrase my question differently: does SPURR complement or supersede
>>> PURR? You seem to be saying they serve different purposes. If PURR is
>>> actually useful rather then vestigial then I have no objection to
>>> exposing idle_purr.
>>
>> SPURR complements PURR, so we need both. SPURR/PURR ratio helps provide an 
>> indication of the available headroom in terms of core resources, at maximum 
>> frequency.
> 
> Re-reading this today, I realize that this isn't entirely accurate.
> SPURR alone is sufficient to understand core resource utilization.
> 
> Kamalesh is using PURR to display non-normalized utilization values (under
> the 'actual' column), as reported by lparstat on AIX. I am not entirely sure
> if it is ok to derive these based on the SPURR busy/idle ratio.

Both idle_purr and idle_spurr complement each other, and we need to expose both
of them. Doing so will improve the accounting accuracy of tools that currently
consume system-wide PURR and/or SPURR numbers to report system usage. In my
experience, deriving one from the other makes it hard for tools or custom
scripts to give an accurate system view. One tool I am aware of is lparstat,
which uses PURR-based metrics.
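
To make the intended consumption a bit more concrete, below is a rough
userspace sketch of how a tool in the spirit of lparstat could derive a busy
fraction from two samples of the per-CPU purr file and the proposed idle_purr
file.  The sysfs paths and the hex format are assumptions based on this
thread, not a guaranteed ABI:

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Read one cumulative counter for a CPU.  The existing purr/spurr files are
 * hex formatted; this sketch assumes the proposed idle_purr file matches.
 */
static uint64_t read_cpu_counter(int cpu, const char *name)
{
        char path[128];
        uint64_t val = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/%s", cpu, name);
        f = fopen(path, "r");
        if (!f || fscanf(f, "%" SCNx64, &val) != 1)
                perror(path);
        if (f)
                fclose(f);
        return val;
}

int main(void)
{
        uint64_t purr0 = read_cpu_counter(0, "purr");
        uint64_t idle0 = read_cpu_counter(0, "idle_purr");

        sleep(1);

        uint64_t purr1 = read_cpu_counter(0, "purr");
        uint64_t idle1 = read_cpu_counter(0, "idle_purr");

        uint64_t total = purr1 - purr0;
        uint64_t idle = idle1 - idle0;

        /* Busy fraction over the interval: non-idle PURR ticks / total. */
        if (total)
                printf("cpu0 busy: %.1f%%\n",
                       100.0 * (double)(total - idle) / (double)total);
        return 0;
}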

-- 
Kamalesh



Re: [PATCH v6 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:43AM +0200, David Hildenbrand wrote:
> With shrink_pgdat_span() out of the way, we now always have a valid
> zone.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

> ---
>  mm/memory_hotplug.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bf5173e7913d..f294918f7211 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
>   if (unlikely(pfn_to_nid(start_pfn) != nid))
>   continue;
>  
> - if (zone && zone != page_zone(pfn_to_page(start_pfn)))
> + if (zone != page_zone(pfn_to_page(start_pfn)))
>   continue;
>  
>   return start_pfn;
> @@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>   if (unlikely(pfn_to_nid(pfn) != nid))
>   continue;
>  
> - if (zone && zone != page_zone(pfn_to_page(pfn)))
> + if (zone != page_zone(pfn_to_page(pfn)))
>   continue;
>  
>   return pfn;
> -- 
> 2.21.0
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()

2020-02-04 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:42AM +0200, David Hildenbrand wrote:
> Let's poison the pages similar to when adding new memory in
> sparse_add_section(). Also call remove_pfn_range_from_zone() from
> memunmap_pages(), so we can poison the memmap from there as well.
> 
> While at it, calculate the pfn in memunmap_pages() only once.
> 
> Cc: Andrew Morton 
> Cc: David Hildenbrand 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Signed-off-by: David Hildenbrand 

Looks good to me. It is fine as long as we do not access those pages later on,
and if my eyes did not lie to me, we have the proper checks (pfn_to_online_page)
in place to avoid that, so:

Reviewed-by: Oscar Salvador 
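
For context, the poisoning under review boils down to something like the call
below when a pfn range is removed from its zone.  This is only a sketch of the
idea; the exact call site and guards are whatever the patch itself adds:

/*
 * Poison the struct pages backing [pfn, pfn + nr_pages) so that any later
 * accidental use trips the usual debug checks, mirroring what
 * sparse_add_section() does when the memmap is first created.
 */
page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);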

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory

2020-02-04 Thread David Hildenbrand
>> I can understand this is desirable (yet, I am
>> not sure if this makes sense with the current take-and-not-give-back
>> review mentality on this list).
>>
>> Although it will make upstreaming stuff *even harder* and *even slower*,
>> maybe we should start to only queue patches that have an ACK/RB, so they
>> won't get blocked by this later on? At least that makes your life easier
>> and people won't have to eventually follow up on patches that have been
>> in linux-next for months.
> 
> The merge rate would still be the review rate, but the resulting merges
> would be of less tested code.

That's a valid point.

> 
>> Note: the result will be that many of my patches will still not get
>> reviewed, won't get queued/upstreamed, I will continuously ping and
>> resend, I will lose interest because I have better things to do, I will
>> lose interest in our code quality, I will lose interest to review.
>>
>> (side note: some people might actually enjoy me sending less cleanup
>> patches, so this approach might be desirable for some ;) )
>>
>> One alternative is to send patches upstream once they have been lying
>> around in linux-next for $RANDOM number of months, because they
>> obviously saw some testing and nobody started to yell at them once
>> stumbling over them on linux-mm.
> 
> Yes, I think that's the case with these patches and I've sent them to
> Linus.  Hopefully Michal will be able to find time to look them over in
> the next month or so.

I really hope we'll find more reviewers in general - I'm also not happy
if my patches go upstream with little/no review. However, patches
shouldn't be stuck for multiple merge windows in linux-next IMHO
(excluding exceptions of course) - then they should either be sent
upstream (and eventually fixed later) or dropped.

Thanks Andrew!

-- 
Thanks,

David / dhildenb