Re: [RFC PATCH 3/3] cpuidle/powernv: Conditionally save-restore sprs using opal

2018-08-10 Thread Nicholas Piggin
On Wed, 8 Aug 2018 21:11:16 +0530
Gautham R Shenoy  wrote:

> Hello Nicholas,
> 
> On Fri, Aug 03, 2018 at 12:05:47AM +1000, Nicholas Piggin wrote:
> > On Thu,  2 Aug 2018 10:21:32 +0530
> > Akshay Adiga  wrote:
> >   
> > > From: Abhishek Goel 
> > > 
> > > If a state has "opal-supported" compat flag in device-tree, an opal call
> > > needs to be made during the entry and exit of the stop state. This patch
> > > passes a hint to the power9_idle_stop and power9_offline_stop.
> > > 
> > > This patch moves the saving and restoring of sprs for P9 cpuidle
> > > from kernel to opal. This patch still uses existing code to detect
> > > first thread in core.
> > > In an attempt to make the powernv idle code backward compatible,
> > > and to some extent forward compatible, add support for pre-stop entry
> > > and post-stop exit actions in OPAL. If a kernel knows about this
> > > opal call, then just a firmware supporting newer hardware is required,
> > > instead of waiting for kernel updates.  
> > 
> > Still think we should make these do-everything calls. Including
> > executing nap/stop instructions, restoring timebase, possibly even
> > saving and restoring SLB (although a return code could be used to
> > tell the kernel to do that maybe if performance advantage is  
> enough).
> 
> So, if we execute the stop instruction in opal, the wakeup from stop
> still happens at the hypervisor 0x100. On wake up, we need to check
> SRR1 to see if we have lost state, in which case, the stop exit also
> needs to be handled inside opal.

Yes. That's okay, SRR1 seems to be pretty well architected.

> On return from this opal call, we
> need to unwind the extra stack frame that would have been created when
> kernel entered opal to execute the stop from which there was no
> return. In the case where a lossy stop state was requested, but wakeup
> happened from a lossless stop state, this adds additional overhead.

True, but you're going from 1 OPAL call to 2. So you still have that
overhead. Although possibly we could implement some special light
weight stackless calls (I'm thinking about doing that for MCE handling
too). Or you could perhaps just discard the stack without needing to
unwind anything in the case of a lossless wakeup.

> 
> Furthermore, the measurements show that performing the restore of the
> resources in OPAL vs doing so in the kernel on wakeup from stop takes
> an additional 5-10us. For the current stop states that lose hypervisor
> state, since the latency is relatively high (100s of us), this is a
> relatively small penalty (~1%).

Yeah OPAL is pretty heavy to enter. We can improve that a bit. But
yes for the P10 timeframe it may still be heavyweight.

> 
> However, if in future we have states that lose only a part of the
> hypervisor state to provide a wakeup latency in the order of a few tens
> of microseconds, the additional latency caused by the OPAL call would
> become noticeable, no?

I think so long as we can do shallow states in Linux it really won't
be that big a deal (even if we don't do any of the above speedup
tricks).

I think it's really desirable to have a complete firmware
implementation. Having this compromise seems like the worst of both
in a way (does not allow firmware to control everything, and does
not have great performance).
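
To make the two shapes being discussed concrete, here is a hedged C
sketch; every opal_* and enter_stop name below is hypothetical, not an
existing OPAL or kernel API:

/* Patch as posted: Linux still executes stop, OPAL brackets it. */
static unsigned long stop_with_opal_hints(unsigned long psscr)
{
	unsigned long srr1;

	opal_idle_save_sprs(psscr);	/* OPAL call #1, pre-stop */
	srr1 = enter_stop(psscr);	/* stop instruction executed in Linux */
	opal_idle_restore_sprs(srr1);	/* OPAL call #2, post-stop */
	return srr1;
}

/* "Do-everything" alternative: a single firmware call executes stop,
 * checks SRR1 at the 0x100 wakeup, and restores any lost state (SPRs,
 * timebase, possibly the SLB) before returning to the kernel. */
static unsigned long stop_in_firmware(unsigned long psscr)
{
	return opal_cpu_idle(psscr);	/* one OPAL call total */
}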

> 
>   
> > 
> > I haven't had a lot of time to go through it, I'm working on moving
> > ~all of idle_book3s.S to C code, I'd like to do that before this
> > OPAL idle driver if possible.
> > 
> > A minor thing I just noticed, you don't have to allocate the opal
> > spr save space in Linux, just do it all in OPAL.  
> 
> The idea was to not leave any state in OPAL, as OPAL is supposed to be
> state-less. However, I agree, that if OPAL is not going to interpret
> the contents of the save/area, it should be harmless to move that bit
> into OPAL.
> 
> That said, if we are going to add the logic of determining the first
> thread in the core waking up, etc, then we have no choice but to
> maintain that state in OPAL.

I don't think it's such a problem for particular very carefully
defined cases like this.

Thanks,
Nick


Pull request: scottwood/linux.git next

2018-08-10 Thread Scott Wood
This contains an 8xx compilation fix, and a dpaa device tree fix.

The following changes since commit 77b5f703dcc859915f0f20d92bc538e4a99ef149:

  powerpc/powernv/opal: Use standard interrupts property when available 
(2018-08-08 00:32:38 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next

for you to fetch changes up to bd96461249bdded1f6ee50b25e8cd4ce3faaf581:

  powerpc/dts/fsl: t2080rdb: use the Cortina PHY driver compatible (2018-08-08 
17:18:02 -0500)


Camelia Groza (3):
  powerpc/configs/dpaa: enable the Cortina PHY driver
  powerpc/dts/fsl: t4240rdb: use the Cortina PHY driver compatible
  powerpc/dts/fsl: t2080rdb: use the Cortina PHY driver compatible

Christophe Leroy (1):
  powerpc/cpm1: fix compilation error with CONFIG_PPC_EARLY_DEBUG_CPM

 arch/powerpc/boot/dts/fsl/t2080rdb.dts | 4 ++--
 arch/powerpc/boot/dts/fsl/t4240rdb.dts | 8 ++++----
 arch/powerpc/configs/dpaa.config       | 1 +
 arch/powerpc/include/asm/fixmap.h      | 1 +
 arch/powerpc/sysdev/cpm_common.c       | 1 +
 5 files changed, 9 insertions(+), 6 deletions(-)


Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-10 Thread Nicholas Piggin
On Tue, 07 Aug 2018 19:47:39 +0530
Mahesh J Salgaonkar  wrote:

> From: Mahesh Salgaonkar 
> 
> If we get a machine check exception due to SLB errors then dump the
> current SLB contents which will be very much helpful in debugging the
> root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
> faulty SLB entries. In real mode mce handler saves the old SLB contents
> into this buffer accessible through paca and print it out later in virtual
> mode.
> 
> With this patch the console will log SLB contents like below on SLB MCE
> errors:
> 
> [  507.297236] SLB contents of cpu 0x1
> [  507.297237] Last SLB entry inserted at slot 16
> [  507.297238] 00 c800 400ea1b217000500
> [  507.297239]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
> [  507.297240] 01 d800 400d43642f000510
> [  507.297242]   1T  ESID=   d0  VSID=  d43642f LLP:110
> [  507.297243] 11 f800 400a86c85f000500
> [  507.297244]   1T  ESID=   f0  VSID=  a86c85f LLP:100
> [  507.297245] 12 7f000800 4008119624000d90
> [  507.297246]   1T  ESID=   7f  VSID=  8119624 LLP:110
> [  507.297247] 13 1800 00092885f5150d90
> [  507.297247]  256M ESID=1  VSID=   92885f5150 LLP:110
> [  507.297248] 14 01000800 4009e7cb5d90
> [  507.297249]   1T  ESID=1  VSID=  9e7cb50 LLP:110
> [  507.297250] 15 d800 400d43642f000510
> [  507.297251]   1T  ESID=   d0  VSID=  d43642f LLP:110
> [  507.297252] 16 d800 400d43642f000510
> [  507.297253]   1T  ESID=   d0  VSID=  d43642f LLP:110
> [  507.297253] --
> [  507.297254] SLB cache ptr value = 3
> [  507.297254] Valid SLB cache entries:
> [  507.297255] 00 EA[0-35]=7f000
> [  507.297256] 01 EA[0-35]=1
> [  507.297257] 02 EA[0-35]= 1000
> [  507.297257] Rest of SLB cache entries:
> [  507.297258] 03 EA[0-35]=7f000
> [  507.297258] 04 EA[0-35]=1
> [  507.297259] 05 EA[0-35]= 1000
> [  507.297260] 06 EA[0-35]=   12
> [  507.297260] 07 EA[0-35]=7f000
> 
> Suggested-by: Aneesh Kumar K.V 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Mahesh Salgaonkar 
> ---
> 
> Changes in V7:
> - Print slb cache ptr value and slb cache data
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    7 ++
>  arch/powerpc/include/asm/paca.h               |    4 +
>  arch/powerpc/mm/slb.c                         |   73 +
>  arch/powerpc/platforms/pseries/ras.c          |   10 +++
>  arch/powerpc/platforms/pseries/setup.c        |   10 +++
>  5 files changed, 103 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index cc00a7088cf3..5a3fe282076d 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
>  
>  extern void hpte_init_native(void);
>  
> +struct slb_entry {
> + u64 esid;
> + u64 vsid;
> +};
> +
>  extern void slb_initialize(void);
>  extern void slb_flush_and_rebolt(void);
>  extern void slb_flush_and_rebolt_realmode(void);
> +extern void slb_save_contents(struct slb_entry *slb_ptr);
> +extern void slb_dump_contents(struct slb_entry *slb_ptr);
>  
>  extern void slb_vmalloc_update(void);
>  extern void slb_set_size(u16 size);
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 7f22929ce915..233d25ff6f64 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -254,6 +254,10 @@ struct paca_struct {
>  #endif
>  #ifdef CONFIG_PPC_PSERIES
>   u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
> +
> + /* Capture SLB related old contents in MCE handler. */
> + struct slb_entry *mce_faulty_slbs;
> + u16 slb_save_cache_ptr;
>  #endif /* CONFIG_PPC_PSERIES */
>  } cacheline_aligned;
>  
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index e89f675f1b5e..16a53689ffd4 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>   get_paca()->slb_cache_ptr = 0;
>  }
>  
> +void slb_save_contents(struct slb_entry *slb_ptr)
> +{
> + int i;
> + unsigned long e, v;
> +
> + /* Save slb_cache_ptr value. */
> + get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;

What's the point of saving this?

> +
> + if (!slb_ptr)
> + return;

Can this ever happen?

> +
> + for (i = 0; i < mmu_slb_size; i++) {
> + asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
> + asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));

Does the UM say these instructions can cause machine checks if the SLB
is corrupted? It talks about mfslb instruction causing MCE, but there
seems to be no such instruction 

Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()

2018-08-10 Thread Murilo Opsfelder Araujo
Hi, Christophe.

On Fri, Aug 10, 2018 at 11:29:43AM +0200, Christophe LEROY wrote:
>
>
> On 03/08/2018 at 13:31, Murilo Opsfelder Araujo wrote:
> > Hi, everyone.
> >
> > I'd like to thank you all that contributed to refining and making this
> > series better.  I did appreciate.
> >
> > Thank you!
>
> You are welcome.
>
> It seems that nested segfaults don't print very well. The code line is split
> in two, the second half being alone on one line. Any way to avoid that ?

What do you mean by "nested segfaults"?  I'd guess it's related to nested
virtualization, e.g.: host (kvm hv) -> guest1 (kvm pr) -> guest2?  So the
segfault would have originated in guest2?

Can you provide us with more details about how to reproduce that?

Anyway, is "" so important that it needs to be printed at all?  Can't we
just discard it and not print it?

Murilo



[PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch

2018-08-10 Thread Srikar Dronamraju
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduces a new numa level. However on shared
lpars like powerpc, this extra sched domain creation can lead to
repeated rcu stalls, sometimes even causing unresponsive systems on
boot. On such stalls, it was noticed that init_sched_groups_capacity()
was stuck in a loop (sg != sd->groups is always true).

INFO: rcu_sched self-detected stall on CPU
 1-: (240039 ticks this GP) idle=c32/1/4611686018427387906 softirq=782/782 
fqs=80012
  (t=240039 jiffies g=6272 c=6271 q=263040)
NMI backtrace for cpu 1
CPU: 1 PID: 1576 Comm: kworker/1:1 Kdump: loaded Tainted: GE 
4.18.0-rc7-master+ #42
Workqueue: events topology_work_fn
Call Trace:
[c0832132f190] [c09557ac] dump_stack+0xb0/0xf4 (unreliable)
[c0832132f1d0] [c095ed54] nmi_cpu_backtrace+0x1b4/0x230
[c0832132f270] [c095efac] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[c0832132f310] [c005f77c] arch_trigger_cpumask_backtrace+0x2c/0x40
[c0832132f330] [c01a32d4] rcu_dump_cpu_stacks+0x100/0x15c
[c0832132f380] [c01a2024] rcu_check_callbacks+0x894/0xaa0
[c0832132f4a0] [c01ace9c] update_process_times+0x4c/0xa0
[c0832132f4d0] [c01c5400] tick_sched_handle.isra.13+0x50/0x80
[c0832132f4f0] [c01c549c] tick_sched_timer+0x6c/0xd0
[c0832132f530] [c01ae044] __hrtimer_run_queues+0x134/0x360
[c0832132f5b0] [c01aeea4] hrtimer_interrupt+0x124/0x300
[c0832132f660] [c0024a04] timer_interrupt+0x114/0x2f0
[c0832132f6c0] [c00090f4] decrementer_common+0x114/0x120
--- interrupt: 901 at __bitmap_weight+0x70/0x100
LR = __bitmap_weight+0x78/0x100
[c0832132f9b0] [c09bb738] __func__.61127+0x0/0x20 (unreliable)
[c0832132fa00] [c016c178] build_sched_domains+0xf98/0x13f0
[c0832132fb30] [c016d73c] partition_sched_domains+0x26c/0x440
[c0832132fc20] [c01ee284] rebuild_sched_domains_locked+0x64/0x80
[c0832132fc50] [c01f11ec] rebuild_sched_domains+0x3c/0x60
[c0832132fc80] [c007e1c4] topology_work_fn+0x24/0x40
[c0832132fca0] [c0126704] process_one_work+0x1a4/0x470
[c0832132fd30] [c0126a68] worker_thread+0x98/0x540
[c0832132fdc0] [c012f078] kthread+0x168/0x1b0
[c0832132fe30] [c000b65c]
ret_from_kernel_thread+0x5c/0x80

A similar problem was also reported earlier at
https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/

Allow arch to set and clear masks corresponding to numa sched domain.

Cc: linuxppc-dev 
Cc: Michael Ellerman 
Cc: Peter Zijlstra 
Cc: Heiko Carstens 
Cc: Ingo Molnar 
Cc: LKML 
Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
Signed-off-by: Srikar Dronamraju 

Signed-off-by: Srikar Dronamraju 
---
 include/linux/sched/topology.h | 6 ++
 kernel/sched/sched.h   | 4 
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..13c7baeb7789 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -52,6 +52,12 @@ static inline int cpu_numa_flags(void)
 {
return SD_NUMA;
 }
+
+extern void sched_domains_numa_masks_set(unsigned int cpu);
+extern void sched_domains_numa_masks_clear(unsigned int cpu);
+#else
+static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
+static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..1028f3df8777 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,12 +1057,8 @@ extern bool find_numa_distance(int distance);
 
 #ifdef CONFIG_NUMA
 extern void sched_init_numa(void);
-extern void sched_domains_numa_masks_set(unsigned int cpu);
-extern void sched_domains_numa_masks_clear(unsigned int cpu);
 #else
 static inline void sched_init_numa(void) { }
-static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
-static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-- 
2.12.3
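
A hedged sketch of the arch-side usage this export enables; the hotplug
callback names here are illustrative, not taken from the patch:

/* e.g. wired into powerpc's CPU hotplug path; the empty !CONFIG_NUMA
 * stubs added in sched/topology.h keep this buildable everywhere. */
static int ppc_numa_cpu_prepare(unsigned int cpu)
{
	sched_domains_numa_masks_set(cpu);
	return 0;
}

static int ppc_numa_cpu_dead(unsigned int cpu)
{
	sched_domains_numa_masks_clear(cpu);
	return 0;
}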



[PATCH 1/2] sched/topology: Set correct numa topology type

2018-08-10 Thread Srikar Dronamraju
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduces a new numa level. However this
leads to the numa topology on 2 node systems no longer being marked as
NUMA_DIRECT. After this commit, it gets reported as NUMA_BACKPLANE,
because sched_domains_numa_level is now 2 on 2 node systems.

Fix this by allowing systems that have up to 2 numa levels to be set as
NUMA_DIRECT.

While here, remove code that assumes level can be 0.

Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/topology.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a6e6b855ba81..cec3ee3ed320 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1315,7 +1315,7 @@ static void init_numa_topology_type(void)
 
n = sched_max_numa_distance;
 
-   if (sched_domains_numa_levels <= 1) {
+   if (sched_domains_numa_levels <= 2) {
sched_numa_topology_type = NUMA_DIRECT;
return;
}
@@ -1400,9 +1400,6 @@ void sched_init_numa(void)
break;
}
 
-   if (!level)
-   return;
-
/*
 * 'level' contains the number of unique distances
 *
-- 
2.12.3
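
A toy user-space illustration (not kernel code) of why a 2 node system
now has two numa levels: the identity level adds the local distance, so
the unique-distance count is 2, which the old "<= 1" check misread as
NUMA_BACKPLANE:

#include <stdio.h>

int main(void)
{
	/* 2 node distance table: 10 local, 40 remote */
	int dist[2][2] = { { 10, 40 }, { 40, 10 } };
	int uniq[4], levels = 0;

	for (int i = 0; i < 2; i++)
		for (int j = 0; j < 2; j++) {
			int seen = 0;

			for (int k = 0; k < levels; k++)
				if (uniq[k] == dist[i][j])
					seen = 1;
			if (!seen)
				uniq[levels++] = dist[i][j];
		}

	/* prints 2: such a system is still direct-attached */
	printf("unique distances (levels) = %d\n", levels);
	return 0;
}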



Re: [PATCH] sched/topology: Use Identity node only if required

2018-08-10 Thread Srikar Dronamraju
* Peter Zijlstra  [2018-08-08 09:58:41]:

> On Wed, Aug 08, 2018 at 12:39:31PM +0530, Srikar Dronamraju wrote:
> > With Commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> > sched domain") scheduler introduces an extra numa level. However that
> > leads to
> > 
> >  - numa topology on 2 node systems is no longer marked as NUMA_DIRECT.
> >    After this commit, it gets reported as NUMA_BACKPLANE. This is
> >    because sched_domains_numa_level now equals 2 on 2 node systems.
> > 
> >  - Extra numa sched domain that gets added and degenerated on most
> >    machines.  The Identity node is only needed on very few systems.
> >    Also all non-numa systems will end up populating
> >    sched_domains_numa_distance and sched_domains_numa_masks tables.
> > 
> >  - On shared lpars like powerpc, this extra sched domain creation can
> >    lead to repeated rcu stalls, sometimes even causing unresponsive
> >    systems on boot. On such stalls, it was noticed that
> >    init_sched_groups_capacity() was stuck in a loop (sg != sd->groups
> >    is always true).
> 
> The idea was that if the topology level is redundant (as it often is);
> then the degenerate code would take it out.
> 
> Why is that not working (right) and can we fix that instead?
> 

Here is my analysis on another box showing the same issue.
numactl o/p

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 64 65 66 67 68 69 70 71 96 
97 98 99 100 101 102 103 128 129 130 131 132 133 134 135 160 161 162 163 164 
165 166 167 192 193 194 195 196 197 198 199 224 225 226 227 228 229 230 231 256 
257 258 259 260 261 262 263 288 289 290 291 292 293 294 295
node 0 size: 536888 MB
node 0 free: 533582 MB
node 1 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 88 89 90 91 92 93 
94 95 120 121 122 123 124 125 126 127 152 153 154 155 156 157 158 159 184 185 
186 187 188 189 190 191 216 217 218 219 220 221 222 223 248 249 250 251 252 253 
254 255 280 281 282 283 284 285 286 287
node 1 size: 502286 MB
node 1 free: 501283 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 80 81 82 83 84 85 
86 87 112 113 114 115 116 117 118 119 144 145 146 147 148 149 150 151 176 177 
178 179 180 181 182 183 208 209 210 211 212 213 214 215 240 241 242 243 244 245 
246 247 272 273 274 275 276 277 278 279
node 2 size: 503054 MB
node 2 free: 502854 MB
node 3 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 
79 104 105 106 107 108 109 110 111 136 137 138 139 140 141 142 143 168 169 170 
171 172 173 174 175 200 201 202 203 204 205 206 207 232 233 234 235 236 237 238 
239 264 265 266 267 268 269 270 271 296 297 298 299 300 301 302 303
node 3 size: 503310 MB
node 3 free: 498465 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

Extracting the contents of dmesg using sched_debug kernel parameter

CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.


CPU302 attaching NULL sched-domain.
CPU303 attaching NULL sched-domain.
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
.
.
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain
BUG: arch topology borken
 the DIE domain not a subset of the NODE domain

 CPU0 attaching sched-domain(s):
   domain-2: sdA, span=0-303 level=NODE
groups: sg=sgL 0:{ 
span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
cap=81920 }, sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sgN 16:{ 
span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
cap=73728 }, sgO 24:{ 
span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
cap=73728 }
CPU1  attaching sched-domain(s):
   domain-2: sdB, span=0-303 level=NODE
[  367.739387] groups: sg=sgL 0:{ 
span=0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295 
cap=81920 }, sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sgN 16:{ 
span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
cap=73728 }, sgO 24:{ 
span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 
cap=73728 }


CPU8  attaching sched-domain(s):
   domain-2: sdC, 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
level=NODE
groups: sgM 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }
domain-3: sdD, span=0-303 level=NUMA
 groups: sgX 8:{ 
span=8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303 
cap=81920 }, sgY 16:{ 
span=16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279 
cap=73728 }, sgZ 24:{ 
span=24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287 

Re: [PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-10 Thread Michael Bringmann
On 08/10/2018 08:21 AM, Srikar Dronamraju wrote:
> * Michael Ellerman  [2018-08-10 21:42:28]:
>
>> Srikar Dronamraju  writes:
>>> diff --git a/arch/powerpc/include/asm/topology.h 
>>> b/arch/powerpc/include/asm/topology.h
>>> index 16b077801a5f..70f2d2285ba7 100644
>>> --- a/arch/powerpc/include/asm/topology.h
>>> +++ b/arch/powerpc/include/asm/topology.h
>>> @@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
>>>  {
>>> return 0;
>>>  }
>>> +
>>> +#ifdef CONFIG_SMP
>>> +static inline void check_topology_updates(void) {}
>>> +#endif
>> I realise that's only called from smp.c but it doesn't really need to be
>> inside #ifdef CONFIG_SMP, it's harmless to have an empty version for
>> SMP=n.
>>
> I did that just to fix
> https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/577//
>
>
>>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>>> index 4794d6b4f4d2..2aa0ffd954c9 100644
>>> --- a/arch/powerpc/kernel/smp.c
>>> +++ b/arch/powerpc/kernel/smp.c
>>> @@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
>>> if (smp_ops && smp_ops->bringup_done)
>>> smp_ops->bringup_done();
>>>  
>>> +   /*
>>> +* On a shared LPAR, associativity needs to be requested.
>>> +* Hence, check for numa topology updates before dumping
>>> +* cpu topology
>>> +*/
>>> +   check_topology_updates();
>> I get that you're calling it here because you want to get it correct
>> before we do the dump below, but we could move the dump later also.
>>
>> You mention we need to do the check before the sched domains are
>> initialised, but where exactly does that happen?
> Almost immediately after smp_cpus_done, 
>
> kernel_init_freeable() calls smp_init() and sched_init_smp() back to
> back.
>
> smp_init() calls smp_cpus_done()
> sched_init_smp() calls sched_init_numa() and sched_init_domains()
>
> sched_init_numa() initialises sched_domains_numa_masks which is the
> cpumask that matters the most in our discussions.
> sched_init_domains() initialises the sched_domains.
>
> So I don't think we can move it any later.
>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0c7e05d89244..32c13a208589 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -1515,6 +1515,7 @@ int start_topology_update(void)
>>>lppaca_shared_proc(get_lppaca())) {
>>> if (!vphn_enabled) {
>>> vphn_enabled = 1;
>>> +   topology_update_needed = 1;
>> I really don't understand what topology_update_needed is for.
>> We set it here.
> topology_update_needed is even set by numa_update_cpu_topology which
> might get called from rtasd.
Yes.  While looking at problems with Shared CPU Topology updates
for NUMA last year, I observed that numa_update_cpu_topology
was being called before topology_update_init(), and thus, the next
call to numa_update_cpu_topology did not correctly initialize the
topology based upon VPHN.
>
>>> setup_cpu_associativity_change_counters();
>>> timer_setup(&topology_timer, topology_timer_fn,
>>> TIMER_DEFERRABLE);
>>> @@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
>>> return prrn_enabled;
>>>  }
>>>  
>>> +void __init check_topology_updates(void)
>>> +{
>>> +   /* Do not poll for changes if disabled at boot */
>>> +   if (topology_updates_enabled)
>>> +   start_topology_update();
>> Here we call start_topology_update(), which might set
>> topology_update_needed to 1.
>>
>>> +
>>> +   if (topology_update_needed) {
>> And then we check it here?
>> Why is this logic not *in* start_topology_update() ?
> I have split topology_update_init into two parts.  First part being
> check_topology_update() and the next part being topology_update_init().
>
> By the time the current topology_update_init() gets called,
> sched_domains are already created. So we need to move this part before.
>
> However the scheduled topology updates because of vphn can wait, at least
> that's how the current code is written. topology_inited indicates if the
> scheduled topology updates have been setup.
>
>>> +   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
>>> +   nr_cpumask_bits);
>>> +   numa_update_cpu_topology(false);
>>> +   }
>>> +}
>> I'm not really clear on what check_topology_update() is responsible for
>> doing vs topology_update_init(). Why do we need both?
> I am okay to move the complete topology_update_init above and delete the
> device_initcall. But then we would be moving the polling of vphn events
> to a little bit earlier stage in the initialisation.
>
>>> @@ -1608,10 +1618,6 @@ static int topology_update_init(void)
>>> return -ENOMEM;
>>>  
>>> topology_inited = 1;
>> What does that mean? Didn't we already do the init earlier?
> If there is an explicit request for topology update via
> numa_update_cpu_topology() even before we looked at the feature flags
> and set vphn_enabled/prrn_enabled, then we would defer handling it till
> the feature flags are checked. topology_inited helps us in achieving
> this.

Re: [PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-10 Thread Srikar Dronamraju
* Michael Ellerman  [2018-08-10 21:42:28]:

> Srikar Dronamraju  writes:
> > diff --git a/arch/powerpc/include/asm/topology.h 
> > b/arch/powerpc/include/asm/topology.h
> > index 16b077801a5f..70f2d2285ba7 100644
> > --- a/arch/powerpc/include/asm/topology.h
> > +++ b/arch/powerpc/include/asm/topology.h
> > @@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
> >  {
> > return 0;
> >  }
> > +
> > +#ifdef CONFIG_SMP
> > +static inline void check_topology_updates(void) {}
> > +#endif
> 
> I realise that's only called from smp.c but it doesn't really need to be
> inside #ifdef CONFIG_SMP, it's harmless to have an empty version for
> SMP=n.
> 

I did that just to fix
https://openpower.xyz/job/snowpatch/job/snowpatch-linux-sparse/577//


> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 4794d6b4f4d2..2aa0ffd954c9 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
> > if (smp_ops && smp_ops->bringup_done)
> > smp_ops->bringup_done();
> >  
> > +   /*
> > +* On a shared LPAR, associativity needs to be requested.
> > +* Hence, check for numa topology updates before dumping
> > +* cpu topology
> > +*/
> > +   check_topology_updates();
> 
> I get that you're calling it here because you want to get it correct
> before we do the dump below, but we could move the dump later also.
> 
> You mention we need to do the check before the sched domains are
> initialised, but where exactly does that happen?

Almost immediately after smp_cpus_done, 

kernel_init_freeable() calls smp_init() and sched_init_smp() back to
back.

smp_init() calls smp_cpus_done()
sched_init_smp() calls sched_init_numa() and sched_init_domains()

sched_init_numa() initialises sched_domains_numa_masks which is the
cpumask that matters the most in our discussions.
sched_init_domains() initialises the sched_domains.

So I don't think we can move it any later.
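
For reference, a sketch of that ordering (paraphrased from init/main.c
and kernel/sched/, not verbatim):

/*
 * kernel_init_freeable()
 *     smp_init()
 *         smp_cpus_done()       <-- proposed check_topology_updates() site
 *     sched_init_smp()
 *         sched_init_numa()     <-- builds sched_domains_numa_masks
 *         sched_init_domains()  <-- builds the sched domains
 */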

> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 0c7e05d89244..32c13a208589 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1515,6 +1515,7 @@ int start_topology_update(void)
> >lppaca_shared_proc(get_lppaca())) {
> > if (!vphn_enabled) {
> > vphn_enabled = 1;
> > +   topology_update_needed = 1;
> 
> I really don't understand what topology_update_needed is for.
> We set it here.

topology_update_needed is even set by numa_update_cpu_topology which
might get called from rtasd.

> 
> > setup_cpu_associativity_change_counters();
> > timer_setup(&topology_timer, topology_timer_fn,
> > TIMER_DEFERRABLE);
> > @@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
> > return prrn_enabled;
> >  }
> >  
> > +void __init check_topology_updates(void)
> > +{
> > +   /* Do not poll for changes if disabled at boot */
> > +   if (topology_updates_enabled)
> > +   start_topology_update();
> 
> Here we call start_topology_update(), which might set
> topology_update_needed to 1.
> 
> > +
> > +   if (topology_update_needed) {
> 
> And then we check it here?
> Why is this logic not *in* start_topology_update() ?

I have split topology_update_init into two parts.  First part being
check_topology_update() and the next part being topology_update_init().

By the time the current topology_update_init() gets called,
sched_domains are already created. So we need to move this part before.

However the scheduled topology updates because of vphn can wait, at least
that's how the current code is written. topology_inited indicates if the
scheduled topology updates have been setup.

> 
> > +   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> > +   nr_cpumask_bits);
> > +   numa_update_cpu_topology(false);
> > +   }
> > +}
> 
> I'm not really clear on what check_topology_update() is responsible for
> doing vs topology_update_init(). Why do we need both?

I am okay to move the complete topology_update_init above and delete the
device_initcall. But then we would be moving the polling of vphn events
to a little bit earlier stage in the initialisation.

> 
> > @@ -1608,10 +1618,6 @@ static int topology_update_init(void)
> > return -ENOMEM;
> >  
> > topology_inited = 1;
> 
> What does that mean? Didn't we already do the init earlier?

If there is an explicit request for topology update via
numa_update_cpu_topology() even before we looked at the feature flags
and set vphn_enabled/prrn_enabled, then we would defer handling it till
the feature flags are checked. topology_inited helps us in achieving
this.

> 
> > -   if (topology_update_needed)
> > -   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> > -   nr_cpumask_bits);
> > -
> > return 0;
> >  }
> >  

[PATCH] powerpc/uaccess: Enable get_user(u64, *p) on 32-bit

2018-08-10 Thread Michael Ellerman
Currently if you build a 32-bit powerpc kernel and use get_user() to
load a u64 value it will fail to build with eg:

  kernel/rseq.o: In function `rseq_get_rseq_cs':
  kernel/rseq.c:123: undefined reference to `__get_user_bad'

This is hitting the check in __get_user_size() that makes sure the
size we're copying doesn't exceed the size of the destination:

  #define __get_user_size(x, ptr, size, retval)
  do {
retval = 0;
__chk_user_ptr(ptr);
if (size > sizeof(x))
(x) = __get_user_bad();

Which doesn't immediately make sense because the size of the
destination is u64, but it's not really, because __get_user_check()
etc. internally create an unsigned long and copy into that:

  #define __get_user_check(x, ptr, size)
  ({
long __gu_err = -EFAULT;
unsigned long  __gu_val = 0;

The problem being that on 32-bit unsigned long is not big enough to
hold a u64. We can fix this with a trick from hpa in the x86 code, we
statically check the type of x and set the type of __gu_val to either
unsigned long or unsigned long long.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/uaccess.h | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 643cfbd5bcb5..bac225bb7f64 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -249,10 +249,17 @@ do {								\
 	}							\
 } while (0)
 
+/*
+ * This is a type: either unsigned long, if the argument fits into
+ * that type, or otherwise unsigned long long.
+ */
+#define __long_type(x) \
+   __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
+
 #define __get_user_nocheck(x, ptr, size)   \
 ({ \
long __gu_err;  \
-   unsigned long __gu_val; \
+   __long_type(*(ptr)) __gu_val;   \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
__chk_user_ptr(ptr);\
if (!is_kernel_addr((unsigned long)__gu_addr))  \
@@ -266,7 +273,7 @@ do {								\
 #define __get_user_check(x, ptr, size) \
 ({ \
long __gu_err = -EFAULT;\
-   unsigned long  __gu_val = 0;\
+   __long_type(*(ptr)) __gu_val = 0;   \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
might_fault();  \
if (access_ok(VERIFY_READ, __gu_addr, (size))) {\
@@ -280,7 +287,7 @@ do {								\
 #define __get_user_nosleep(x, ptr, size)   \
 ({ \
long __gu_err;  \
-   unsigned long __gu_val; \
+   __long_type(*(ptr)) __gu_val;   \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
__chk_user_ptr(ptr);\
barrier_nospec();   \
-- 
2.14.1
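
A stand-alone user-space demo of the __builtin_choose_expr trick (this
program is illustrative, not part of the patch):

#include <stdio.h>

#define __long_type(x) \
	__typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))

int main(void)
{
	unsigned int small_src = 0;
	unsigned long long big_src = 0;

	__long_type(small_src) a;	/* unsigned long */
	__long_type(big_src) b;		/* 8 bytes on both 32- and 64-bit */

	printf("sizeof(a)=%zu sizeof(b)=%zu\n", sizeof(a), sizeof(b));
	return 0;
}

On a 32-bit build b is unsigned long long, so a u64 destination no
longer overflows the temporary.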



Re: [PATCH v3 1/2] powerpc/pseries: Avoid blocking rtas polling handling multiple PRRN events

2018-08-10 Thread Michael Ellerman
John Allen  writes:
> When a PRRN event is being handled and another PRRN event comes in, the
> second event will block rtas polling waiting on the first to complete,
> preventing any further rtas events from being handled. This can be
> especially problematic in the case that PRRN events are continuously
> being queued, in which case rtas polling gets blocked indefinitely.
>
> This patch removes the blocking call to flush_work and allows the
> default workqueue behavior to handle duplicate events.
>
> Signed-off-by: John Allen 
> ---
> v3:
>   -Scrap the mutex as it only replicates existing workqueue behavior.
> v2:
>   -Unlock prrn_lock when PRRN operations are complete, not after handler is
>scheduled.
>   -Remove call to flush_work, the previous broken method of serializing
>PRRN events.
> ---
>  arch/powerpc/kernel/rtasd.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> index 44d66c33d59d..2017934e5985 100644
> --- a/arch/powerpc/kernel/rtasd.c
> +++ b/arch/powerpc/kernel/rtasd.c
> @@ -290,7 +290,6 @@ static DECLARE_WORK(prrn_work, prrn_work_fn);
>  
>  static void prrn_schedule_update(u32 scope)
>  {
> - flush_work(&prrn_work);
>   prrn_update_scope = scope;
>   schedule_work(&prrn_work);
>  }

I think we can still lose a "scope" value here.

ie.

prrn_schedule_update(1)
  prrn_update_scope = 1;
  schedule_work(&prrn_work);   // scheduled but doesn't run yet

...

prrn_schedule_update(2)
  prrn_update_scope = 2;
  schedule_work(&prrn_work);   // it's still scheduled

// finally

prrn_work_fn(..)
  pseries_devicetree_update(-prrn_update_scope); // aka 2


So we lost the "1" value.

We could fix that by allocating a work struct that holds the scope value.
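
A hedged sketch (untested, not from the thread) of that per-event work
struct idea, so no scope value can be lost when events coalesce:

#include <linux/slab.h>
#include <linux/workqueue.h>

struct prrn_update_work {
	struct work_struct work;
	u32 scope;
};

static void prrn_update_work_fn(struct work_struct *work)
{
	struct prrn_update_work *pw =
		container_of(work, struct prrn_update_work, work);

	pseries_devicetree_update(-pw->scope);
	kfree(pw);
}

static void prrn_schedule_update(u32 scope)
{
	struct prrn_update_work *pw = kmalloc(sizeof(*pw), GFP_ATOMIC);

	if (!pw)
		return;

	pw->scope = scope;
	INIT_WORK(&pw->work, prrn_update_work_fn);
	schedule_work(&pw->work);	/* each event carries its own scope */
}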

But I'd really like to try just calling the prrn_work_fn() directly,
we're already running as a scheduled work function, I don't see any
benefit in scheduling more work.

cheers


Re: [PATCH v3 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

2018-08-10 Thread Michael Ellerman
Hi Nathan,

Nathan Fontenot  writes:
> On 08/08/2018 10:29 AM, John Allen wrote:
>> While handling PRRN events, the time to handle the actual hotplug events
>> dwarfs the time it takes to perform the device tree updates and queue the
>> hotplug events. In the case that PRRN events are being queued continuously,
>> hotplug events have been observed to be queued faster than the kernel can
>> actually handle them. This patch avoids the problem by waiting for a
>> hotplug request to complete before queueing more hotplug events.
>> 
>> Signed-off-by: John Allen 
>
> In the V2 thread it was mentioned that we could just call the DLPAR operation
> directly instead of going through the workqueue. I have written a patch to do
> this that also cleans up some of the request handling.
>
> Requests that come through the hotplug interrupt still use the workqueue. The
> other requests, PRRN and sysfs, just call the dlpar handler directly. This
> eliminates the need for a wait conditional and return code handling in the
> workqueue handler and should solve the issue that John solves with his patch.
>
> This still needs testing but wanted to get people's thoughts.

It looks great to me.

I guess I thought we'd need to add a lock, but I think you're right that
the existing device hotplug lock will work.

So if this survives a bit of testing then I'd love to merge it.

cheers


Re: [PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-10 Thread Michael Ellerman
Hi Srikar,

Thanks for debugging this.

Srikar Dronamraju  writes:
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 16b077801a5f..70f2d2285ba7 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -113,6 +114,10 @@ static inline int timed_topology_update(int nsecs)
>  {
>   return 0;
>  }
> +
> +#ifdef CONFIG_SMP
> +static inline void check_topology_updates(void) {}
> +#endif

I realise that's only called from smp.c but it doesn't really need to be
inside #ifdef CONFIG_SMP, it's harmless to have an empty version for
SMP=n.

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 4794d6b4f4d2..2aa0ffd954c9 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1156,6 +1156,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
>   if (smp_ops && smp_ops->bringup_done)
>   smp_ops->bringup_done();
>  
> + /*
> +  * On a shared LPAR, associativity needs to be requested.
> +  * Hence, check for numa topology updates before dumping
> +  * cpu topology
> +  */
> + check_topology_updates();

I get that you're calling it here because you want to get it correct
before we do the dump below, but we could move the dump later also.

You mention we need to do the check before the sched domains are
initialised, but where exactly does that happen?

> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0c7e05d89244..32c13a208589 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1515,6 +1515,7 @@ int start_topology_update(void)
>  lppaca_shared_proc(get_lppaca())) {
>   if (!vphn_enabled) {
>   vphn_enabled = 1;
> + topology_update_needed = 1;

I really don't understand what topology_update_needed is for.
We set it here.

>   setup_cpu_associativity_change_counters();
> timer_setup(&topology_timer, topology_timer_fn,
>   TIMER_DEFERRABLE);
> @@ -1551,6 +1552,19 @@ int prrn_is_enabled(void)
>   return prrn_enabled;
>  }
>  
> +void __init check_topology_updates(void)
> +{
> + /* Do not poll for changes if disabled at boot */
> + if (topology_updates_enabled)
> + start_topology_update();

Here we call start_topology_update(), which might set
topology_update_needed to 1.

> +
> + if (topology_update_needed) {

And then we check it here?
Why is this logic not *in* start_topology_update() ?

> + bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> + nr_cpumask_bits);
> + numa_update_cpu_topology(false);
> + }
> +}

I'm not really clear on what check_topology_update() is responsible for
doing vs topology_update_init(). Why do we need both?

> @@ -1608,10 +1618,6 @@ static int topology_update_init(void)
>   return -ENOMEM;
>  
>   topology_inited = 1;

What does that mean? Didn't we already do the init earlier?

> - if (topology_update_needed)
> - bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> - nr_cpumask_bits);
> -
>   return 0;
>  }
>  device_initcall(topology_update_init);


cheers


Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-10 Thread Michael Ellerman
Michal Suchánek  writes:
> On Wed, 8 Aug 2018 21:07:11 +0530
> "Aneesh Kumar K.V"  wrote:
>> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> > Mahesh J Salgaonkar  writes:  
>> >> From: Mahesh Salgaonkar 
>> >>
>> >> Introduce recovery action for recovered memory errors (MCEs).
>> >> There are soft memory errors like SLB Multihit, which can be a
>> >> result of a bad hardware OR software BUG. Kernel can easily
>> >> recover from these soft errors by flushing SLB contents. After the
>> >> recovery kernel can still continue to function without any issue.
>> >> But in some scenarios we may keep getting these soft errors until
>> >> the root cause is fixed. To be able to analyze and find the root
>> >> cause, best way is to gather enough data and system state at the
>> >> time of MCE. Hence this patch introduces a sysctl knob where user
>> >> can decide either to continue after recovery or panic the kernel
>> >> to capture the dump.  
>> > 
>> > I'm not convinced we want this.
>> > 
>> > As we've discovered it's often not possible to reconstruct what
>> > happened based on a dump anyway.
>> > 
>> > The key thing you need is the content of the SLB and that's not
>> > included in a dump.
>> > 
>> > So I think we should dump the SLB content when we get the MCE (which
>> > this series does) and any other useful info, and then if we can
>> > recover we should.
>> 
>> The reasoning there is what if we got multi-hit due to some
>> corruption in slb_cache_ptr. ie. some part of kernel is wrongly
>> updating the paca data structure due to wrong pointer. Now that is
>> far fetched, but then possible, right? Hence the idea that, if we
>> don't have much insight into why a slb multi-hit occur from the dmesg
>> which include slb content, slb_cache contents etc, there should be an
>> easy way to force a dump that might assist in further debug.
>
> Nonetheless this turns all MCEs into crashes. Are there any MCEs that
> could happen during normal operation and should be handled by default?

An MCE should always be an indication of an abnormal condition, but
the exact set of things that are reported as MCEs is CPU specific, and
potentially even configurable at the hardware level.

However we only "handle" certain types of MCEs, so if we get an MCE for
something we don't understand then we'll panic already.

SLB multi-hit / parity error is one that we do handle (on bare metal),
because there is a well defined recovery action.

cheers
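
For illustration, a hedged sketch of the knob being debated (names
hypothetical, not the patch's actual code): recovery still runs, and
the sysctl only decides whether to panic afterwards to capture a dump:

static int mce_recovery_action;	/* hypothetical sysctl: 0 = recover, 1 = panic */

static void handle_recovered_slb_mce(void)
{
	flush_and_reload_slb();	/* the well-defined recovery action */

	if (mce_recovery_action)
		panic("MCE: forcing crash dump for root-cause analysis");
}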


Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/10/2018 04:02 PM, Mahesh Jagannath Salgaonkar wrote:
> On 08/09/2018 06:35 AM, Michael Ellerman wrote:
>> Mahesh J Salgaonkar  writes:
>>
>>> diff --git a/arch/powerpc/include/asm/paca.h 
>>> b/arch/powerpc/include/asm/paca.h
>>> index 7f22929ce915..233d25ff6f64 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -254,6 +254,10 @@ struct paca_struct {
>>>  #endif
>>>  #ifdef CONFIG_PPC_PSERIES
>>> u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>>> +
>>> +   /* Capture SLB related old contents in MCE handler. */
>>> +   struct slb_entry *mce_faulty_slbs;
>>> +   u16 slb_save_cache_ptr;
>>>  #endif /* CONFIG_PPC_PSERIES */
>>
>>  ^
> 
> I will pull that out of CONFIG_PPC_PSERIES.

I mean will pull 'mce_faulty_slbs' and 'slb_save_cache_ptr' and put it
under CONFIG_PPC_BOOK3S_64.

-Mahesh.
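
A sketch of that relocation in paca.h, assuming the fields simply move
under the wider guard:

#ifdef CONFIG_PPC_BOOK3S_64
	/* Capture SLB related old contents in MCE handler. */
	struct slb_entry *mce_faulty_slbs;
	u16 slb_save_cache_ptr;
#endif /* CONFIG_PPC_BOOK3S_64 */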

> 
>>
>>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>>> index e89f675f1b5e..16a53689ffd4 100644
>>> --- a/arch/powerpc/mm/slb.c
>>> +++ b/arch/powerpc/mm/slb.c
>>> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>>> get_paca()->slb_cache_ptr = 0;
>>>  }
>>>  
>>> +void slb_save_contents(struct slb_entry *slb_ptr)
>>> +{
>>> +   int i;
>>> +   unsigned long e, v;
>>> +
>>> +   /* Save slb_cache_ptr value. */
>>> +   get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
>>
>> This isn't inside CONFIG_PPC_PSERIES which breaks lots of configs, eg
>> powernv.
>>
>>   arch/powerpc/mm/slb.c:160:12: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>   arch/powerpc/mm/slb.c:218:27: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>   arch/powerpc/mm/slb.c:216:49: error: 'struct paca_struct' has no member 
>> named 'slb_save_cache_ptr'
>>
>> http://kisskb.ozlabs.ibm.com/kisskb/head/219f20e490add009194d94fdeb480da2e385f1c6/
>>
>> cheers
>>
> 
> Ouch.. my bad. Will fix it.
> 
> Thanks,
> -Mahesh.
> 



Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/09/2018 06:35 AM, Michael Ellerman wrote:
> Mahesh J Salgaonkar  writes:
> 
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 7f22929ce915..233d25ff6f64 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -254,6 +254,10 @@ struct paca_struct {
>>  #endif
>>  #ifdef CONFIG_PPC_PSERIES
>>  u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +
>> +/* Capture SLB related old contents in MCE handler. */
>> +struct slb_entry *mce_faulty_slbs;
>> +u16 slb_save_cache_ptr;
>>  #endif /* CONFIG_PPC_PSERIES */
> 
>  ^

I will pull that out of CONFIG_PPC_PSERIES.

> 
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index e89f675f1b5e..16a53689ffd4 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_save_contents(struct slb_entry *slb_ptr)
>> +{
>> +int i;
>> +unsigned long e, v;
>> +
>> +/* Save slb_cache_ptr value. */
>> +get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
> 
> This isn't inside CONFIG_PPC_PSERIES which breaks lots of configs, eg
> powernv.
> 
>   arch/powerpc/mm/slb.c:160:12: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
>   arch/powerpc/mm/slb.c:218:27: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
>   arch/powerpc/mm/slb.c:216:49: error: 'struct paca_struct' has no member 
> named 'slb_save_cache_ptr'
> 
> http://kisskb.ozlabs.ibm.com/kisskb/head/219f20e490add009194d94fdeb480da2e385f1c6/
> 
> cheers
> 

Ouch.. my bad. Will fix it.

Thanks,
-Mahesh.



Re: [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/08/2018 02:34 PM, Nicholas Piggin wrote:
> On Tue, 07 Aug 2018 19:47:14 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today the system crashes if we get a machine check
>> exception due to SLB errors. These are soft errors and can be fixed by
>> flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Michal Suchanek 
>> ---
>>
>> Changes in V7:
>> - Fold Michal's patch into this patch.
>> - Handle MSR_RI=0 and evil context case in MC handler.
>> ---
> 
> 
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index cb796724a6fc..e89f675f1b5e 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -145,6 +145,12 @@ void slb_flush_and_rebolt(void)
>>  get_paca()->slb_cache_ptr = 0;
>>  }
>>  
>> +void slb_flush_and_rebolt_realmode(void)
>> +{
>> +__slb_flush_and_rebolt();
>> +get_paca()->slb_cache_ptr = 0;
>> +}
>> +
>>  void slb_vmalloc_update(void)
>>  {
>>  unsigned long vflags;
> 
> Can you use this patch for the SLB flush?
> 
> https://patchwork.ozlabs.org/patch/953034/

Will use your v2.

Thanks,
-Mahesh.

> 
> Thanks,
> Nick
> 



Re: [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/07/2018 10:24 PM, Michal Suchánek wrote:
> Hello,
> 
> 
> On Tue, 07 Aug 2018 19:47:14 +0530
> "Mahesh J Salgaonkar"  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> On pseries, as of today the system crashes if we get a machine check
>> exception due to SLB errors. These are soft errors and can be fixed
>> by flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Signed-off-by: Michal Suchanek 
>> ---
>>
>> Changes in V7:
>> - Fold Michal's patch into this patch.
>> - Handle MSR_RI=0 and evil context case in MC handler.
>> ---
>>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 
>>  arch/powerpc/include/asm/machdep.h            |    1 
>>  arch/powerpc/kernel/exceptions-64s.S          |  112 +
>>  arch/powerpc/kernel/mce.c                     |   15 +++
>>  arch/powerpc/mm/slb.c                         |    6 +
>>  arch/powerpc/platforms/powernv/setup.c        |   11 ++
>>  arch/powerpc/platforms/pseries/pseries.h      |    1 
>>  arch/powerpc/platforms/pseries/ras.c          |   51 +++
>>  arch/powerpc/platforms/pseries/setup.c        |    1 
>>  9 files changed, 195 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> index 50ed64fba4ae..cc00a7088cf3 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> @@ -487,6 +487,7 @@ extern void hpte_init_native(void);
>>  
>>  extern void slb_initialize(void);
>>  extern void slb_flush_and_rebolt(void);
>> +extern void slb_flush_and_rebolt_realmode(void);
>>  
>>  extern void slb_vmalloc_update(void);
>>  extern void slb_set_size(u16 size);
>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>> index a47de82fb8e2..b4831f1338db 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -108,6 +108,7 @@ struct machdep_calls {
>>  
>>  	/* Early exception handlers called in realmode */
>>  	int		(*hmi_exception_early)(struct pt_regs *regs);
>> +	long		(*machine_check_early)(struct pt_regs *regs);
>>  
>>  	/* Called during machine check exception to retrive fixup address. */
>>  	bool		(*mce_check_early_recovery)(struct pt_regs *regs);
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 285c6465324a..cb06f219570a 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
>>  machine_check_fwnmi:
>>  	SET_SCRATCH0(r13)	/* save r13 */
>>  	EXCEPTION_PROLOG_0(PACA_EXMC)
>> +BEGIN_FTR_SECTION
>> +	b	machine_check_pSeries_early
>> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>>  machine_check_pSeries_0:
>>  	EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>>  	/*
>> @@ -343,6 +346,90 @@ machine_check_pSeries_0:
>>  
>>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>>  
>> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> +BEGIN_FTR_SECTION
>> +	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> +	mr	r10,r1			/* Save r1 */
>> +	ld	r1,PACAMCEMERGSP(r13)	/* Use MC emergency stack */
>> +	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame */
>> +	mfspr	r11,SPRN_SRR0		/* Save SRR0 */
>> +	mfspr	r12,SPRN_SRR1		/* Save SRR1 */
>> +	EXCEPTION_PROLOG_COMMON_1()
>> +	EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> +	EXCEPTION_PROLOG_COMMON_3(0x200)
>> +	addi	r3,r1,STACK_FRAME_OVERHEAD
>> +	BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
>> +	ld	r12,_MSR(r1)
>> +	andi.	r11,r12,MSR_PR		/* See if coming from user. */
>> +	bne	2f			/* continue in V mode if we are. */
>> +
>> +	/*
>> +	 * At this point we are not sure about what context we come from.
>> +	 * We may be in the middle of switching stack. r1 may not be valid.
>> +	 * Hence stay on emergency stack, call machine_check_exception and
>> +	 * return from the interrupt.
>> +	 * But before that, check if this is an un-recoverable exception.
>> +	 * If yes, then stay on emergency stack and panic.
>> +	 */
>> +	andi.	r11,r12,MSR_RI
>> +	bne	1f
>> +
>> +	/*
>> +	 * Check if we have successfully handled/recovered from error, if not
>> +	 * then stay on emergency stack and panic.
>> +	 */
>> +	cmpdi	r3,0		/* see if we handled MCE successfully */
>> +	bne	1f		/* if handled then return from interrupt */
>> +
Re: [PATCH v7 4/9] powerpc/pseries: Define MCE error event section.

2018-08-10 Thread Mahesh Jagannath Salgaonkar
On 08/08/2018 08:12 PM, Michael Ellerman wrote:
> Hi Mahesh,
> 
> A few nitpicks.
> 
> Mahesh J Salgaonkar  writes:
>> From: Mahesh Salgaonkar 
>>
>> On pseries, the machine check error details are part of RTAS extended
>> event log passed under Machine check exception section. This patch adds
>> the definition of rtas MCE event section and related helper
>> functions.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/rtas.h |  111 
>> +++
>>  1 file changed, 111 insertions(+)
> 
> AFAICS none of this ever gets used outside of ras.c, so it should
> just go in there.

Since it was all rtas specific I thought rtas.h was a better place. But
yes, I can move this into ras.c.

> 
>> diff --git a/arch/powerpc/include/asm/rtas.h 
>> b/arch/powerpc/include/asm/rtas.h
>> index 71e393c46a49..adc677c5e3a4 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
>>  #define PSERIES_HP_ELOG_ID_DRC_COUNT3
>>  #define PSERIES_HP_ELOG_ID_DRC_IC   4
>>  
>> +/* RTAS pseries MCE errorlog section */
>> +#pragma pack(push, 1)
>> +struct pseries_mc_errorlog {
>> +__be32  fru_id;
>> +__be32  proc_id;
>> +uint8_t error_type;
> 
> Please use kernel types, so u8.

Will do so.

> 
>> +union {
>> +struct {
>> +uint8_t ue_err_type;
>> +/* 
>> + * X		1: Permanent or Transient UE.
>> + *  X		1: Effective address provided.
>> + *   X		1: Logical address provided.
>> + *    XX	2: Reserved.
>> + *      XXX	3: Type of UE error.
>> + */
> 
> But which bit is bit 0? And is that the LSB or MSB?

RTAS errorlog data is in BE format; the leftmost bit is bit 0, the MSB
(1: Permanent or Transient UE.). I will update the comment above so that
it properly points out which bit is bit 0.

> 
> 
>> +uint8_t reserved_1[6];
>> +__be64  effective_address;
>> +__be64  logical_address;
>> +} ue_error;
>> +struct {
>> +uint8_t soft_err_type;
>> +/* 
>> + * X		1: Effective address provided.
>> + *  X		5: Reserved.
>> + *   XX	2: Type of SLB/ERAT/TLB error.
>> + */
>> +uint8_t reserved_1[6];
>> +__be64  effective_address;
>> +uint8_t reserved_2[8];
>> +} soft_error;
>> +} u;
>> +};
>> +#pragma pack(pop)
> 
> Why not __packed ?

Because when __packed was used it added 1 byte of extra padding between
reserved_1[6] and effective_address. That caused the wrong effective
address to be printed on the console. Hence I switched to #pragma pack
to force 1-byte alignment for this structure alone.
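
A toy user-space program (simplified, not the RTAS struct itself)
showing the padding problem being described:

#include <stdio.h>
#include <stdint.h>

struct natural {
	uint8_t type;
	uint8_t reserved[6];	/* ends at offset 7 */
	uint64_t addr;		/* compiler pads it up to offset 8 */
};

#pragma pack(push, 1)
struct packed1 {
	uint8_t type;
	uint8_t reserved[6];
	uint64_t addr;		/* offset 7, matching the wire format */
};
#pragma pack(pop)

int main(void)
{
	/* prints 16 vs 15: the gap would shift effective_address */
	printf("natural=%zu packed=%zu\n",
	       sizeof(struct natural), sizeof(struct packed1));
	return 0;
}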

> 
>> +/* RTAS pseries MCE error types */
>> +#define PSERIES_MC_ERROR_TYPE_UE0x00
>> +#define PSERIES_MC_ERROR_TYPE_SLB   0x01
>> +#define PSERIES_MC_ERROR_TYPE_ERAT  0x02
>> +#define PSERIES_MC_ERROR_TYPE_TLB   0x04
>> +#define PSERIES_MC_ERROR_TYPE_D_CACHE   0x05
>> +#define PSERIES_MC_ERROR_TYPE_I_CACHE   0x07
> 
> Once these are in ras.c they can have less unwieldy names, ie. the
> PSERIES at least can be dropped.

ok.

> 
>> +/* RTAS pseries MCE error sub types */
>> +#define PSERIES_MC_ERROR_UE_INDETERMINATE   0
>> +#define PSERIES_MC_ERROR_UE_IFETCH  1
>> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH  2
>> +#define PSERIES_MC_ERROR_UE_LOAD_STORE  3
>> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE  4
>> +
>> +#define PSERIES_MC_ERROR_SLB_PARITY 0
>> +#define PSERIES_MC_ERROR_SLB_MULTIHIT   1
>> +#define PSERIES_MC_ERROR_SLB_INDETERMINATE  2
>> +
>> +#define PSERIES_MC_ERROR_ERAT_PARITY1
>> +#define PSERIES_MC_ERROR_ERAT_MULTIHIT  2
>> +#define PSERIES_MC_ERROR_ERAT_INDETERMINATE 3
>> +
>> +#define PSERIES_MC_ERROR_TLB_PARITY 1
>> +#define PSERIES_MC_ERROR_TLB_MULTIHIT   2
>> +#define PSERIES_MC_ERROR_TLB_INDETERMINATE  3
>> +
>> +static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog *mlog)
>> +{
>> +	return mlog->error_type;
>> +}
> 
> Why not just access it directly?

sure.

> 
>> +static inline uint8_t rtas_mc_error_sub_type(
>> +const struct pseries_mc_errorlog *mlog)
>> +{
>> +switch (mlog->error_type) {
>> +case PSERIES_MC_ERROR_TYPE_UE:
>> +return (mlog->u.ue_error.ue_err_type & 0x07);
>> +case PSERIES_MC_ERROR_TYPE_SLB:
>> +case PSERIES_MC_ERROR_TYPE_ERAT:
>> +case PSERIES_MC_ERROR_TYPE_TLB:
>> +

Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()

2018-08-10 Thread Christophe LEROY




On 03/08/2018 at 13:31, Murilo Opsfelder Araujo wrote:

Hi, everyone.

I'd like to thank all of you who contributed to refining and making this
series better. I really appreciated it.

Thank you!


You are welcome.

It seems that nested segfaults don't print very well. The code line is
split in two, the second half ending up alone on its own line. Any way
to avoid that?


[4.365317] init[1]: segfault (11) at 0 nip 0 lr 0 code 1
[4.370452] init[1]: code:    
[4.372042] init[74]: segfault (11) at 10a74 nip 1000c198 lr 100078c8 
code 1 in sh[1000+14000]

[4.386829]    
[4.391542] init[1]: code:     
   
[4.400863] init[74]: code: 90010024 bf61000c 91490a7c 3fa01002 
3be0 7d3e4b78 3bbd0c20 3b60
[4.409867] init[74]: code: 3b9d0040 7c7fe02e 2f83 419e0028 
<8923> 2f89 41be001c 4b7f6e79

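
One way to avoid the split would be to format each "code:" line into a
local buffer and emit it with a single printk(), so a concurrent task's
output cannot land in the middle of a line. A hedged standalone sketch
(hypothetical helper; snprintf/puts standing in for the kernel's printk;
instruction words loosely based on, and zero-padded from, the log above):

#include <stdint.h>
#include <stdio.h>

static void show_code_line(const char *comm, int pid,
                           const uint32_t *instrs, int n)
{
        char buf[128];
        int pos = 0, i;

        pos += snprintf(buf + pos, sizeof(buf) - pos, "%s[%d]: code:", comm, pid);
        for (i = 0; i < n && pos < (int)sizeof(buf); i++)
                pos += snprintf(buf + pos, sizeof(buf) - pos, " %08x", instrs[i]);
        puts(buf);      /* one write per line; printk("%s\n", buf) in-kernel */
}

int main(void)
{
        const uint32_t code[8] = {
                0x90010024, 0xbf61000c, 0x91490a7c, 0x3fa01002,
                0x3be00000, 0x7d3e4b78, 0x3bbd0c20, 0x3b600000,
        };

        show_code_line("sh", 74, code, 8);
        return 0;
}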

Christophe



Cheers
Murilo



Re: [PATCH] powerpc/powernv: Add support for NPU2 relaxed-ordering mode

2018-08-10 Thread Alistair Popple
On Friday, 10 August 2018 5:07:56 PM AEST Michael Ellerman wrote:
> Reza Arbab  writes:
> > From: Alistair Popple 
> >
> > Some device drivers support out of order access to GPU memory. This does
> > not affect the CPU view of memory but it does affect the GPU view, so it
> > should only be enabled once the GPU driver has requested it. Add APIs
> > allowing a driver to do so.
> 
> Do we have any indication which drivers we are expecting to use this?

I believe there is an internal driver that requires these, but I'm not
sure whether there are any plans to make it available upstream.

> I'd prefer not to merge a new API unless we at least have some idea
> what's going to use it.

Fair enough. Unfortunately I am not aware of any user's future plans for
this API.

- Alistair

> cheers
> 




Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-10 Thread Nicholas Piggin
On Thu, 9 Aug 2018 12:26:46 +0200
Michal Suchánek  wrote:

> On Thu, 9 Aug 2018 18:33:33 +1000
> Nicholas Piggin  wrote:
> 
> > On Thu, 9 Aug 2018 13:39:45 +0530
> > Ananth N Mavinakayanahalli  wrote:
> >   
> > > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > > Michael Ellerman  wrote:
> > > >   
> > > > > "Aneesh Kumar K.V"  writes:  
> > > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > > > > >> Mahesh J Salgaonkar  writes:
> > > > > >>> From: Mahesh Salgaonkar 
> > > > > >>>
> > > > > >>> Introduce a recovery action for recovered memory errors
> > > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > > >>> which can be a result of bad hardware OR a software BUG.
> > > > > >>> The kernel can easily recover from these soft errors by
> > > > > >>> flushing the SLB contents. After the recovery the kernel
> > > > > >>> can still continue to function without any issue. But in
> > > > > >>> some scenarios we may keep getting these soft errors until
> > > > > >>> the root cause is fixed. To be able to analyze and find the
> > > > > >>> root cause, the best way is to gather enough data and the
> > > > > >>> system state at the time of the MCE. Hence this patch
> > > > > >>> introduces a sysctl knob where the user can decide either
> > > > > >>> to continue after recovery or panic the kernel to capture
> > > > > >>> the dump.
> > > > > >> 
> > > > > >> I'm not convinced we want this.
> > > > > >> 
> > > > > >> As we've discovered it's often not possible to reconstruct
> > > > > >> what happened based on a dump anyway.
> > > > > >> 
> > > > > >> The key thing you need is the content of the SLB and that's
> > > > > >> not included in a dump.
> > > > > >> 
> > > > > >> So I think we should dump the SLB content when we get the
> > > > > >> MCE (which this series does) and any other useful info, and
> > > > > >> then if we can recover we should.
> > > > > >
> > > > > > The reasoning there is: what if we got a multi-hit due to
> > > > > > some corruption in slb_cache_ptr, i.e. some part of the
> > > > > > kernel wrongly updating the paca data structure through a
> > > > > > bad pointer. Now that is far fetched, but still possible,
> > > > > > right? Hence the idea that, if we don't have much insight
> > > > > > into why an SLB multi-hit occurred from the dmesg (which
> > > > > > includes the SLB contents, slb_cache contents, etc.), there
> > > > > > should be an easy way to force a dump that might assist in
> > > > > > further debugging.
> > > > > 
> > > > > If you're debugging something complex that you can't determine
> > > > > from the SLB dump then you should be running a debug kernel
> > > > > anyway. And if anything you want to drop into xmon and sit
> > > > > there, preserving the most state, rather than taking a dump.  
> > > > 
> > > > I'm not saying for a dump specifically, just some form of crash.
> > > > And we really should have an option to xmon on panic, but that's
> > > > another story.  
> > > 
> > > That's fine during development or in a lab, not something we could
> > > enforce in a customer environment, could we?
> > 
> > xmon on panic? Not something to enforce, but IMO (without thinking
> > about it too much, having encountered it several times) it should
> > probably be tied to the xmon-on-BUG option.
> 
> You should get that with this patch and xmon=on or am I missing
> something?

Oh yeah, I just got a bit sidetracked and added something not very
relevant -- a panic() call should drop to xmon if we have xmon=on. It
doesn't today (or didn't last I looked), but that has nothing to do with
this patch.

Thanks,
Nick


Re: [PATCH v2 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-10 Thread Nicholas Piggin
On Fri, 10 Aug 2018 16:42:49 +1000
Nicholas Piggin  wrote:

> Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> specific HV idle code to the powernv platform code.
> 
> Book3S assembly stubs are kept in common code and used only to save
> the stack frame and non-volatile GPRs before executing architected
> idle instructions, and restoring the stack and reloading GPRs then
> returning to C after waking from idle.
> 
> The complex logic dealing with threads and subcores, locking, SPRs,
> HMIs, timebase resync, etc., is all done in C which makes it more
> maintainable.
> 
> This is not a strict translation to C code, there are some
> significant differences:
> 
> - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
>   but saves and restores them itself.
> 
> - The optimisation where EC=ESL=0 idle modes did not have to save GPRs
>   or change MSR is restored, because it's now simple to do. ESL=1
>   sleeps that do not lose GPRs can use this optimization too.
> 
> - KVM secondary entry and cede is now more of a call/return style
>   rather than branchy. nap_state_lost is not required because KVM
>   always returns via NVGPR restoring path.
> 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> 
> Left to do:
> - KVM could use more review, it's pretty tricky. Not sure if what
>   I'm doing with the emergency stack is kosher. But it runs pretty fine
>   here with a POWER9 SMP+SMT guest. Possible to streamline

That should be POWER8. POWER9 hash on radix with dependent threads
mode also seems to work okay.

Thanks,
Nick


Re: [PATCH] powerpc/powernv/idle: Fix build error

2018-08-10 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> Fix the build error below by using strlcpy instead of strncpy.
>
> In function 'pnv_parse_cpuidle_dt',
> inlined from 'pnv_init_idle_states' at 
> arch/powerpc/platforms/powernv/idle.c:840:7,
> inlined from '__machine_initcall_powernv_pnv_init_idle_states' at 
> arch/powerpc/platforms/powernv/idle.c:870:1:
> arch/powerpc/platforms/powernv/idle.c:820:3: error: 'strncpy' specified bound 
> 16 equals destination size [-Werror=stringop-truncation]
>strncpy(pnv_idle_states[i].name, temp_string[i],
>^~~~
> PNV_IDLE_NAME_LEN);

I'm curious why I haven't seen this? What compiler are you using?

cheers

> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index ecb002c5db83..35f699ebb662 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -817,7 +817,7 @@ static int pnv_parse_cpuidle_dt(void)
>   goto out;
>   }
>   for (i = 0; i < nr_idle_states; i++)
> - strncpy(pnv_idle_states[i].name, temp_string[i],
> + strlcpy(pnv_idle_states[i].name, temp_string[i],
>   PNV_IDLE_NAME_LEN);
>   nr_pnv_idle_states = nr_idle_states;
>   rc = 0;
> -- 
> 2.17.1
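
To illustrate the difference (standalone sketch; strlcpy is a kernel/BSD
API, not in glibc, so a local stand-in is defined here): strncpy() with a
bound equal to the destination size can leave the buffer unterminated,
which is exactly what GCC 8's -Wstringop-truncation flags, whereas
strlcpy() always NUL-terminates.

#include <stdio.h>
#include <string.h>

#define PNV_IDLE_NAME_LEN 16

/* Local stand-in for the kernel's strlcpy(): copies with truncation and
 * always NUL-terminates, returning strlen(src). */
static size_t demo_strlcpy(char *dst, const char *src, size_t size)
{
        size_t len = strlen(src);

        if (size) {
                size_t n = len >= size ? size - 1 : len;

                memcpy(dst, src, n);
                dst[n] = '\0';
        }
        return len;
}

int main(void)
{
        char name[PNV_IDLE_NAME_LEN];
        const char *src = "a-made-up-idle-state-name";  /* > 15 chars */

        /* strncpy() copies exactly 16 bytes here and stores no NUL, hence
         * the -Wstringop-truncation warning; a manual fixup is needed. */
        strncpy(name, src, PNV_IDLE_NAME_LEN);
        name[PNV_IDLE_NAME_LEN - 1] = '\0';

        /* strlcpy() truncates safely and always terminates. */
        demo_strlcpy(name, src, sizeof(name));
        printf("%s\n", name);
        return 0;
}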


Re: [PATCH] powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range

2018-08-10 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> This patch makes sure we update the mmu_gather page size even if we are
> requesting a fullmm flush. This avoids triggering the VM_WARN_ON in code
> paths like __tlb_remove_page_size() that explicitly check that the page
> size of the range being removed matches the mmu_gather page size.

I take it this is a fix for 5a6099346c41 ("powerpc/64s/radix: tlb do not
flush on page size when fullmm")?

cheers

> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index 97ecef697e1b..f0e571b2dc7c 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -49,13 +49,11 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
>  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>unsigned int page_size)
>  {
> - if (tlb->fullmm)
> - return;
> -
>   if (!tlb->page_size)
>   tlb->page_size = page_size;
>   else if (tlb->page_size != page_size) {
> - tlb_flush_mmu(tlb);
> + if (!tlb->fullmm)
> + tlb_flush_mmu(tlb);
>   /*
>* update the page size after flush for the new
>* mmu_gather.
> -- 
> 2.17.1
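
A simplified standalone model of the control-flow change (hedged sketch;
the field and function names only mirror the kernel's, this is not kernel
code): the intermediate flush is skipped for fullmm, but the tracked page
size is now updated unconditionally.

#include <stdbool.h>

struct mmu_gather_model {
        unsigned int page_size;
        bool fullmm;
};

static void tlb_flush_mmu_model(struct mmu_gather_model *tlb)
{
        /* stand-in for the real tlb_flush_mmu() */
        (void)tlb;
}

static void check_page_size_change(struct mmu_gather_model *tlb,
                                   unsigned int page_size)
{
        if (!tlb->page_size) {
                tlb->page_size = page_size;
        } else if (tlb->page_size != page_size) {
                /* only the flush is conditional on fullmm... */
                if (!tlb->fullmm)
                        tlb_flush_mmu_model(tlb);
                /* ...the page size is updated either way, so later
                 * __tlb_remove_page_size()-style checks stay consistent */
                tlb->page_size = page_size;
        }
}

int main(void)
{
        struct mmu_gather_model tlb = { .page_size = 0, .fullmm = true };

        check_page_size_change(&tlb, 4096);
        check_page_size_change(&tlb, 2097152);  /* no flush, size updated */
        return 0;
}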


Re: [PATCH] powerpc/powernv: Add support for NPU2 relaxed-ordering mode

2018-08-10 Thread Michael Ellerman
Reza Arbab  writes:
> From: Alistair Popple 
>
> Some device drivers support out of order access to GPU memory. This does
> not affect the CPU view of memory but it does affect the GPU view, so it
> should only be enabled once the GPU driver has requested it. Add APIs
> allowing a driver to do so.

Do we have any indication which drivers we are expecting to use this?

I'd prefer not to merge a new API unless we at least have some idea
what's going to use it.

cheers


[PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-10 Thread Srikar Dronamraju
On a shared lpar, Phyp will not update the cpu associativity at boot
time. Just after boot, the system recognizes itself as a shared lpar and
triggers a request for the correct cpu associativity. But by then the
scheduler has already created/destroyed its sched domains.

This causes
- Broken load balancing across nodes, causing islands of cores.
- Performance degradation, especially if the system is lightly loaded.
- dmesg to wrongly report all cpus to be in Node 0.
- Messages in dmesg reporting a "borken" topology.
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
  node sched domain"), rcu stalls at boot up.

From a scheduler maintainer's perspective, moving cpus from one node to
another or creating more numa levels after boot is not appropriate
without some notification to user space.
https://lore.kernel.org/lkml/20150406214558.ga38...@linux.vnet.ibm.com/T/#u

The sched_domains_numa_masks table, which is used to generate cpumasks,
is only created at boot time just before creating sched domains and is
never updated. Hence, it's better to get the topology correct before the
sched domains are created.

For example, on a 64-core POWER8 shared lpar, dmesg reports

[2.088360] Brought up 512 CPUs
[2.088368] Node 0 CPUs: 0-511
[2.088371] Node 1 CPUs:
[2.088373] Node 2 CPUs:
[2.088375] Node 3 CPUs:
[2.088376] Node 4 CPUs:
[2.088378] Node 5 CPUs:
[2.088380] Node 6 CPUs:
[2.088382] Node 7 CPUs:
[2.088386] Node 8 CPUs:
[2.088388] Node 9 CPUs:
[2.088390] Node 10 CPUs:
[2.088392] Node 11 CPUs:
...
[3.916091] BUG: arch topology borken
[3.916103]  the DIE domain not a subset of the NUMA domain
[3.916105] BUG: arch topology borken
[3.916106]  the DIE domain not a subset of the NUMA domain
...

numactl/lscpu output will still be correct with cores spreading across
all nodes.

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Currently on this lpar, the scheduler detects 2 levels of NUMA and
creates NUMA sched domains for all cpus, but it finds a single DIE
domain consisting of all cpus. Hence it deletes all NUMA sched domains.

To address this, split the topology update init, such that the first
part detects vphn/prrn soon after cpus are set up, and the second part
force-updates the topology just before the scheduler creates sched
domains (see the sketch after this paragraph).
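
A hedged sketch of that split with hypothetical names (the stubs stand in
for the real firmware queries; this is not the patch itself):

#include <stdbool.h>

static bool topology_update_needed;

/* Stubs standing in for the real platform queries. */
static bool firmware_reports_shared_lpar(void) { return true; }
static void vphn_refresh_cpu_associativity(void) { }

/* Part 1: runs soon after CPUs are set up. */
static void early_topology_detect(void)
{
        if (firmware_reports_shared_lpar())
                topology_update_needed = true;
}

/* Part 2: runs just before the scheduler creates sched domains, so the
 * corrected associativity is visible when sched_domains_numa_masks is
 * generated. */
static void topology_force_update(void)
{
        if (topology_update_needed)
                vphn_refresh_cpu_associativity();
}

int main(void)
{
        early_topology_detect();
        topology_force_update();
        return 0;
}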

With the fix, dmesg reports

[0.491336] numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 
368-375 464-471
[0.491351] numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 
376-383 472-479
[0.491359] numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 
384-391 480-487
[0.491366] numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 
392-399 488-495
[0.491374] numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
[0.491379] numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
[0.491384] numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
[0.491389] numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
[0.491394] numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
[0.491399] numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
[0.491404] numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
[0.491409] numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu would also report

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 

[PATCH v2 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-10 Thread Nicholas Piggin
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
specific HV idle code to the powernv platform code.

Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.

The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.

This is not a strict translation to C code, there are some
significant differences:

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them itself.

- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
  or change MSR is restored, because it's now simple to do. ESL=1
  sleeps that do not lose GPRs can use this optimization too (a sketch
  of the resulting dispatch follows the changelog notes below).

- KVM secondary entry and cede is now more of a call/return style
  rather than branchy. nap_state_lost is not required because KVM
  always returns via NVGPR restoring path.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 

Left to do:
- KVM could use more review, it's pretty tricky. Not sure if what
  I'm doing with the emergency stack is kosher. But it runs pretty fine
  here with a POWER9 SMP+SMT guest. Possible to streamline
  KVM cede code now that idle function saves nv gprs for us?

Open question:
- Why does P9 restore some of the PMU SPRs (but not others), and
  P8 only zeroes them?

Since RFC v1:
- Now tested and working with POWER9 hash and radix.
- KVM support added. This took a bit of work to untangle and might
  still have some issues, but POWER9 seems to work including hash on
  radix with dependent threads mode.
- This snowballed a bit because of KVM and other details making it
  not feasible to leave POWER7/8 code alone. That's only half done
  at the moment.
- So far this trades about 800 lines of asm for 500 of C. With POWER7/8
  support done it might be another hundred or so lines of C.

Since RFC v2:
- Fixed deep state SLB reloading
- Now tested and working with POWER8.
- Accounted for most feedback.

Since RFC v3:
- Rebased to powerpc merge + idle state bugfix
- Split SLB flush/restore code out and shared it with the MCE code
  (pseries MCE patches can also use it).
- More testing on POWER8 including KVM with secondaries.
- Performance testing looks good. EC=ESL=0 is about 5% faster, other
  stop states look a little faster too.
- Adjusted SPR saving to handle POWER7; haven't tested it.

Since submission v1:
- More review comments from Gautham.
- Rename isa3_ to isa300_ prefix.
- Tinkered with some comments, copyright notice, changelog.
- Cede and regular idle do not go via KVM secondary wakeup code path,
  so hwthread_state stores and barriers can be simplified, and some
  KVM code paths simplified a little.
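
For readers following the EC=ESL=0 note above, a hedged sketch of the
dispatch this enables (names modelled on the patch but simplified; the
bit positions are illustrative only, and this is not the actual
implementation):

#include <stdint.h>

#define PSSCR_EC        (1ull << 20)    /* illustrative positions only */
#define PSSCR_ESL       (1ull << 21)

static uint64_t stop_noloss(uint64_t psscr)  { return psscr; }  /* stub */
static uint64_t stop_mayloss(uint64_t psscr) { return psscr; }  /* stub */

static uint64_t idle_stop(uint64_t psscr)
{
        /* EC=ESL=0: wakeup resumes at the instruction after stop with
         * GPRs intact, so no stack frame or NVGPR save is needed. */
        if (!(psscr & (PSSCR_EC | PSSCR_ESL)))
                return stop_noloss(psscr);

        /* Otherwise wakeup may come in at the system reset vector with
         * state lost, so NVGPRs are saved and the restore path is used. */
        return stop_mayloss(psscr);
}

int main(void)
{
        return (int)idle_stop(0);
}
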
---
 arch/powerpc/include/asm/cpuidle.h   |   19 +-
 arch/powerpc/include/asm/paca.h  |   38 +-
 arch/powerpc/include/asm/processor.h |9 +-
 arch/powerpc/include/asm/reg.h   |7 +-
 arch/powerpc/kernel/asm-offsets.c|   18 -
 arch/powerpc/kernel/dt_cpu_ftrs.c|   21 +-
 arch/powerpc/kernel/exceptions-64s.S |   17 +-
 arch/powerpc/kernel/idle_book3s.S| 1012 +++---
 arch/powerpc/kernel/setup-common.c   |4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   86 +-
 arch/powerpc/platforms/powernv/idle.c|  826 ++
 arch/powerpc/platforms/powernv/subcore.c |2 +-
 arch/powerpc/xmon/xmon.c |   24 +-
 13 files changed, 877 insertions(+), 1206 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..9844b3ded187 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -27,10 +27,11 @@
  * the THREAD_WINKLE_BITS are set, which indicate which threads have not
  * yet woken from the winkle state.
  */
-#define PNV_CORE_IDLE_LOCK_BIT			0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT		28
+#define PNV_CORE_IDLE_LOCK_BIT			(1ULL << NR_PNV_CORE_IDLE_LOCK_BIT)
 
+#define PNV_CORE_IDLE_WINKLE_COUNT_SHIFT	16
 #define PNV_CORE_IDLE_WINKLE_COUNT		0x00010000
-#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000
 #define PNV_CORE_IDLE_WINKLE_COUNT_BITS	0x000F0000
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT	8
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS	0x0000FF00
@@ -68,16 +69,6 @@
 #define ERR_DEEP_STATE_ESL_MISMATCH		-2
 
 #ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
-   u64 pid;
-   u64 ldbar;
-   u64 fscr;
-   u64 hfscr;
-   u64 mmcr1;
-   u64 mmcr2;
-   u64 mmcra;
-};
 
 #define PNV_IDLE_NAME_LEN	16
 struct pnv_idle_states_t {
@@ -92,10 +83,6 @@ struct pnv_idle_states_t {
 
 extern struct pnv_idle_states_t 

[PATCH v2 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-10 Thread Nicholas Piggin
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.

Signed-off-by: Nicholas Piggin 

Since v1:
- Restore the test for slb_shadow (mpe)
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
 arch/powerpc/kernel/mce_power.c   | 26 +
 arch/powerpc/mm/slb.c | 39 +++
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2f74bdc805e0..d4e398185b3a 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -497,6 +497,9 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_all_realmode(void);
+extern void __slb_restore_bolted_realmode(void);
+extern void slb_restore_bolted_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index d6756af6ec78..3497c8329c1d 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 #ifdef CONFIG_PPC_BOOK3S_64
 static void flush_and_reload_slb(void)
 {
-   struct slb_shadow *slb;
-   unsigned long i, n;
-
/* Invalidate all SLBs */
-   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+   slb_flush_all_realmode();
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
/*
@@ -76,22 +73,17 @@ static void flush_and_reload_slb(void)
if (get_paca()->kvm_hstate.in_guest)
return;
 #endif
-
-   /* For host kernel, reload the SLBs from shadow SLB buffer. */
-   slb = get_slb_shadow();
-   if (!slb)
+   if (early_radix_enabled())
return;
 
-   n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
-
-   /* Load up the SLB entries from shadow SLB */
-   for (i = 0; i < n; i++) {
-   unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
-   unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
+   /*
+* This probably shouldn't happen, but it may be possible it's
+* called in early boot before SLB shadows are allocated.
+*/
+   if (!get_slb_shadow())
+   return;
 
-   rb = (rb & ~0xFFFul) | i;
-   asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
-   }
+   slb_restore_bolted_realmode();
 }
 #endif
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..0b095fa54049 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,6 +90,45 @@ static inline void create_shadowed_slbe(unsigned long ea, int ssize,
 : "memory" );
 }
 
+/*
+ * Insert bolted entries into SLB (which may not be empty, so don't clear
+ * slb_cache_ptr).
+ */
+void __slb_restore_bolted_realmode(void)
+{
+   struct slb_shadow *p = get_slb_shadow();
+   enum slb_index index;
+
+/* No isync needed because realmode. */
+   for (index = 0; index < SLB_NUM_BOLTED; index++) {
+   asm volatile("slbmte  %0,%1" :
+: "r" (be64_to_cpu(p->save_area[index].vsid)),
+  "r" (be64_to_cpu(p->save_area[index].esid)));
+   }
+}
+
+/*
+ * Insert the bolted entries into an empty SLB.
+ * This is not the same as rebolt because the bolted segments are not
+ * changed, just loaded from the shadow area.
+ */
+void slb_restore_bolted_realmode(void)
+{
+   __slb_restore_bolted_realmode();
+   get_paca()->slb_cache_ptr = 0;
+}
+
+/*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+void slb_flush_all_realmode(void)
+{
+   /*
+* This flushes all SLB entries including 0, so it must be realmode.
+*/
+   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+}
+
 static void __slb_flush_and_rebolt(void)
 {
/* If you change this make sure you change SLB_NUM_BOLTED
-- 
2.17.0